Prompt Engineering

Being a great prompt engineer doesn’t require coding experience. Creativity and persistence will benefit you greatly on your journey, however. Read on to learn some useful prompting techniques.

  1. Clearly communicate what content or information is most important.
  2. Structure the prompt: Start by defining its role, give context/input data, then provide the instruction (see the sketch after this list).
  3. Use specific, varied examples to help the model narrow its focus and generate more accurate results.
  4. Use constraints to limit the scope of the model’s output. This can help avoid meandering away from the instructions into factual inaccuracies.
  5. Break down complex tasks into a sequence of simpler prompts.
  6. Instruct the model to evaluate or check its own responses before producing them. (“Make sure to limit your response to 3 sentences”, “Rate your work on a scale of 1-10 for conciseness”, “Do you think this is correct?”).
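
As an illustration of tip 2 above, here is a minimal sketch of a prompt assembled in that order; the role, context, and instruction wording is made up for the example:

```python
# Sketch of tip 2: role first, then context/input data, then the instruction.
# All of the wording below is illustrative.
role = "You are a customer support agent for a small software company."
context = "A user writes: 'The app crashes every time I open the settings page.'"
instruction = "Draft a polite reply that acknowledges the issue and asks for the app version."

prompt = f"{role}\n\n{context}\n\n{instruction}"
print(prompt)
```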

And perhaps most important:

Be creative! The more creative and open-minded you are, the better your results will be. LLMs and prompt engineering are still in their infancy, and evolving every day.

Direct prompting (also known as Zero-shot) is the simplest type of prompt. It provides no examples to the model, just the instruction. You can also phrase the instruction as a question, or give the model a “role,” as seen in the second example below.

Provide:

  1. Instruction
  2. Some context

Idea Generation:

Role Prompting:

Data Organization:

One-shot prompting shows the model one clear, descriptive example of what you’d like it to imitate.

Idea generation using one example:

Few- and multi-shot prompting shows the model more examples of what you want it to do. It works better than zero-shot for more complex tasks where pattern replication is wanted, or when you need the output to be structured in a specific way that is difficult to describe.

Few-shot sentiment classification:
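
The original example did not survive formatting; a sketch of such a prompt might look like this, with illustrative labeled examples followed by the input the model should classify:

```python
# Sketch of a few-shot sentiment classification prompt (example inputs are illustrative).
prompt = """\
I love this movie! // Positive
This was a waste of money. // Negative
What a fantastic meal! // Positive
It doesn't work. //"""
```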

When this prompt is run, the model’s response will be to classify ‘It doesn’t work’ as positive or negative, as shown in the examples.

Multi-shot emoji response predictor:

Same process here, but since the prompt is more complex, the model has been given more examples to emulate.

Chain of Thought (CoT) prompting encourages the LLM to explain its reasoning. Combine it with few-shot prompting to get better results on more complex tasks that require reasoning before a response.

Recalling the zero-shot prompting from earlier, this approach takes a zero-shot prompt and adds an instruction: “Let’s think step by step.” The LLM is able to generate a chain of thought from this instruction, and usually a more accurate answer as well. This is a great approach to getting LLMs to generate correct answers for things like word problems.

Learn to love the reality of rewriting prompts several (possibly dozens) of times. Here are a few ideas for refining prompts if you get stuck:

Note: These strategies may become less useful or necessary over time as models improve.

  1. Repeat key words, phrases, or ideas
  2. Specify your desired output format (CSV, JSON, etc.)
  3. Use all caps to stress important points or instructions. You can also try exaggerations or hyperbolic language; for example: “Your explanation should be absolutely impossible to misinterpret. Every single word must ooze clarity!”
  4. Use synonyms or alternate phrasing (e.g., instead of “Summarize,” try appending “tldr” to some input text). Swap in different words or phrases and document which ones work better and which are worse.
  5. Try the sandwich technique with long prompts: Add the same statement in different places.
  6. Use a prompt library for inspiration. Prompt Hero and this prompt gallery are two good places to start.

LLM Settings

When designing and testing prompts, you typically interact with the LLM via an API. You can configure a few parameters to get different results for your prompts. Tweaking these settings is important for improving the reliability and quality of responses, and it takes a bit of experimentation to figure out the proper settings for your use cases. Below are the common settings you will come across when using different LLM providers:

Temperature – In short, the lower the temperature, the more deterministic the results, in the sense that the most probable next token is always picked. Increasing temperature leads to more randomness, which encourages more diverse or creative outputs; you are essentially increasing the weights of the other possible tokens. In terms of application, you might want to use a lower temperature value for tasks like fact-based QA to encourage more factual and concise responses. For poem generation or other creative tasks, it might be beneficial to increase the temperature value.

Top P – A sampling technique used with temperature, called nucleus sampling, that controls how deterministic the model is. With Top P, only the tokens comprising the top_p probability mass are considered for responses, so a low top_p value restricts the model to the most confident tokens, while a high value lets it consider more possible words, including less likely ones, leading to more diverse outputs. If you are looking for exact and factual answers, keep this low; if you are looking for more diverse responses, increase it.

The general recommendation is to alter temperature or Top P but not both.

Max Length – You can manage the number of tokens the model generates by adjusting the max length. Specifying a max length helps you prevent long or irrelevant responses and control costs.

Stop Sequences – A stop sequence is a string that stops the model from generating tokens. Specifying stop sequences is another way to control the length and structure of the model’s response. For example, you can tell the model to generate lists that have no more than 10 items by adding “11” as a stop sequence.

Frequency Penalty – The frequency penalty applies a penalty on the next token proportional to how many times that token has already appeared in the response and prompt. The higher the frequency penalty, the less likely a word is to appear again. This setting reduces the repetition of words in the model’s response by giving tokens that appear more often a higher penalty.

Presence Penalty – The presence penalty also applies a penalty on repeated tokens but, unlike the frequency penalty, the penalty is the same for all repeated tokens. A token that appears twice and a token that appears 10 times are penalized the same. This setting prevents the model from repeating phrases too often in its response. If you want the model to generate diverse or creative text, you might want to use a higher presence penalty. Or, if you need the model to stay focused, try using a lower presence penalty.

Similar to temperature and top_p, the general recommendation is to alter the frequency or presence penalty but not both.
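
If you are calling a model programmatically, these settings map directly onto request parameters. Below is a minimal sketch using the OpenAI Python SDK (v1-style client); the model name and parameter values are illustrative, not recommendations:

```python
# Sketch of passing the settings discussed above to a chat completion request.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a two-line poem about the ocean."}],
    temperature=0.7,        # lower = more deterministic, higher = more diverse/creative
    top_p=1.0,              # nucleus sampling; alter this *or* temperature, not both
    max_tokens=100,         # cap on generated tokens (length and cost control)
    stop=["11."],           # stop sequence, e.g. to keep a numbered list under 10 items
    frequency_penalty=0.0,  # penalty proportional to how often a token has appeared
    presence_penalty=0.0,   # flat penalty on any token that has already appeared
)
print(response.choices[0].message.content)
```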

Before starting with some basic examples, keep in mind that your results may vary depending on the version of LLM you use.

Basics of Prompting

Prompting an LLM

You can achieve a lot with simple prompts, but the quality of results depends on how much information you provide and how well-crafted the prompt is. A prompt can contain information like the instruction or question you are passing to the model and include other details such as context, inputs, or examples. You can use these elements to instruct the model more effectively and improve the quality of results.

Let’s get started by going over a basic example of a simple prompt:

Prompt

Output:

Something to note is that when using the OpenAI chat models like gpt-3.5-turbo or gpt-4, you can structure your prompt using three different roles: system, user, and assistant. The system message is not required but helps to set the overall behavior of the assistant. The example above only includes a user message, which you can use to directly prompt the model. For simplicity, all of the examples, except where explicitly mentioned, will use only the user message to prompt the gpt-3.5-turbo model. The assistant message in the example above corresponds to the model response. You can also define an assistant message to pass examples of the desired behavior you want. You can learn more about working with chat models here.
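
As a concrete sketch of those roles, again using the OpenAI Python SDK, where the message contents are illustrative:

```python
# Sketch of the system / user / assistant message structure.
from openai import OpenAI

client = OpenAI()

messages = [
    # Optional: the system message sets the assistant's overall behavior.
    {"role": "system", "content": "You are a helpful assistant that completes sentences."},
    # A user/assistant pair can serve as an example of the desired behavior.
    {"role": "user", "content": "Complete the sentence: The grass is"},
    {"role": "assistant", "content": "The grass is green."},
    # The final user message is the actual prompt.
    {"role": "user", "content": "Complete the sentence: The sky is"},
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```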

You can observe from the prompt example above that the language model responds with a sequence of tokens that make sense given the context "The sky is". The output might be unexpected or far from the task you want to accomplish. In fact, this basic example highlights the necessity to provide more context or instructions on what specifically you want to achieve with the system. This is what prompt engineering is all about.

Let’s try to improve it a bit:

Prompt:

Output:

Is that better? Well, with the prompt above you are instructing the model to complete the sentence so the result looks a lot better as it follows exactly what you told it to do (“complete the sentence”). This approach of designing effective prompts to instruct the model to perform a desired task is what’s referred to as prompt engineering in this guide.

The example above is a basic illustration of what’s possible with LLMs today. Today’s LLMs are able to perform all kinds of advanced tasks that range from text summarization to mathematical reasoning to code generation.

Prompt Formatting

You have tried a very simple prompt above. A standard prompt is phrased either as a bare question (“<Question>?”) or as an instruction (“<Instruction>”).

You can format this into a question answering (QA) format, which is standard in a lot of QA datasets, as follows:
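
The template itself was lost in formatting; a minimal sketch of the QA format, with an illustrative placeholder:

```python
# Sketch of the QA prompt format; the answer is left blank for the model to complete.
qa_prompt = """\
Q: <Question>?
A:"""
```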

When prompting like the above, it’s also referred to as zero-shot prompting, i.e., you are directly prompting the model for a response without any examples or demonstrations about the task you want it to achieve. Some large language models have the ability to perform zero-shot prompting, but it depends on the complexity and knowledge of the task at hand and the tasks the model was trained to perform well on.

A concrete prompt example is as follows:

Prompt

With some of the more recent models you can skip the “Q:” part as it is implied and understood by the model as a question answering task based on how the sequence is composed. In other words, the prompt could be simplified as follows:

Prompt

Given the standard format above, one popular and effective technique to prompting is referred to as few-shot prompting where you provide exemplars (i.e., demonstrations). You can format few-shot prompts as follows:

The QA format version would look like this:

Keep in mind that it’s not required to use the QA format. The prompt format depends on the task at hand. For instance, you can perform a simple classification task and give exemplars that demonstrate the task as follows:

Prompt:

Output:

Few-shot prompts enable in-context learning, which is the ability of language models to learn tasks given a few demonstrations. We discuss zero-shot prompting and few-shot prompting more extensively in upcoming sections.

Elements of a Prompt

As we cover more and more examples and applications with prompt engineering, you will notice that certain elements make up a prompt.

A prompt contains any of the following elements:

Instruction – a specific task or instruction you want the model to perform

Context – external information or additional context that can steer the model to better responses

Input Data – the input or question that we are interested to find a response for

Output Indicator – the type or format of the output.

To demonstrate the prompt elements better, here is a simple prompt that aims to perform a text classification task:

Prompt
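
The prompt did not survive formatting in this copy; reconstructed from the description below, it would look roughly like this:

```python
# Reconstruction of the classification prompt described below:
# an instruction, the input data, and the "Sentiment:" output indicator.
prompt = """\
Classify the text into neutral, negative, or positive.
Text: I think the food was okay.
Sentiment:"""
```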

In the prompt example above, the instruction corresponds to the classification task, “Classify the text into neutral, negative, or positive”. The input data corresponds to the “I think the food was okay.” part, and the output indicator used is “Sentiment:”. Note that this basic example doesn’t use context, but context can also be provided as part of the prompt. For instance, the context for this text classification prompt could be additional examples provided as part of the prompt to help the model better understand the task and steer the type of outputs that you expect.

You do not need all four elements for a prompt, and the format depends on the task at hand. We will touch on more concrete examples in upcoming guides.

Here are some tips to keep in mind while you are designing your prompts:

Start Simple

As you get started with designing prompts, you should keep in mind that it is really an iterative process that requires a lot of experimentation to get optimal results. Using a simple playground from OpenAI or Cohere is a good starting point.

You can start with simple prompts and keep adding more elements and context as you aim for better results. Iterating your prompt along the way is vital for this reason. As you read the guide, you will see many examples where specificity, simplicity, and conciseness will often give you better results.

When you have a big task that involves many different subtasks, you can try to break down the task into simpler subtasks and keep building up as you get better results. This avoids adding too much complexity to the prompt design process at the beginning.

The Instruction

You can design effective prompts for various simple tasks by using commands to instruct the model what you want to achieve, such as “Write”, “Classify”, “Summarize”, “Translate”, “Order”, etc.

Keep in mind that you also need to experiment a lot to see what works best. Try different instructions with different keywords, contexts, and data and see what works best for your particular use case and task. Usually, the more specific and relevant the context is to the task you are trying to perform, the better. We will touch on the importance of sampling and adding more context in the upcoming guides.

A common recommendation is to place instructions at the beginning of the prompt. Another is to use a clear separator like “###” between the instruction and the context.

For instance:

Prompt:
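
The example itself is not reproduced in this copy; a sketch of the instruction-first, separator-based pattern might look like this, with an illustrative task and input text:

```python
# Sketch of placing the instruction first and separating it from the context with "###".
prompt = """\
### Instruction ###
Translate the text below to Spanish.

Text: "hello!"
"""
```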

Output:

Specificity

Be very specific about the instruction and task you want the model to perform. The more descriptive and detailed the prompt is, the better the results. This is particularly important when you have a desired outcome or style of generation you are seeking. There aren’t specific tokens or keywords that lead to better results. It’s more important to have a good format and descriptive prompt. In fact, providing examples in the prompt is very effective to get desired output in specific formats.

When designing prompts, you should also keep in mind the length of the prompt, as there are limits on how long it can be. Think about how specific and detailed you need to be: including too many unnecessary details is not necessarily a good approach. The details should be relevant and contribute to the task at hand. This is something you will need to experiment with a lot. We encourage plenty of experimentation and iteration to optimize prompts for your applications.

As an example, let’s try a simple prompt to extract specific information from a piece of text.

Prompt:

Output:

Avoid Imprecision

Given the tips above about being detailed and improving format, it’s easy to fall into the trap of wanting to be too clever about prompts and potentially creating imprecise descriptions. It’s often better to be specific and direct. The analogy here is very similar to effective communication — the more direct, the more effective the message gets across.

For example, you might be interested in learning the concept of prompt engineering. You might try something like:

It’s not clear from the prompt above how many sentences to use or what style. You might still get reasonably good responses with the above prompt, but a better prompt would be one that is very specific, concise, and to the point. Something like:
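
The improved prompt is not shown in this copy; a sketch of what a more specific, direct version might look like, where the audience and length constraints are illustrative:

```python
# Sketch of a specific, direct prompt: explicit length and audience.
prompt = (
    "Use 2-3 sentences to explain the concept of prompt engineering "
    "to a high school student."
)
```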

To do or not to do?

Another common tip when designing prompts is to avoid saying what not to do but say what to do instead. This encourages more specificity and focuses on the details that lead to good responses from the model.

Here is an example of a movie recommendation chatbot failing at exactly what I don’t want it to do because the instruction focuses on what not to do.

Prompt:

Output:

Here is a better prompt:

Prompt:

Output:

Examples of Prompts

The previous section introduced a basic example of how to prompt LLMs.

This section will provide more examples of how to use prompts to achieve different tasks and introduce key concepts along the way. Often, the best way to learn concepts is by going through examples. The few examples below illustrate how you can use well-crafted prompts to perform different types of tasks.

Text Summarization

One of the standard tasks in natural language generation is text summarization. Text summarization can include many different flavors and domains. In fact, one of the most promising applications of language models is the ability to summarize articles and concepts into quick and easy-to-read summaries. Let’s try a basic summarization task using prompts.

Let’s say you are interested in learning about antibiotics. You could try a prompt like this:

Prompt:

Output:

The “A:” is an explicit prompt format used in question answering. You used it here to tell the model that an answer is expected to follow. In this example, it’s not clear how this is useful versus not using it, but we will leave that for later examples. Let’s just assume that this is too much information and you want to summarize it further. In fact, you can instruct the model to summarize into one sentence like so:

Prompt:

Output:

Without paying too much attention to the accuracy of the output above, which is something we will touch on in a later guide, the model tried to summarize the paragraph in one sentence. You can get clever with the instructions but we will leave that for a later chapter. Feel free to pause here and experiment to see if you get better results.


Information Extraction

While language models are trained to perform natural language generation and related tasks, they are also very capable of performing classification and a range of other natural language processing (NLP) tasks.

Here is an example of a prompt that extracts information from a given paragraph.

Prompt:

Output:

There are many ways you can improve the results above, but this is already very useful.

By now it should be obvious that you can ask the model to perform different tasks by simply instructing it what to do. That’s a powerful capability that AI product developers are already using to build powerful products and experiences.

Paragraph source: ChatGPT: five priorities for research


Question Answering

One of the best ways to get the model to respond with specific answers is to improve the format of the prompt. As covered before, a prompt could combine instructions, context, input, and output indicators to get improved results. While these components are not required, it becomes a good practice as the more specific you are with instruction, the better results you will get. Below is an example of how this would look following a more structured prompt.

Prompt:

Output:

Context obtained from Nature.


Text Classification

So far, you have used simple instructions to perform a task. As a prompt engineer, you need to get better at providing better instructions. But that’s not all! You will also find that for harder use cases, just providing instructions won’t be enough. This is where you need to think more about the context and the different elements you can use in a prompt. Other elements you can provide are input data or examples.

Let’s try to demonstrate this by providing an example of text classification.

Prompt:

Output:

You gave the instruction to classify the text and the model responded with 'Neutral', which is correct. Nothing is wrong with this, but let’s say that what you really need is for the model to give the label in the exact format you want. So instead of Neutral, you want it to return neutral. How do you achieve this? There are different ways to do this. You care about specificity here, so the more information you can provide in the prompt, the better the results. You can try providing examples to specify the correct behavior. Let’s try again:

Prompt:
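
The prompt is not preserved in this copy; it would look roughly like the sketch below, where a single exemplar pins down the lowercase label format (the exemplar text is illustrative):

```python
# Sketch of a classification prompt with one exemplar demonstrating the lowercase label.
prompt = """\
Classify the text into neutral, negative or positive.
Text: I think the vacation was okay.
Sentiment: neutral

Text: I think the food was okay.
Sentiment:"""
```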

Output:

Perfect! This time the model returned neutral which is the specific label you were looking for. It seems that the example provided in the prompt helped the model to be specific in its output.

To highlight why sometimes being specific is important, check out the example below and spot the problem:

Prompt:

Output:

What is the problem here? As a hint, the made-up nutral label is completely ignored by the model. Instead, the model outputs Neutral as it has some bias towards that label. But let’s assume that what you really want is nutral. How would you fix this? Maybe you can try adding descriptions to the labels or add more examples to the prompt? If you are not sure, we will discuss a few ideas in the upcoming sections.


Conversation

Perhaps one of the more interesting things you can achieve with prompt engineering is instructing the LLM system on how to behave, its intent, and its identity. This is particularly useful when you are building conversational systems like customer service chatbots.

For instance, let’s create a conversational system that’s able to generate more technical and scientific responses to questions. Note how you are explicitly telling it how to behave through the instruction. This is sometimes referred to as role prompting.

Prompt:

Output:

Our AI research assistant sounds a bit too technical, right? Okay, let’s change this behavior and instruct the system to give more accessible answers.

Prompt:

Output:

You made some progress. You can continue improving it. If you add more examples, you might get even better results.


Code Generation

One application where LLMs are quite effective is code generation. Copilot is a great example of this. There are a vast number of code-generation tasks you can perform with clever prompts. Let’s look at a few examples below.

First, let’s try a simple program that greets the user.

Prompt:

Output:

Notice that you didn’t even need to specify the language to use.

Let’s switch levels a bit. The example below shows how powerful LLMs can be with a little more effort in designing the prompts.

Prompt:
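
The prompt is not reproduced in this copy; a sketch of the schema-plus-request pattern it describes, with illustrative table and column names:

```python
# Sketch of a code-generation prompt: provide the schema, then ask for a query.
prompt = """\
Table departments, columns = [DepartmentId, DepartmentName]
Table students, columns = [DepartmentId, StudentId, StudentName]
Create a MySQL query for all students in the Computer Science department.
"""
```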

Output:

This is very impressive. In this case, you provided data about the database schema and asked it to generate a valid MySQL query.


Reasoning

Perhaps one of the most difficult tasks for an LLM today is one that requires some form of reasoning. Reasoning is one of the most interesting areas because of the types of complex applications that can emerge from LLMs.

There have been some improvements in tasks involving mathematical capabilities. That said, it’s important to note that current LLMs struggle with reasoning tasks, so this requires even more advanced prompt engineering techniques. We will cover these advanced techniques in the next guide. For now, we will cover a few basic examples to show arithmetic capabilities.

Prompt:

Output:

Let’s try something more difficult.

Prompt:

Output

That’s incorrect! Let’s try to improve this by improving the prompt.

Prompt:

Output:

Much better, right? By the way, we tried this task a couple of times and the model sometimes fails. If you provide better instructions combined with examples, it might help get more accurate results.

In the upcoming section, we will cover even more advanced prompt engineering concepts and techniques for improving performance on all these and more difficult tasks.

Zero-Shot Prompting

Large language models (LLMs) today, such as GPT-3.5 Turbo, GPT-4, and Claude 3, are tuned to follow instructions and are trained on large amounts of data. Large-scale training makes these models capable of performing some tasks in a “zero-shot” manner. Zero-shot prompting means that the prompt used to interact with the model won’t contain examples or demonstrations. The zero-shot prompt directly instructs the model to perform a task without any additional examples to steer it.

We tried a few zero-shot examples in the previous section. Here is one of the examples (i.e., text classification) we used:

Prompt:

Output:

Note that in the prompt above we didn’t provide the model with any examples of text alongside their classifications; the LLM already understands “sentiment”. That’s its zero-shot capability at work.

Instruction tuning has been shown to improve zero-shot learning (Wei et al., 2022). Instruction tuning is essentially the concept of fine-tuning models on datasets described via instructions. Furthermore, RLHF (reinforcement learning from human feedback) has been adopted to scale instruction tuning, wherein the model is aligned to better fit human preferences. This recent development powers models like ChatGPT. We will discuss all these approaches and methods in upcoming sections.

When zero-shot doesn’t work, it’s recommended to provide demonstrations or examples in the prompt which leads to few-shot prompting. In the next section, we demonstrate few-shot prompting.

Few-Shot Prompting

While large language models demonstrate remarkable zero-shot capabilities, they still fall short on more complex tasks in the zero-shot setting. Few-shot prompting can be used as a technique to enable in-context learning, where we provide demonstrations in the prompt to steer the model to better performance. The demonstrations serve as conditioning for subsequent examples where we would like the model to generate a response.

According to Touvron et al. (2023), few-shot properties first appeared when models were scaled to a sufficient size (Kaplan et al., 2020).

Let’s demonstrate few-shot prompting via an example that was presented in Brown et al. 2020. In the example, the task is to correctly use a new word in a sentence.

Prompt:

Output:

We can observe that the model has somehow learned how to perform the task by providing it with just one example (i.e., 1-shot). For more difficult tasks, we can experiment with increasing the demonstrations (e.g., 3-shot, 5-shot, 10-shot, etc.).

Following the findings from Min et al. (2022), here are a few more tips about demonstrations/exemplars when doing few-shot:

  • “the label space and the distribution of the input text specified by the demonstrations are both important (regardless of whether the labels are correct for individual inputs)”
  • the format you use also plays a key role in performance; even if you just use random labels, this is much better than no labels at all.
  • additional results show that selecting random labels from a true distribution of labels (instead of a uniform distribution) also helps.

Let’s try out a few examples. Let’s first try an example with random labels (meaning the labels Negative and Positive are randomly assigned to the inputs):

Prompt:

Output:

We still get the correct answer, even though the labels have been randomized. Note that we also kept the format, which helps too. In fact, with further experimentation, it seems the newer GPT models we are experimenting with are becoming more robust to even random formats. Example:

Prompt:

Output:

There is no consistency in the format above but the model still predicted the correct label. We have to conduct a more thorough analysis to confirm if this holds for different and more complex tasks, including different variations of prompts.

Limitations of Few-shot Prompting

Standard few-shot prompting works well for many tasks but is still not a perfect technique, especially when dealing with more complex reasoning tasks. Let’s demonstrate why this is the case. Do you recall the previous example where we provided the following task:

If we try this again, the model outputs the following:

This is not the correct response, which not only highlights the limitations of these systems but also shows that there is a need for more advanced prompt engineering.

Let’s try to add some examples to see if few-shot prompting improves the results.

Prompt:

Output:

That didn’t work. It seems like few-shot prompting is not enough to get reliable responses for this type of reasoning problem. The example above provides basic information on the task. If you take a closer look, the type of task we have introduced involves a few more reasoning steps. In other words, it might help if we break the problem down into steps and demonstrate that to the model. More recently, chain-of-thought (CoT) prompting has been popularized to address more complex arithmetic, commonsense, and symbolic reasoning tasks.

Overall, it seems that providing examples is useful for solving some tasks. When zero-shot prompting and few-shot prompting are not sufficient, it might mean that whatever was learned by the model isn’t enough to do well at the task. From here it is recommended to start thinking about fine-tuning your models or experimenting with more advanced prompting techniques. Up next, we talk about one such popular technique: chain-of-thought prompting.

Chain-of-Thought (CoT) Prompting

Image Source: Wei et al. (2022)

Introduced in Wei et al. (2022), chain-of-thought (CoT) prompting enables complex reasoning capabilities through intermediate reasoning steps. You can combine it with few-shot prompting to get better results on more complex tasks that require reasoning before responding.

Prompt:

Output:

Wow! We can see a perfect result when we provided the reasoning step. In fact, we can solve this task by providing even fewer examples, i.e., just one example seems enough:

Prompt:
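
The prompt is not preserved in this copy; a sketch of a single chain-of-thought exemplar followed by the test question, where the numbers are illustrative:

```python
# Sketch of a one-exemplar CoT prompt: the exemplar shows the intermediate reasoning.
prompt = """\
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:"""
```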

Output:

Keep in mind that the authors claim that this is an emergent ability that arises with sufficiently large language models.

Zero-Shot CoT Prompting

Image Source: Kojima et al. (2022)

One idea that came out more recently is zero-shot CoT (Kojima et al. 2022), which essentially involves adding “Let’s think step by step” to the original prompt. Let’s try a simple problem and see how the model performs:

Prompt:

Output:

The answer is incorrect! Now let’s try with the special prompt.

Prompt:
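
The prompt is not preserved in this copy; a sketch of the zero-shot CoT pattern, which simply appends the trigger phrase to an ordinary question (the word problem is illustrative):

```python
# Sketch of zero-shot CoT: append "Let's think step by step." to the question.
question = (
    "I went to the market and bought 10 apples. I gave 2 apples to the neighbor "
    "and 2 to the repairman. I then bought 5 more apples and ate 1. "
    "How many apples did I have left?"
)
prompt = question + "\n\nLet's think step by step."
```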

Output:

It’s impressive that this simple prompt is effective at this task. This is particularly useful when you don’t have many examples to use in the prompt.

Automatic Chain-of-Thought (Auto-CoT)

When applying chain-of-thought prompting with demonstrations, the process involves hand-crafting effective and diverse examples. This manual effort could lead to suboptimal solutions. Zhang et al. (2022) propose an approach to eliminate manual effort by leveraging LLMs with the “Let’s think step by step” prompt to generate reasoning chains for demonstrations one by one. This automatic process can still end up with mistakes in generated chains. To mitigate the effects of the mistakes, the diversity of demonstrations matters. This work proposes Auto-CoT, which samples questions with diversity and generates reasoning chains to construct the demonstrations.

Auto-CoT consists of two main stages:

  • Stage 1 – Question clustering: partition questions of a given dataset into a few clusters
  • Stage 2 – Demonstration sampling: select a representative question from each cluster and generate its reasoning chain using Zero-Shot-CoT with simple heuristics

The simple heuristics could be length of questions (e.g., 60 tokens) and number of steps in rationale (e.g., 5 reasoning steps). This encourages the model to use simple and accurate demonstrations.

The process is illustrated below:

Image Source: Zhang et al. (2022)

Code for Auto-CoT is available here.
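
Since neither the figure nor the code is reproduced in this copy, the sketch below illustrates the two stages in simplified form. It is not the authors' implementation: the questions are made up, TF-IDF plus k-means stands in for the paper's embeddings, and call_llm is a placeholder for a real model call:

```python
# A highly simplified sketch of the Auto-CoT idea (illustrative, not the authors' code).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

questions = [
    "If there are 3 cars and 2 more arrive, how many cars are there?",
    "A baker made 12 rolls and sold 5. How many rolls are left?",
    "What is 15% of 200?",
    "A train travels 60 km in 1 hour. How far does it travel in 3 hours?",
]

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM call (e.g., via an API client)."""
    return "Let's think step by step. ..."

# Stage 1: question clustering.
vectors = TfidfVectorizer().fit_transform(questions)
n_clusters = 2
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)

# Stage 2: demonstration sampling. Pick a representative question per cluster
# (shortest question as a crude stand-in for the length heuristic) and generate
# its reasoning chain with Zero-Shot-CoT.
demonstrations = []
for c in range(n_clusters):
    members = [q for q, label in zip(questions, labels) if label == c]
    representative = min(members, key=len)
    rationale = call_llm(f"Q: {representative}\nA: Let's think step by step.")
    demonstrations.append(f"Q: {representative}\nA: {rationale}")

# The demonstrations are then prepended to a new test question.
print("\n\n".join(demonstrations))
```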

Meta Prompting

Introduction

Meta Prompting is an advanced prompting technique that focuses on the structural and syntactical aspects of tasks and problems rather than their specific content details. The goal of meta prompting is to construct a more abstract, structured way of interacting with large language models (LLMs), emphasizing the form and pattern of information over traditional content-centric methods.

Key Characteristics

According to Zhang et al. (2024), the key characteristics of meta prompting can be summarized as follows:

1. Structure-oriented: Prioritizes the format and pattern of problems and solutions over specific content.

2. Syntax-focused: Uses syntax as a guiding template for the expected response or solution.

3. Abstract examples: Employs abstracted examples as frameworks, illustrating the structure of problems and solutions without focusing on specific details.

4. Versatile: Applicable across various domains, capable of providing structured responses to a wide range of problems.

5. Categorical approach: Draws from type theory to emphasize the categorization and logical arrangement of components in a prompt.

Advantages over Few-Shot Prompting

Zhang et al. (2024) report that meta prompting and few-shot prompting differ in that meta prompting focuses on a more structure-oriented approach, as opposed to the content-driven approach that few-shot prompting emphasizes.

The following example obtained from Zhang et al. (2024) demonstrates the difference between a structured meta prompt and a few-shot prompt for solving problems from the MATH benchmark:
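
That example is not reproduced in this copy; as a rough illustration only, and not the paper's exact template, a structure-oriented prompt might look like this:

```python
# Illustrative structure-oriented (meta) prompt; the section names are made up.
meta_prompt = """\
Problem: [problem statement goes here]

Solution structure:
1. Begin the response with "Let's think step by step."
2. Identify the given quantities and what is being asked.
3. Set up the relevant equation(s) in general form.
4. Solve the equation(s), showing each algebraic step.
5. End the response with "The final answer is [answer]."
"""
```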

The advantages of Meta Prompting over few-shot prompting include:

1. Token efficiency: Reduces the number of tokens required by focusing on structure rather than detailed content.

2. Fair comparison: Provides a fairer approach for comparing different problem-solving models by minimizing the influence of specific examples.

3. Zero-shot efficacy: Can be viewed as a form of zero-shot prompting, where the influence of specific examples is minimized.

Applications

By focusing on the structural patterns of problem-solving, Meta Prompting offers a clear roadmap for navigating complex topics, enhancing the reasoning capabilities of LLMs across various domains.

It’s important to note that meta prompting also assumes that the LLM has innate knowledge about the specific task or problem being addressed. Because LLMs can generalize to unseen tasks, it is possible that they can be leveraged with meta prompting, but performance might deteriorate with more unique and novel tasks, as is the case with zero-shot prompting.

Applications where meta prompting can be beneficial include, but are not limited to, complex reasoning tasks, mathematical problem-solving, coding challenges, and theoretical queries.

Self-Consistency

Perhaps one of the more advanced techniques out there for prompt engineering is self-consistency. Proposed by Wang et al. (2022), self-consistency aims “to replace the naive greedy decoding used in chain-of-thought prompting”. The idea is to sample multiple, diverse reasoning paths through few-shot CoT, and use the generations to select the most consistent answer. This helps to boost the performance of CoT prompting on tasks involving arithmetic and commonsense reasoning.

Let’s try the following example for arithmetic reasoning:

Prompt:

Output:

The output is wrong! How may we improve this with self-consistency? Let’s try it out. We will use the few-shot exemplars from Wang et al. 2022 (Table 17):

Prompt:

Output 1:

Output 2:

Output 3:

Computing the final answer involves a few steps (check out the paper for the details), but for the sake of simplicity, we can see that a majority answer is already emerging, so that would essentially become the final answer.
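
As a sketch of how self-consistency can be wired up in code: sample several chain-of-thought completions at a non-zero temperature and take a majority vote over the extracted final answers. The model name, sample count, and answer-extraction regex below are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of self-consistency: sample diverse reasoning paths, then majority-vote.
import re
from collections import Counter
from openai import OpenAI

client = OpenAI()

def final_answer(text: str):
    """Grab the last number in a completion as the candidate answer (crude heuristic)."""
    numbers = re.findall(r"-?\d+", text)
    return numbers[-1] if numbers else None

prompt = "..."  # a few-shot CoT prompt, e.g. the exemplars from Wang et al. 2022 (Table 17)

samples = [
    client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # non-zero so the reasoning paths are diverse
    ).choices[0].message.content
    for _ in range(5)
]

votes = Counter(a for a in map(final_answer, samples) if a is not None)
print(votes.most_common(1))  # the majority answer becomes the final answer
```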

Generated Knowledge Prompting

Image Source: Liu et al. 2022

LLMs continue to improve, and one popular technique involves incorporating knowledge or information to help the model make more accurate predictions.

Using a similar idea, can the model also be used to generate knowledge before making a prediction? That’s what is attempted in the paper by Liu et al. 2022 — generate knowledge to be used as part of the prompt. In particular, how helpful is this for tasks such as commonsense reasoning?

Let’s try a simple prompt:

Prompt:

Output:

This type of mistake reveals the limitations of LLMs to perform tasks that require more knowledge about the world. How do we improve this with knowledge generation?

First, we generate a few “knowledges”:

Prompt: