The Prompt Patterns That Decide If an AI Is “Correct” or “Wrong”

This article unpacks how large language models are evaluated on CRITICBENCH using few-shot chain-of-thought prompting. Unlike zero-shot prompting, this approach applies equally to pretrained and instruction-tuned models and grounds judgments in principle-driven exemplars. Evaluation covers GSM8K, HumanEval, and TruthfulQA with carefully crafted prompts, repeated trials, and accuracy extracted from a consistent output pattern, offering a rigorous lens into how well AI systems truly perform.


This content originally appeared on HackerNoon and was authored by Large Models (dot tech)

Abstract and 1. Introduction

  2. Definition of Critique Ability

  3. Construction of CriticBench

    3.1 Data Generation

    3.2 Data Selection

  4. Properties of Critique Ability

    4.1 Scaling Law

    4.2 Self-Critique Ability

    4.3 Correlation to Certainty

  5. New Capacity with Critique: Self-Consistency with Self-Check

  6. Conclusion, References, and Acknowledgments

A. Notations

B. CriticBench: Sources of Queries

C. CriticBench: Data Generation Details

D. CriticBench: Data Selection Details

E. CriticBench: Statistics and Examples

F. Evaluation Settings

F EVALUATION SETTINGS

To evaluate large language models on CRITICBENCH, we employ few-shot chain-of-thought prompting rather than zero-shot. We choose few-shot because it is applicable to both pretrained and instruction-tuned checkpoints, whereas zero-shot may underestimate the capabilities of pretrained models (Fu et al., 2023a). The prompt design draws inspiration from Constitutional AI (Bai et al., 2022) and principle-driven prompting (Sun et al., 2023), in that the prompts always start with general principles, followed by multiple exemplars.

Figure 8: Examples from Critic-GSM8K.

Figure 9: Examples from Critic-HumanEval.
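To make the principle-then-exemplars structure concrete, here is a minimal sketch of how such a critique prompt could be assembled. The principle wording, exemplar contents, and the build_prompt helper are illustrative assumptions, not the paper's exact prompts; the actual prompts appear in Listings 2–4.

```python
# Illustrative sketch of a principle-driven few-shot critique prompt.
# All strings below are assumed examples, not the prompts from the paper.

PRINCIPLES = (
    "You are given a question and a candidate response. "
    "Analyze the response step by step, then decide whether it is correct."
)

# Each exemplar ends with the fixed pattern "Judgment: X." so that the
# model's own outputs can be parsed the same way during evaluation.
EXEMPLARS = [
    {
        "question": "Natalia sold clips to 48 of her friends in April ...",
        "response": "48 / 2 = 24 clips in May. 48 + 24 = 72. The answer is 72.",
        "analysis": "Each arithmetic step checks out, so the response is sound.",
        "judgment": "correct",
    },
    # ... more exemplars (5 for GSM8K/TruthfulQA, 3 for HumanEval)
]

def build_prompt(question: str, response: str) -> str:
    """Assemble principles, worked exemplars, and the query to critique."""
    parts = [PRINCIPLES, ""]
    for ex in EXEMPLARS:
        parts += [
            f"Question: {ex['question']}",
            f"Response: {ex['response']}",
            f"Analysis: {ex['analysis']}",
            f"Judgment: {ex['judgment']}.",
            "",
        ]
    parts += [f"Question: {question}", f"Response: {response}", "Analysis:"]
    return "\n".join(parts)
```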

In the evaluation process, we use a temperature of 0.6 for generating the judgment, preceded by a chain-of-thought analysis. Each model is evaluated 8 times, and the average accuracy is reported. The few-shot exemplars always end with the pattern "Judgment: X.", where X is either correct or incorrect. We search for this pattern in the model output and extract X. In rare cases where the pattern is absent, the result defaults to correct.
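The extraction and averaging step described above can be sketched as follows. The regular expression and function names are our own illustration under the stated protocol (pattern "Judgment: X.", default to correct, 8 trials), not code released with the paper.

```python
import re
from statistics import mean

# The few-shot exemplars end with "Judgment: X.", so we look for the same
# pattern in the model's output and read off X.
JUDGMENT_RE = re.compile(r"Judgment:\s*(correct|incorrect)\b", re.IGNORECASE)

def extract_judgment(output: str) -> str:
    """Return 'correct' or 'incorrect'; default to 'correct' if the pattern is absent."""
    match = JUDGMENT_RE.search(output)
    return match.group(1).lower() if match else "correct"

def trial_accuracy(outputs: list[str], labels: list[str]) -> float:
    """Fraction of examples where the extracted judgment matches the label."""
    preds = [extract_judgment(o) for o in outputs]
    return mean(p == l for p, l in zip(preds, labels))

def average_accuracy(trials: list[list[str]], labels: list[str]) -> float:
    """Average accuracy over repeated evaluations (8 trials in the paper)."""
    return mean(trial_accuracy(outputs, labels) for outputs in trials)
```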

Figure 10: Examples from Critic-TruthfulQA.

F.1 PROMPT FOR CRITIC-GSM8K

Listing 2 shows the 5-shot chain-of-thought prompt used to evaluate on Critic-GSM8K. We construct the exemplars by choosing 5 random questions from the training split of GSM8K (Cobbe et al., 2021) and sampling responses with PaLM-2-L (Google et al., 2023). We manually select responses of appropriate quality. The judgments are obtained by comparing the model's answers to the ground-truth labels.
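As a rough illustration of that last step, the sketch below labels a sampled response by comparing its final numeric answer against the GSM8K ground-truth solution. The answer-extraction regexes and helper names are assumptions for illustration, not code from the paper.

```python
import re

# GSM8K ground-truth solutions end with "#### <answer>"; sampled responses
# typically end with "The answer is <answer>." (extraction patterns assumed).
GOLD_RE = re.compile(r"####\s*(-?[\d,]+(?:\.\d+)?)")
PRED_RE = re.compile(r"answer is\s*\$?(-?[\d,]+(?:\.\d+)?)", re.IGNORECASE)

def final_number(text: str, pattern: re.Pattern) -> str | None:
    """Extract the final numeric answer, stripping thousands separators."""
    match = pattern.search(text)
    return match.group(1).replace(",", "") if match else None

def judge(response: str, gold_solution: str) -> str:
    """Label the response 'correct' if its final answer matches the label."""
    pred = final_number(response, PRED_RE)
    gold = final_number(gold_solution, GOLD_RE)
    return "correct" if pred is not None and pred == gold else "incorrect"
```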


Listing 2: 5-shot chain-of-thought prompt for Critic-GSM8K.

F.2 PROMPT FOR CRITIC-HUMANEVAL

Listing 3 presents the 3-shot chain-of-thought prompt for Critic-HumanEval. Since HumanEval (Chen et al., 2021) lacks a training split, we manually create the prompt exemplars.


Listing 3: 3-shot chain-of-thought prompt for Critic-HumanEval.

F.3 PROMPT FOR CRITIC-TRUTHFULQA

Listing 4 presents the 5-shot chain-of-thought prompt for Critic-TruthfulQA. Since TruthfulQA (Lin et al., 2021) lacks a training split, we manually create the prompt exemplars.


Listing 4: 5-shot chain-of-thought prompt for Critic-TruthfulQA.


:::info Authors:

(1) Liangchen Luo, Google Research (luolc@google.com);

(2) Zi Lin, UC San Diego;

(3) Yinxiao Liu, Google Research;

(4) Yun Zhu, Google Research;

(5) Jingbo Shang, UC San Diego;

(6) Lei Meng, Google Research (leimeng@google.com).

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::
