# Learn about LLM Benchmarks

### Why Are LLM Benchmarks Important?

{% stepper %}
{% step %}

#### Benchmarks provide real data, not hype.

Benchmarks give you an objective way to evaluate models across reasoning, coding, math, language understanding, and safety. This helps you base decisions on measurable performance rather than claims alone.
{% endstep %}

{% step %}

#### Benchmarks help you match the right model to your use case.

Every model has strengths. Some lead on coding, while others perform better in reasoning or multilingual tasks. Understanding these differences helps you choose the model that best fits what you are trying to do.
{% endstep %}

{% step %}

#### Using multiple benchmarks gives you a clearer picture.

Models can perform very differently depending on what is being tested. Looking at multiple benchmarks gives you a fuller and more accurate view of a model’s capabilities.
{% endstep %}

{% step %}

#### They help guide your decision, but they do not provide definitive answers.

A strong score reflects performance under specific, controlled conditions. It does not guarantee the same results in your environment. Benchmarks should be used to narrow your options, then validated through real-world testing before making a final decision.
{% endstep %}
{% endstepper %}
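
The last step, narrowing options with benchmarks and then validating on your own tasks, can be sketched as a small evaluation loop. This is a minimal illustration under stated assumptions: `pass_rate`, the test cases, and the stand-in `fake_model` function are all hypothetical, and in practice `fake_model` would be replaced by a call to the model you are evaluating.

```python
from typing import Callable, Iterable, Tuple

# A test case pairs a prompt with a function that checks the model's answer.
Case = Tuple[str, Callable[[str], bool]]


def pass_rate(model: Callable[[str], str], cases: Iterable[Case]) -> float:
    """Return the fraction of cases whose check accepts the model's answer."""
    results = [check(model(prompt)) for prompt, check in cases]
    if not results:
        raise ValueError("at least one test case is required")
    return sum(results) / len(results)


# Hypothetical stand-in for a real model call; replace with your own client.
def fake_model(prompt: str) -> str:
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "I don't know")


# Hypothetical cases drawn from your own workload, not from a public benchmark.
cases = [
    ("What is 2 + 2?", lambda answer: answer.strip() == "4"),
    ("Capital of France?", lambda answer: "paris" in answer.lower()),
    ("Summarize our refund policy.", lambda answer: "refund" in answer.lower()),
]

print(f"pass rate: {pass_rate(fake_model, cases):.0%}")
```

Running the same cases against each shortlisted model gives a like-for-like number from your own environment, which is the kind of evidence a public benchmark score cannot supply.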

### List of LLM Benchmarks

* Streamlined Benchmark List: A quick reference to the benchmarks and leaderboards most commonly used to compare frontier models today.
* Full Benchmark List: A broader reference across major capability areas, for when you need a deeper or more specific evaluation.

{% tabs %}
{% tab title="Streamlined Benchmark List" %}

<table><thead><tr><th width="174.22222900390625">Benchmark</th><th width="278.77777099609375">What It Evaluates</th><th>Status</th></tr></thead><tbody><tr><td><strong>MMLU-Pro</strong></td><td>Graduate-level knowledge across multiple disciplines, using 10 answer choices instead of 4 to reduce guesswork and better separate model performance</td><td>Active - remains a strong differentiator, with scores typically lower than standard MMLU</td></tr><tr><td><strong>GPQA Diamond</strong></td><td>Expert-level scientific reasoning across biology, physics, and chemistry, built to test frontier reasoning beyond standard knowledge recall</td><td>Active differentiator - one of the strongest benchmarks for advanced scientific reasoning</td></tr><tr><td><strong>AA Quality Index</strong></td><td>A composite intelligence score from Artificial Analysis that combines results across multiple benchmarks into a single comparative metric</td><td>Active - updated as new models and benchmarks are added</td></tr><tr><td><strong>Chatbot Arena (LMSYS)</strong></td><td>Human preference in open-ended conversations, based on blind pairwise comparisons between models</td><td>Widely referenced - reflects real user preference rather than only controlled test performance</td></tr><tr><td><strong>LiveCodeBench</strong></td><td>Code generation on newly released competitive programming tasks, designed to reduce training data contamination</td><td>Active - regularly refreshed, making it one of the most current coding benchmarks available</td></tr><tr><td><strong>AIME 2025</strong></td><td>Advanced mathematical reasoning through olympiad-style problems that require multi-step problem solving</td><td>Active and highly challenging - few frontier models perform strongly here</td></tr><tr><td><strong>SWE-bench Verified</strong></td><td>Real-world software engineering through verified GitHub issue resolution across full codebases</td><td>Gold standard for coding - evaluates practical engineering ability beyond isolated code generation</td></tr></tbody></table>
{% endtab %}

{% tab title="Full Benchmark List" %}

<table><thead><tr><th>Benchmark</th><th width="454.2222900390625">What It Evaluates</th><th>Status</th></tr></thead><tbody><tr><td><strong>MMLU</strong></td><td>Broad knowledge across 57 academic subjects, including STEM, humanities, and professional disciplines</td><td>Saturated</td></tr><tr><td><strong>MMLU-Pro</strong></td><td>A more difficult version of MMLU with 10 answer choices that reduce guesswork and better separate model performance</td><td>Active</td></tr><tr><td><strong>GPQA Diamond</strong></td><td>Expert-level scientific reasoning across biology, physics, and chemistry</td><td>Active</td></tr><tr><td><strong>ARC-AGI-2</strong></td><td>Abstract pattern recognition and reasoning from first principles rather than memorized knowledge</td><td>Active</td></tr><tr><td><strong>Humanity's Last Exam</strong></td><td>Extremely difficult expert-written questions across a wide range of academic domains</td><td>Active</td></tr><tr><td><strong>GSM8K</strong></td><td>Basic multi-step math word problems at grade-school level</td><td>Saturated</td></tr><tr><td><strong>MATH</strong></td><td>Competition-level mathematics that requires structured reasoning and free-form answers</td><td>Active</td></tr><tr><td><strong>AIME 2025</strong></td><td>Olympiad-level mathematical problem solving with deep multi-step reasoning</td><td>Active</td></tr><tr><td><strong>HumanEval</strong></td><td>Python function generation from natural language prompts, scored through unit-test correctness</td><td>Saturated</td></tr><tr><td><strong>HumanEval+</strong></td><td>A stricter extension of HumanEval with stronger test coverage and more edge cases</td><td>Active</td></tr><tr><td><strong>LiveCodeBench</strong></td><td>Code generation on fresh competitive programming problems updated regularly to reduce contamination risk</td><td>Active</td></tr><tr><td><strong>SWE-bench Verified</strong></td><td>Real software engineering through verified issue resolution across full codebases</td><td>Gold standard</td></tr><tr><td><strong>SWE-bench Pro</strong></td><td>Repository-level software engineering evaluation with broader language support</td><td>Emerging</td></tr><tr><td><strong>IFEval</strong></td><td>How accurately a model follows specific, verifiable instructions with constrained output requirements</td><td>Active</td></tr><tr><td><strong>BFCL v4</strong></td><td>Tool use and function calling across serial, parallel, multi-turn, and agentic workflows</td><td>Widely used</td></tr><tr><td><strong>RULER</strong></td><td>Long-context retrieval, tracking, and synthesis across extended documents</td><td>Active</td></tr><tr><td><strong>MMMU Pro</strong></td><td>Multimodal reasoning across academic subjects using both text and visual inputs</td><td>Active</td></tr><tr><td><strong>TruthfulQA</strong></td><td>Factual reliability and resistance to common misconceptions and hallucination-prone prompts</td><td>Contaminated</td></tr><tr><td><strong>HELM</strong></td><td>Multi-dimensional evaluation across accuracy, calibration, robustness, fairness, bias, and efficiency</td><td>Framework</td></tr><tr><td><strong>Chatbot Arena (LMSYS)</strong></td><td>Human preference in open-ended conversations through blind side-by-side model comparisons</td><td>Widely used</td></tr></tbody></table>

* Active = still useful for separating model performance today
* Saturated = top models score too closely for strong differentiation
* Emerging = newer benchmark with growing adoption
* Gold standard = strongest reference point in its category
* Widely used = commonly referenced in practice
* Contaminated = results may be less reliable due to training overlap
* Framework = better for broad evaluation than direct ranking

{% endtab %}
{% endtabs %}
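
The status labels in the Full Benchmark List can be used as a simple filter when deciding which scores to compare. The sketch below copies a subset of the labels from that table; the `WEAK_FOR_RANKING` set and the `usable_for_ranking` helper are assumptions made for illustration, not a rule stated by any benchmark maintainer.

```python
# Status labels copied from a subset of the Full Benchmark List.
BENCHMARK_STATUS = {
    "MMLU": "Saturated",
    "MMLU-Pro": "Active",
    "GPQA Diamond": "Active",
    "GSM8K": "Saturated",
    "HumanEval": "Saturated",
    "LiveCodeBench": "Active",
    "SWE-bench Verified": "Gold standard",
    "SWE-bench Pro": "Emerging",
    "TruthfulQA": "Contaminated",
    "HELM": "Framework",
}

# Assumption: these statuses are treated as weak signals for ranking models.
WEAK_FOR_RANKING = {"Saturated", "Contaminated", "Framework"}


def usable_for_ranking(status_by_benchmark):
    """Return benchmark names whose status is not in WEAK_FOR_RANKING."""
    return [
        name
        for name, status in status_by_benchmark.items()
        if status not in WEAK_FOR_RANKING
    ]


print(usable_for_ranking(BENCHMARK_STATUS))
```

Whether "Emerging" or "Widely used" benchmarks belong in the shortlist is a judgment call; adjust the set to match how much weight you want to give newer or preference-based results.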

### Frontier Model Benchmark Snapshot (May 2026)

* A directional comparison of leading models across publicly reported benchmarks. A dash (-) indicates that a directly comparable public value was not confirmed in the source set used here.

| Model                        | GPQA Diamond | SWE-bench Verified | ARC-AGI-2 | Humanity's Last Exam (HLE)                |
| ---------------------------- | ------------ | ------------------ | --------- | ----------------------------------------- |
| **GPT-5.4**                  | 92%          | -                  | 73.3%     | <p>39.8% no tools<br>52.1% with tools</p> |
| **GPT-5.3-Codex**            | 83.7%        | -                  | -         | -                                         |
| **GPT-5.2**                  | 71.2%        | 72.8%              | 52.9%     | <p>34.5% no tools<br>45.5% with tools</p> |
| **Claude Opus 4.6**          | 84.0%        | 75.6%              | 68.8%     | <p>40.0% no tools<br>53.0% with tools</p> |
| **Claude Sonnet 4.6**        | 79.9%        | -                  | 58.3%     | <p>33.2% no tools<br>49.0% with tools</p> |
| **Claude Opus 4.5**          | 86.6%        | 76.8%              | -         | -                                         |
| **Claude Sonnet 4.5**        | 83.4%        | 71.4%              | -         | -                                         |
| **Claude Haiku 4.5**         | 64.6%        | 66.6%              | -         | -                                         |
| **Gemini 3.1 Pro (Preview)** | 94.1%        | 80.6%              | 77.1%     | <p>44.4% no tools<br>51.4% with tools</p> |
| **Gemini 3 Flash**           | 89.8%        | 75.8%              | -         | 33.7%                                     |
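
Because several cells in the snapshot have no confirmed value, any ranking should skip those models instead of treating a missing score as zero. The following is a minimal sketch of that idea using a few rows copied from the table; the `SNAPSHOT` structure and the `rank_by` helper are hypothetical names chosen for this example.

```python
from typing import Dict, List, Optional, Tuple

# Scores (percent) copied from the snapshot table; None marks a "-" cell.
SNAPSHOT: Dict[str, Dict[str, Optional[float]]] = {
    "GPT-5.4": {"GPQA Diamond": 92.0, "SWE-bench Verified": None},
    "GPT-5.2": {"GPQA Diamond": 71.2, "SWE-bench Verified": 72.8},
    "Claude Opus 4.6": {"GPQA Diamond": 84.0, "SWE-bench Verified": 75.6},
    "Claude Opus 4.5": {"GPQA Diamond": 86.6, "SWE-bench Verified": 76.8},
    "Gemini 3.1 Pro (Preview)": {"GPQA Diamond": 94.1, "SWE-bench Verified": 80.6},
    "Gemini 3 Flash": {"GPQA Diamond": 89.8, "SWE-bench Verified": 75.8},
}


def rank_by(benchmark: str) -> List[Tuple[str, float]]:
    """Rank models by one benchmark, skipping models with no reported score."""
    scored = [
        (model, scores[benchmark])
        for model, scores in SNAPSHOT.items()
        if scores.get(benchmark) is not None
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for model, score in rank_by("SWE-bench Verified"):
    print(f"{model}: {score}%")
```

With this subset, GPT-5.4 appears in the GPQA Diamond ranking but is left out of the SWE-bench Verified ranking, which keeps the comparison limited to values that were actually reported.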

### Benchmarks by Use Case

* Different tasks require different evaluation signals. This table highlights the benchmarks that are most relevant for common LLM use cases, so you can focus on the scores that best match the task at hand.

<table><thead><tr><th width="249">Use Case</th><th>Primary Benchmarks</th><th>Additional References</th></tr></thead><tbody><tr><td>General Knowledge and Q&#x26;A</td><td>MMLU-Pro, Chatbot Arena (LMSYS)</td><td>MMLU</td></tr><tr><td>Code Generation</td><td>SWE-bench Verified, LiveCodeBench, SWE-bench Pro</td><td>HumanEval+, BFCL v4</td></tr><tr><td>Mathematical Reasoning</td><td>AIME 2025, MATH</td><td>GSM8K</td></tr><tr><td>Scientific Reasoning</td><td>GPQA Diamond</td><td>Humanity's Last Exam</td></tr><tr><td>Creative Writing</td><td>Chatbot Arena Creative Writing</td><td>-</td></tr><tr><td>Instruction Following</td><td>IFEval</td><td>Chatbot Arena (LMSYS)</td></tr><tr><td>Tool Use and Function Calling</td><td>BFCL v4</td><td>-</td></tr><tr><td>Long-Context Understanding</td><td>RULER, Needle-in-a-Haystack</td><td>LongGenBench</td></tr><tr><td>Multimodal and Vision</td><td>MMMU Pro, Arena Vision</td><td>MMMU</td></tr><tr><td>Multilingual Tasks</td><td>MMMLU</td><td>MLNeedle</td></tr><tr><td>Agentic Workflows</td><td>SWE-bench, BFCL v4</td><td>WebArena, OSWorld</td></tr><tr><td>Safety and Factual Reliability</td><td>HalluLens, SimpleQA</td><td>-</td></tr></tbody></table>
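
The use-case table can also be treated as a lookup: given the tasks you care about, collect the primary benchmarks worth checking. This sketch copies a subset of rows from the table; `PRIMARY_BENCHMARKS` and `benchmarks_for` are hypothetical names used only for illustration.

```python
# Primary benchmarks per use case, copied from a subset of the table.
PRIMARY_BENCHMARKS = {
    "Code Generation": ["SWE-bench Verified", "LiveCodeBench", "SWE-bench Pro"],
    "Mathematical Reasoning": ["AIME 2025", "MATH"],
    "Scientific Reasoning": ["GPQA Diamond"],
    "Instruction Following": ["IFEval"],
    "Tool Use and Function Calling": ["BFCL v4"],
    "Agentic Workflows": ["SWE-bench", "BFCL v4"],
}


def benchmarks_for(use_cases):
    """Return the de-duplicated primary benchmarks for the given use cases.

    Raises KeyError if a use case is not present in PRIMARY_BENCHMARKS.
    """
    selected = []
    for use_case in use_cases:
        for name in PRIMARY_BENCHMARKS[use_case]:
            if name not in selected:
                selected.append(name)
    return selected


print(benchmarks_for(["Tool Use and Function Calling", "Agentic Workflows"]))
```

Benchmarks shared by several use cases, such as BFCL v4 here, appear only once in the result, so the output is a compact list of scores to look up.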


