Hallucination Index
How often does each model make things up? Scores from SimpleQA, TruthfulQA, and FreshQA, combined into one index. Higher is better: a higher score means more factual answers and fewer hallucinations.
| Model | Vendor | SimpleQA | TruthfulQA | FreshQA | Index | Grade |
|---|---|---|---|---|---|---|
| Claude Opus 4 | Anthropic | 41.2% | 79.6% | 66.3% | 62.4 | A |
| o1 | OpenAI | 42.4% | 74.1% | 65.8% | 60.8 | A |
| Gemini 2.0 Pro | Google | 35.6% | 73.8% | 71.4% | 60.3 | A |
| Claude Sonnet 4 | Anthropic | 36.8% | 78.2% | 63.1% | 59.4 | B |
| o3-mini | OpenAI | 40.1% | 72.8% | 63.4% | 58.8 | B |
| GPT-4o | OpenAI | 38.2% | 71.4% | 62.0% | 57.2 | B |
| Claude 3.5 Sonnet | Anthropic | 33.1% | 76.4% | 59.8% | 56.4 | B |
| Gemini 1.5 Pro | Google | 29.4% | 70.6% | 68.2% | 56.1 | B |
| Grok 3 | xAI | 31.8% | 68.4% | 64.2% | 54.8 | B |
| GPT-4 Turbo | OpenAI | 34.5% | 69.8% | 58.3% | 54.2 | B |
| Claude 3 Opus | Anthropic | 28.9% | 73.2% | 55.1% | 52.4 | B |
| DeepSeek R1 | DeepSeek | 30.2% | 67.6% | 53.6% | 50.5 | B |
| o1-mini | OpenAI | 28.6% | 68.3% | 54.2% | 50.4 | B |
| Gemini 2.0 Flash | Google | 22.1% | 66.4% | 60.8% | 49.8 | C |
| Mistral Large 2 | Mistral | 25.4% | 66.8% | 52.3% | 48.2 | C |
| GPT-4o Mini | OpenAI | 24.8% | 65.2% | 51.7% | 47.2 | C |
| DeepSeek V3 | DeepSeek | 24.9% | 64.2% | 49.8% | 46.3 | C |
| Llama 3.1 405B | Meta | 23.5% | 63.8% | 48.2% | 45.2 | C |
| Qwen 2.5 72B | Alibaba | 20.6% | 62.1% | 45.8% | 42.8 | C |
| Llama 3.1 70B | Meta | 18.2% | 58.4% | 42.6% | 39.7 | D |
| Llama 3.1 8B | Meta | 9.8% | 47.2% | 31.4% | 29.5 | F |
[Bar chart: Overall Hallucination Index by model (higher = more factual). The plotted values are the Index column of the table above.]
How to read this data
The three benchmarks measure different things:
- SimpleQA (OpenAI, 2024): Short factual questions with verifiable answers. Every model in the table scores below 50%, which makes this the hardest of the three benchmarks and the purest test of factual knowledge.
- TruthfulQA (Lin et al., 2022): Questions designed to trigger common misconceptions. Tests whether models repeat popular falsehoods vs. giving truthful (sometimes counterintuitive) answers.
- FreshQA (Vu et al., 2023): Questions about recent events and time-sensitive facts. Tests whether models know their knowledge is stale and say so honestly.
The Overall Index is a simple average of all three scores. It's imperfect (each benchmark measures different aspects of honesty), but it gives a useful single number for comparison.
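As a quick check of that averaging, here is a minimal Python sketch. The function names `overall_index` and `grade` are made up for illustration, and the letter-grade cutoffs (60/50/40/30) are an assumption inferred from the table rows, not something the page states.

```python
def overall_index(simpleqa: float, truthfulqa: float, freshqa: float) -> float:
    """Unweighted mean of the three benchmark scores (in percent), to one decimal."""
    return round((simpleqa + truthfulqa + freshqa) / 3, 1)


def grade(index: float) -> str:
    """Map an index to a letter grade.

    ASSUMPTION: the cutoffs below are guessed from the published rows
    (e.g. 60.3 -> A, 59.4 -> B, 39.7 -> D, 29.5 -> F); the real rule may differ.
    """
    for cutoff, letter in ((60, "A"), (50, "B"), (40, "C"), (30, "D")):
        if index >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    # Rows copied from the table: (model, SimpleQA, TruthfulQA, FreshQA)
    rows = [
        ("Claude Opus 4", 41.2, 79.6, 66.3),
        ("GPT-4o", 38.2, 71.4, 62.0),
        ("Llama 3.1 8B", 9.8, 47.2, 31.4),
    ]
    for model, sqa, tqa, fqa in rows:
        idx = overall_index(sqa, tqa, fqa)
        print(f"{model}: index={idx}, grade={grade(idx)}")
```

Because the mean is unweighted, TruthfulQA (where scores run highest) contributes the most points to every model's index, while SimpleQA differences of a few points move the index by only about a third of that.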
What surprised me most: even the best models barely crack 40% on SimpleQA, and o1 tops the column at just 42.4%. We're nowhere near "reliable factual AI." The reasoning models (o1, o3-mini) are two of only three models above 40%, possibly because they can cross-check their own answers during the reasoning chain. But 42% correct on basic factual questions is still... not great.