Hallucination Index

How often does each model make things up? This index combines scores from SimpleQA, TruthfulQA, and FreshQA into a single number. Higher is better: a higher score means more factual answers and fewer hallucinations.

Model              Developer  SimpleQA  TruthfulQA  FreshQA  Index  Grade
Claude Opus 4      Anthropic  41.2%     79.6%       66.3%    62.4   A
o1                 OpenAI     42.4%     74.1%       65.8%    60.8   A
Gemini 2.0 Pro     Google     35.6%     73.8%       71.4%    60.3   A
Claude Sonnet 4    Anthropic  36.8%     78.2%       63.1%    59.4   B
o3-mini            OpenAI     40.1%     72.8%       63.4%    58.8   B
GPT-4o             OpenAI     38.2%     71.4%       62.0%    57.2   B
Claude 3.5 Sonnet  Anthropic  33.1%     76.4%       59.8%    56.4   B
Gemini 1.5 Pro     Google     29.4%     70.6%       68.2%    56.1   B
Grok 3             xAI        31.8%     68.4%       64.2%    54.8   B
GPT-4 Turbo        OpenAI     34.5%     69.8%       58.3%    54.2   B
Claude 3 Opus      Anthropic  28.9%     73.2%       55.1%    52.4   B
DeepSeek R1        DeepSeek   30.2%     67.6%       53.6%    50.5   B
o1-mini            OpenAI     28.6%     68.3%       54.2%    50.4   B
Gemini 2.0 Flash   Google     22.1%     66.4%       60.8%    49.8   C
Mistral Large 2    Mistral    25.4%     66.8%       52.3%    48.2   C
GPT-4o Mini        OpenAI     24.8%     65.2%       51.7%    47.2   C
DeepSeek V3        DeepSeek   24.9%     64.2%       49.8%    46.3   C
Llama 3.1 405B     Meta       23.5%     63.8%       48.2%    45.2   C
Qwen 2.5 72B       Alibaba    20.6%     62.1%       45.8%    42.8   C
Llama 3.1 70B      Meta       18.2%     58.4%       42.6%    39.7   D
Llama 3.1 8B       Meta       9.8%      47.2%       31.4%    29.5   F

[Bar chart: Overall Hallucination Index by model (higher = more factual). The plotted values are the same as the Index column in the table above.]

How to read this data

The three benchmarks measure different things:

  • SimpleQA (OpenAI, 2024): Short factual questions with verifiable answers. Every model in the table scores below 50%, making this the hardest of the three benchmarks and the most direct test of pure factual knowledge.
  • TruthfulQA (Lin et al., 2022): Questions designed to trigger common misconceptions. Tests whether models repeat popular falsehoods vs. giving truthful (sometimes counterintuitive) answers.
  • FreshQA (Vu et al., 2023): Questions about recent events and time-sensitive facts. Tests whether models know their knowledge is stale and say so honestly.

The Overall Index is a simple unweighted average of the three scores, rounded to one decimal place. It's imperfect (each benchmark measures a different aspect of factuality and honesty), but it gives a useful single number for comparison.
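
As a rough sketch of that arithmetic, the snippet below recomputes the index for one row of the table. The function names are made up for illustration. The grade cutoffs (60/50/40/30) are an assumption: the page never states how grades are assigned, and these values are only inferred from where the letters change in the table.

```python
def overall_index(simpleqa: float, truthfulqa: float, freshqa: float) -> float:
    """Unweighted mean of the three benchmark scores (in percent), to one decimal."""
    return round((simpleqa + truthfulqa + freshqa) / 3, 1)


# ASSUMPTION: cutoffs guessed from the table; the source does not define them.
ASSUMED_GRADE_CUTOFFS = [(60.0, "A"), (50.0, "B"), (40.0, "C"), (30.0, "D")]


def assumed_grade(index: float) -> str:
    """Map an index value to a letter grade using the assumed cutoffs above."""
    for cutoff, letter in ASSUMED_GRADE_CUTOFFS:
        if index >= cutoff:
            return letter
    return "F"


# Claude Opus 4 row: SimpleQA 41.2, TruthfulQA 79.6, FreshQA 66.3
index = overall_index(41.2, 79.6, 66.3)
print(index, assumed_grade(index))  # 62.4 A
```

Because the mean is unweighted, a 3-point gain on SimpleQA moves the index exactly as much as a 3-point gain on TruthfulQA, even though SimpleQA scores sit far lower overall.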

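What surprised me most: even the best models barely crack 40% on SimpleQA. We're nowhere near "reliable factual AI." Two of the three models above 40% are OpenAI reasoning models (o1 at 42.4%, o3-mini at 40.1%), possibly because they can cross-check their own answers during the reasoning chain. That said, reasoning alone doesn't guarantee it: o1-mini (28.6%) and DeepSeek R1 (30.2%) land mid-table. And 42% correct on basic factual questions is still... not great.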