Question 1

Which AI model hallucinates the least?

Accepted Answer

Averaged across the SimpleQA, TruthfulQA, and FreshQA benchmarks, Claude Opus 4 (index 62.4) and o1 (index 60.8) currently score highest on factual accuracy. But 'least hallucination' depends on the task: o1 leads on SimpleQA, Claude Opus 4 on TruthfulQA, and Gemini 2.0 Pro on FreshQA. No model is reliable enough for factual claims to go unverified.

Question 2

What is SimpleQA?

Accepted Answer

SimpleQA is a benchmark created by OpenAI in November 2024 to measure factuality. It consists of short, fact-seeking questions, each with a single verifiable answer. Every model in the table below scores under 50% (the best, o1, reaches 42.4%), showing that even top models struggle with short factual questions.

Question 3

How is the Hallucination Index calculated?

Accepted Answer

The dataku Hallucination Index is the unweighted average of three benchmark scores, each expressed as a percentage: SimpleQA (factual accuracy on verifiable questions), TruthfulQA (resistance to common misconceptions), and FreshQA (accuracy on time-sensitive questions). For example, Claude Opus 4 scores (41.2 + 79.6 + 66.3) / 3 ≈ 62.4. A higher index means more factual output.

Model	Developer	SimpleQA	TruthfulQA	FreshQA	Index	Grade
Claude Opus 4	Anthropic	41.2%	79.6%	66.3%	62.4	A
o1	OpenAI	42.4%	74.1%	65.8%	60.8	A
Gemini 2.0 Pro	Google	35.6%	73.8%	71.4%	60.3	A
Claude Sonnet 4	Anthropic	36.8%	78.2%	63.1%	59.4	B
o3-mini	OpenAI	40.1%	72.8%	63.4%	58.8	B
GPT-4o	OpenAI	38.2%	71.4%	62.0%	57.2	B
Claude 3.5 Sonnet	Anthropic	33.1%	76.4%	59.8%	56.4	B
Gemini 1.5 Pro	Google	29.4%	70.6%	68.2%	56.1	B
Grok 3	xAI	31.8%	68.4%	64.2%	54.8	B
GPT-4 Turbo	OpenAI	34.5%	69.8%	58.3%	54.2	B
Claude 3 Opus	Anthropic	28.9%	73.2%	55.1%	52.4	B
DeepSeek R1	DeepSeek	30.2%	67.6%	53.6%	50.5	B
o1-mini	OpenAI	28.6%	68.3%	54.2%	50.4	B
Gemini 2.0 Flash	Google	22.1%	66.4%	60.8%	49.8	C
Mistral Large 2	Mistral	25.4%	66.8%	52.3%	48.2	C
GPT-4o Mini	OpenAI	24.8%	65.2%	51.7%	47.2	C
DeepSeek V3	DeepSeek	24.9%	64.2%	49.8%	46.3	C
Llama 3.1 405B	Meta	23.5%	63.8%	48.2%	45.2	C
Qwen 2.5 72B	Alibaba	20.6%	62.1%	45.8%	42.8	C
Llama 3.1 70B	Meta	18.2%	58.4%	42.6%	39.7	D
Llama 3.1 8B	Meta	9.8%	47.2%	31.4%	29.5	F
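
The Index column can be reproduced from the three benchmark columns. The following is a minimal sketch, assuming the index is the plain mean of the three percentages rounded to one decimal place (the rounding is inferred from how the table displays values, not stated by the source). The function name `hallucination_index` and the `SCORES` dictionary are hypothetical names chosen for illustration.

```python
from statistics import mean

# Benchmark scores (percent) copied from three rows of the table:
# (SimpleQA, TruthfulQA, FreshQA)
SCORES = {
    "Claude Opus 4": (41.2, 79.6, 66.3),
    "GPT-4o": (38.2, 71.4, 62.0),
    "Llama 3.1 8B": (9.8, 47.2, 31.4),
}


def hallucination_index(simpleqa, truthfulqa, freshqa):
    """Unweighted mean of the three benchmark percentages.

    Rounding to one decimal is an assumption based on the table's display.
    """
    return round(mean((simpleqa, truthfulqa, freshqa)), 1)


# Print the recomputed index for each sample row.
for model, scores in SCORES.items():
    print(f"{model}: {hallucination_index(*scores)}")
```

Because each benchmark carries equal weight, a model that is strong on one benchmark and weak on another can land on the same index as a model that is average on all three, so the per-benchmark columns are worth reading alongside the index.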

Figure: Overall Hallucination Index (higher = more factual)

How to read this data