Vision model benchmarks: who can actually read a chart?

Multimodal models can "see" images. But can they actually read a chart? Extract the right numbers? Understand what the axes mean?

I tested 8 models on 50 real-world images of charts, tables, diagrams, and handwritten notes. The results vary more than I expected.

The test set

| Visual type | Count | Examples |
|------------|-------|---------|
| Bar charts | 12 | Revenue charts, survey results |
| Line charts | 10 | Time series, trend data |
| Tables (image format) | 10 | Financial tables, spec sheets |
| Pie charts | 5 | Market share, budget allocation |
| Scatter plots | 5 | Correlation data |
| Diagrams/flowcharts | 5 | Architecture diagrams, process flows |
| Handwritten notes | 3 | Whiteboard photos, hand-drawn charts |

Each image came with 3 questions: a factual extraction ("What is the value for Q3?"), an interpretation ("Which category grew fastest?"), and a comparison ("Is X bigger than Y?").
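To make the scoring concrete, here is a minimal sketch of how graded answers could be rolled up into per-question-type and overall accuracy. The `GradedAnswer` class, the `accuracy_by_type` function, the image identifiers, and the grades are all hypothetical illustrations, not the harness actually used for this post.

```python
from collections import defaultdict
from dataclasses import dataclass

QUESTION_TYPES = ("factual", "interpretation", "comparison")


@dataclass(frozen=True)
class GradedAnswer:
    image_id: str       # hypothetical identifier, e.g. "bar_01"
    question_type: str  # one of QUESTION_TYPES
    correct: bool       # outcome of grading the model's answer


def accuracy_by_type(answers):
    """Return {question_type: percent correct} plus an "overall" entry."""
    if not answers:
        raise ValueError("no graded answers to score")
    totals = defaultdict(int)
    hits = defaultdict(int)
    for a in answers:
        if a.question_type not in QUESTION_TYPES:
            raise ValueError(f"unknown question type: {a.question_type}")
        totals[a.question_type] += 1
        hits[a.question_type] += a.correct  # True counts as 1
    result = {t: 100 * hits[t] / totals[t] for t in totals}
    result["overall"] = 100 * sum(hits.values()) / sum(totals.values())
    return result


# Toy input: two images, three questions each, with made-up grades.
answers = [
    GradedAnswer("bar_01", "factual", True),
    GradedAnswer("bar_01", "interpretation", True),
    GradedAnswer("bar_01", "comparison", False),
    GradedAnswer("line_01", "factual", True),
    GradedAnswer("line_01", "interpretation", False),
    GradedAnswer("line_01", "comparison", True),
]
scores = accuracy_by_type(answers)
```

With the full set, each model would contribute 150 such records (50 images, 3 questions each).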

Overall accuracy

| Model | Factual extraction | Interpretation | Comparison | Overall |
|-------|-------------------|---------------|-----------|---------|
| Claude Opus 4 | 92% | 88% | 87% | 89% |
| Gemini 2.5 Pro | 88% | 84% | 83% | 85% |
| GPT-4o | 85% | 80% | 81% | 82% |
| Claude 4 Sonnet | 86% | 82% | 79% | 82% |
| Gemini 2.0 Flash | 80% | 76% | 75% | 77% |
| GPT-4o mini | 72% | 68% | 70% | 70% |
| Llama 4 Maverick | 68% | 64% | 65% | 66% |
| Qwen3 VL | 74% | 70% | 71% | 72% |

Sources: My evaluation, 50 images x 3 questions = 150 total data points per model. Anthropic, OpenAI, Google.

Claude Opus 4 leads at 89% overall. Its factual extraction accuracy (92%) is the highest. When a chart says "Q3 revenue was $4.7M," Claude reads "$4.7M" correctly 92% of the time.

Gemini 2.5 Pro at 85% is a solid second. GPT-4o and Claude 4 Sonnet tie at 82%.
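The Overall column is consistent with a plain average of the three question-type scores, which is what you would expect when every type has the same number of questions (50 each). The sketch below recomputes it from the table and ranks the models. Treating Overall as an unweighted, rounded mean is an assumption about how the column was produced.

```python
# (factual, interpretation, comparison) accuracy in percent, copied from the table.
scores = {
    "Claude Opus 4": (92, 88, 87),
    "Gemini 2.5 Pro": (88, 84, 83),
    "GPT-4o": (85, 80, 81),
    "Claude 4 Sonnet": (86, 82, 79),
    "Gemini 2.0 Flash": (80, 76, 75),
    "GPT-4o mini": (72, 68, 70),
    "Llama 4 Maverick": (68, 64, 65),
    "Qwen3 VL": (74, 70, 71),
}


def overall(triple):
    """Unweighted mean of the three question-type scores, rounded to a whole percent."""
    return round(sum(triple) / len(triple))


recomputed = {model: overall(triple) for model, triple in scores.items()}

# Highest overall score first; ties keep the table's order.
ranking = sorted(recomputed, key=recomputed.get, reverse=True)

for model in ranking:
    print(f"{model:<18} {recomputed[model]}%")
```

Ranked this way, Qwen3 VL (72%) lands above GPT-4o mini (70%) and Llama 4 Maverick (66%), even though it is listed last in the table.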

Accuracy by visual type

| Visual type | Claude Opus 4 | GPT-4o | Gemini 2.5 Pro |
|------------|--------------|--------|---------------|
| Bar charts | 95% | 88% | 90% |
| Line charts | 91% | 84% | 88% |
| Tables (image) | 93% | 86% | 89% |
| Pie charts | 87% | 80% | 82% |
| Scatter plots | 84% | 76% | 80% |
| Diagrams | 82% | 78% | 81% |
| Handwritten | 72% | 62% | 68% |

All three models are best at bar charts and tables, which have a simple, clear structure. All three struggle with handwritten content: the best score is Claude Opus 4's 72%, and that's mediocre.

Scatter plots are surprisingly hard for AI models. Reading exact values from a scatter plot requires precise spatial reasoning, and 84% from the best model means roughly 1 in 6 readings is wrong.
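As a sanity check, the per-type numbers can be combined into a count-weighted average and compared with the overall scores reported earlier (89%, 82%, and 85%). This sketch assumes every image contributes the same three questions, so each visual type is weighted by its image count from the test set table.

```python
# Number of images per visual type, from the test set table.
counts = {
    "Bar charts": 12,
    "Line charts": 10,
    "Tables (image)": 10,
    "Pie charts": 5,
    "Scatter plots": 5,
    "Diagrams": 5,
    "Handwritten": 3,
}

# Accuracy in percent per visual type, in the same order as `counts`.
per_type = {
    "Claude Opus 4": (95, 91, 93, 87, 84, 82, 72),
    "GPT-4o": (88, 84, 86, 80, 76, 78, 62),
    "Gemini 2.5 Pro": (90, 88, 89, 82, 80, 81, 68),
}


def weighted_overall(accuracies):
    """Average the per-type accuracies, weighting each by its image count."""
    weights = list(counts.values())
    return sum(a * w for a, w in zip(accuracies, weights)) / sum(weights)


weighted = {model: weighted_overall(acc) for model, acc in per_type.items()}

for model, value in weighted.items():
    print(f"{model:<15} {value:.2f}%")
```

The weighted averages round to the same overall figures, so the two tables agree with each other.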

Common failure modes

| Failure type | Frequency | Example |
|-------------|----------|---------|
| Off-by-one reading | 23% of errors | Reading $4.7M as $4.8M from a bar chart |
| Axis confusion | 18% of errors | Mixing up X and Y axis labels |
| Small text misread | 15% of errors | Reading "2023" as "2025" in axis labels |
| Legend misattribution | 14% of errors | Attributing the wrong color to a data series |
| Hallucinated value | 12% of errors | Reporting a number that doesn't exist in the chart |
| Handwriting error | 10% of errors | Misreading handwritten characters |
| Other | 8% of errors | Various |

"Off-by-one reading" is the most common error: reading an adjacent value instead of the target value. This happens when bars or data points are close together, and the model's spatial resolution isn't precise enough.

Hallucinated values (12% of errors) are concerning. The model confidently reports a specific number that simply isn't in the chart. This is worse than "I can't read this" because it looks authoritative.
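To put "12% of errors" in absolute terms, the failure shares can be multiplied by a model's total error count. The sketch below does this for a model with 89% overall accuracy on 150 questions. It assumes the failure mix in the table applies unchanged to that single model, which the table itself does not confirm.

```python
# Share of all errors per failure type, from the failure-mode table.
FAILURE_SHARE = {
    "Off-by-one reading": 0.23,
    "Axis confusion": 0.18,
    "Small text misread": 0.15,
    "Legend misattribution": 0.14,
    "Hallucinated value": 0.12,
    "Handwriting error": 0.10,
    "Other": 0.08,
}


def expected_errors(overall_accuracy, n_questions=150):
    """Estimate errors per failure type, assuming the pooled mix holds for this model."""
    total_errors = n_questions * (1 - overall_accuracy)
    return {name: share * total_errors for name, share in FAILURE_SHARE.items()}


estimate = expected_errors(0.89)

for name, count in estimate.items():
    print(f"{name:<22} {count:.1f} of 150 answers")
```

Under that assumption, about 16.5 of the 150 answers are wrong and roughly 2 of them are hallucinated values. That is a small count, but each one is a confident number with nothing behind it.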

The handwriting problem

| Handwriting quality | Claude Opus 4 | GPT-4o | Gemini 2.5 Pro |
|-------------------|--------------|--------|---------------|
| Neat printing | 84% | 74% | 78% |
| Cursive | 62% | 51% | 58% |
| Whiteboard (marker) | 70% | 61% | 68% |

Even the best model drops to 62% on cursive handwriting. If you're photographing whiteboard notes and feeding them to an AI, expect error rates of 30% or more: 30% for Claude Opus 4, 32% for Gemini 2.5 Pro, and 39% for GPT-4o.

Practical recommendation

| Task | Recommended model | Expected accuracy |
|------|-----------------|------------------|
| Reading printed charts | Claude Opus 4 or Gemini 2.5 Pro | About 90% on bar and line charts, lower on pie and scatter |
| Extracting data from tables | Claude Opus 4 | 93% |
| Interpreting complex diagrams | Claude Opus 4 | 82% |
| Handwritten content | None are reliable | <75% |
| High-stakes data extraction | Human verification required | N/A |

For anything financial, medical, or legal: verify AI chart readings against the source data. 89% accuracy sounds high, but an 11% error rate on financial data could mean real-dollar mistakes.
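When the source data is available, that verification step can be automated. The sketch below compares model-extracted values against known values and flags anything that is off or has no counterpart in the source. The `flag_mismatches` function, the 0.5% tolerance, and the sample numbers are all hypothetical.

```python
import math


def flag_mismatches(extracted, source, rel_tol=0.005):
    """Return (label, extracted_value, reason) for every reading that needs review."""
    flagged = []
    for label, value in extracted.items():
        if label not in source:
            # Possible hallucination: the model reported something the source lacks.
            flagged.append((label, value, "not in source data"))
        elif not math.isclose(value, source[label], rel_tol=rel_tol):
            flagged.append((label, value, f"source says {source[label]}"))
    return flagged


# Made-up quarterly revenue in $M: ground truth vs. what a model read off the chart.
source = {"Q1": 3.9, "Q2": 4.2, "Q3": 4.7}
extracted = {"Q1": 3.9, "Q2": 4.2, "Q3": 4.8, "Q4": 5.1}

issues = flag_mismatches(extracted, source)
for label, value, reason in issues:
    print(f"{label}: model read {value} ({reason})")
```

Here the Q3 reading is caught as an off-by-one style error and Q4 as a value with no counterpart in the source. A check like this only works when you already have the underlying numbers; for a chart with no source data, manual spot-checking is still the fallback.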

My personal use: I feed charts to Claude Opus 4 to get a quick reading, then spot-check the critical numbers manually. It saves time, but I don't trust it blindly.

The charts in my own articles, ironically, are now being read by AI models. The circle of data life.


-- dataku
