Data Stories · May 20, 2024 · 6 min read

The 'vibe check' era: why benchmarks are losing to vibes

I asked 50 AI developers how they evaluate models. 74% said 'I just try it and see how it feels.' Only 12% run their own evaluations. The industry is moving from data-driven evaluation to... vibes. I have mixed feelings about this.

This one hurts to write.

I'm a data person. I've spent three years building benchmark tables, running controlled evaluations, and tracking scores across models. My entire identity as a writer is "the person who shows you the numbers."

And the industry is moving to vibes.

The survey

I surveyed 50 AI developers (mix of indie developers, startup engineers, and big tech engineers) with one question: "How do you decide which LLM to use for a new project?"

| Evaluation method | Respondents | Percentage |
|------------------|-------------|-----------|
| "I try it and see how it feels" | 37 | 74% |
| "I check LMSYS Chatbot Arena Elo ratings" | 18 | 36% |
| "I ask friends/Twitter what they use" | 16 | 32% |
| "I read benchmark comparisons" | 14 | 28% |
| "I run my own evaluations" | 6 | 12% |
| "I use Artificial Analysis or similar" | 4 | 8% |

Source: My survey, 50 AI developers, April-May 2024. Multiple answers allowed.

74% go with their gut. Twelve percent run their own evaluations. I am in the 12%.

The "try it and see" approach isn't lazy. These developers are sophisticated people building real products. They've looked at benchmarks. They know the numbers exist. They just don't think the numbers predict how a model will perform on their specific task.

And... they might be right? That's the part that hurts.

Why benchmarks are losing trust

I asked the 37 "vibe check" respondents why they don't rely on benchmarks. Most gave more than one reason, and the answers clustered into four themes:

| Reason | Mentioned by | Example quote |
|--------|-------------|--------------|
| Benchmarks don't match my use case | 28/37 (76%) | "MMLU doesn't tell me if the model writes good marketing copy" |
| Contamination concerns | 19/37 (51%) | "Half these models trained on the test set" |
| Too many benchmarks, no clear signal | 15/37 (41%) | "Model A wins on MMLU, Model B wins on HumanEval, what do I do?" |
| Scores too close to differentiate | 12/37 (32%) | "86% vs 88% on MMLU doesn't mean anything to me" |

Source: My survey follow-up interviews, May 2024.

I can't argue with any of these.

Use case mismatch is the biggest one. I write about MMLU scores all the time. But MMLU measures academic knowledge across 57 subjects. If you're building a customer support bot, what does a score on "college medicine" tell you? Almost nothing.

Contamination is real. I've written about this. Multiple models have been trained on benchmark test data, intentionally or not. When the test set leaks into the training set, the score measures memorization, not capability.
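For a feel of what a leak check can look like, here is a minimal sketch that flags test items sharing a long word sequence with training text. The function names, the 8-word window, and the toy strings are all my assumptions for illustration; real contamination audits work on far larger corpora and usually normalize text more carefully.

```python
def ngrams(text, n):
    """Set of word-level n-grams in text, lowercased (no punctuation cleanup)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated_items(test_items, training_docs, n=8):
    """Return the test items that share at least one n-gram with any training doc."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [item for item in test_items if ngrams(item, n) & train_grams]

# Toy, made-up data: the first test question appears inside the training text.
training_docs = [
    "forum dump: which planet in our solar system has the most moons answer saturn",
]
test_items = [
    "Which planet in our solar system has the most moons?",
    "What is the boiling point of water at sea level in kelvin?",
]
flagged = contaminated_items(test_items, training_docs)
print(flagged)
```

Only the first question is flagged, because eight consecutive words of it appear in the training text. A model that saw that text can answer from memory, which is exactly the memorization-versus-capability problem.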

Too many benchmarks is a design problem. Here are the benchmarks I track for any given model: MMLU, HumanEval, GSM8K, MATH, GPQA, ARC, WinoGrande, HellaSwag, TruthfulQA, DROP, BIG-Bench-Hard, SWE-bench, LMSYS Elo. That's 13 numbers. Different models win on different benchmarks. There's no single score to compare.

Scores too close is the frontier model problem. When GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro are all within 2-4 points on MMLU, the difference is within noise. You can't meaningfully distinguish them on a benchmark where the measurement error is larger than the gap.
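To make "within noise" concrete, here is a rough sketch that compares an accuracy gap with its approximate 95% sampling margin. It assumes each score is an independent binomial proportion, and the 500-question subset is a hypothetical size I picked for illustration, not the size of any real benchmark run.

```python
import math

def gap_vs_sampling_noise(acc_a, acc_b, n_questions, z=1.96):
    """Compare an accuracy gap with its ~95% sampling margin.

    Treats each model's score as an independent binomial proportion over
    n_questions items. Returns (gap, margin); the gap is distinguishable
    from sampling noise only when gap > margin.
    """
    var_a = acc_a * (1 - acc_a) / n_questions
    var_b = acc_b * (1 - acc_b) / n_questions
    margin = z * math.sqrt(var_a + var_b)
    return abs(acc_a - acc_b), margin

# Hypothetical: 86% vs 88% measured on a 500-question subset.
gap, margin = gap_vs_sampling_noise(0.86, 0.88, n_questions=500)
print(f"gap={gap:.3f}, margin={margin:.3f}, distinguishable={gap > margin}")
```

At this sample size the 2-point gap is smaller than the margin, so the two scores can't be told apart. The margin shrinks with the square root of the question count, and this sketch models sampling error only. It ignores run-to-run variation from prompt format or decoding settings.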

The case FOR vibes (I'm trying to be fair)

Let me steel-man the vibe check approach:

| Benchmark evaluation | Vibe check evaluation |
|---------------------|----------------------|
| Tests a fixed set of academic tasks | Tests your actual task |
| Aggregated across thousands of questions | Focused on your specific use case |
| Objective but possibly contaminated | Subjective but real |
| Measures "average" capability | Measures "specific" capability |
| Done once per model release | Done continuously as models update |

When a developer sends 20 of their real prompts to Claude and GPT-4o and picks whichever "feels better," they're running a highly relevant but statistically underpowered evaluation. It's the right test with too few samples.

When I run 300 prompts through a formal evaluation, I have statistical power but questionable relevance. I'm measuring the right quantity of the wrong thing.

The ideal is both: run your real prompts at benchmark scale. But nobody has time for that except me, apparently.
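Here is a small sketch of why 20 prompts is underpowered, using a two-sided exact sign test on head-to-head wins. The win counts are hypothetical. I'm assuming ties are dropped and that each prompt is an independent comparison.

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact binomial (sign) test for a head-to-head comparison.

    Ties are assumed to be dropped beforehand. Under the null hypothesis
    each model is equally likely to win any given prompt.
    """
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical results: the same 65% win rate at two sample sizes.
p_small = sign_test_p_value(13, 7)     # 20 prompts
p_large = sign_test_p_value(195, 105)  # 300 prompts
print(f"20 prompts:  p = {p_small:.3f}")
print(f"300 prompts: p = {p_large:.2g}")
```

A 13-to-7 split gives a p-value of about 0.26, which is entirely consistent with two equally good models. The same 65% win rate over 300 prompts is very unlikely to be chance. Same preference, very different evidence.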

The Chatbot Arena middle ground

LMSYS Chatbot Arena is interesting because it's kind of both. Real users send real prompts and vote on which response they prefer. Crowd-sourced vibes with statistical rigor.
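For intuition about how pairwise votes turn into a rating, here is a sketch of the textbook Elo update. It is not a description of Chatbot Arena's actual pipeline. The K-factor of 32, the starting rating of 1000, and the model names are conventional or made-up values I chose for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings, model_a, model_b, winner, k=32, initial=1000.0):
    """Apply one pairwise vote to a dict of ratings, in place.

    winner is "a", "b", or "tie".
    """
    ra = ratings.setdefault(model_a, initial)
    rb = ratings.setdefault(model_b, initial)
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    exp_a = expected_score(ra, rb)
    # Whatever A gains, B loses: the total rating is conserved.
    ratings[model_a] = ra + k * (score_a - exp_a)
    ratings[model_b] = rb - k * (score_a - exp_a)

# Hypothetical votes between two placeholder models.
votes = [
    ("model-x", "model-y", "a"),
    ("model-x", "model-y", "a"),
    ("model-x", "model-y", "b"),
]
ratings = {}
for model_a, model_b, winner in votes:
    update_elo(ratings, model_a, model_b, winner)
print(ratings)
```

Each vote nudges the winner up and the loser down by the same amount. An upset moves the ratings more than an expected result does. With thousands of votes, those nudges settle into a stable ranking, which is where the statistical rigor comes from.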

| Evaluation method | Statistical power | Task relevance | Contamination risk |
|------------------|------------------|---------------|-------------------|
| Academic benchmarks (MMLU, etc.) | High | Low-medium | High |
| My 300-prompt eval | Medium | Medium | Low |
| Chatbot Arena (Elo) | High | High (crowdsourced) | Low |
| Individual vibe check | Low | Very high | None |

This is probably why Chatbot Arena has become one of the most-cited evaluations in AI. It combines real tasks with statistical rigor. The 36% of developers who check LMSYS ratings are getting something close to the best of both worlds.

But Chatbot Arena has its own biases: it over-represents English-speaking tech workers, it rewards verbose responses (longer often wins), and it can't evaluate specialized tasks (medical, legal, domain-specific).

Where I've landed

I'm going to keep running benchmarks. I'm going to keep building my tables and tracking my spreadsheet. But I'm going to be more honest about what the numbers do and don't tell us.

What benchmarks are good for:

  • Detecting capability jumps between model generations
  • Comparing models within the same family (Llama 3 8B vs 70B)
  • Identifying specific weaknesses (bad at math, good at code)
  • Historical tracking of progress over time

What benchmarks are bad for:

  • Predicting which model is "best" for your specific task
  • Distinguishing between frontier models that are within 2-3 points
  • Measuring real-world generation quality (tone, style, helpfulness)
  • Evaluating models that have been trained on the test data

My data analyst heart wants benchmarks to be the answer. They're not. They're one input alongside vibes, Chatbot Arena, and actual usage data.

If 74% of the developers I surveyed have moved to vibes, the benchmark community (including me) needs to ask why. The answer isn't that developers are wrong. It's that our benchmarks haven't kept up with what developers actually need to measure.

I'll still show you the numbers. But I'll be more careful about what I claim they mean.


-- dataku
