Question 1

What does MMLU measure?

Accepted Answer

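MMLU (Massive Multitask Language Understanding) tests a model's knowledge across 57 subjects spanning STEM, the humanities, the social sciences, and professional fields such as law and medicine. The questions are four-option multiple choice, and results are reported as accuracy. It measures breadth of knowledge rather than depth on any one topic.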
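
To make "breadth" concrete, here is a minimal sketch of one way to summarize multiple-choice results by subject. It is not the official evaluation harness: the function names, the record layout, and the sample records are all made up for illustration.

```python
from collections import defaultdict

def per_subject_accuracy(records):
    """Return {subject: accuracy} from (subject, predicted, gold) tuples.

    The tuple layout is an assumption for this sketch, not a standard format.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subject, predicted, gold in records:
        totals[subject] += 1
        hits[subject] += int(predicted == gold)
    return {subject: hits[subject] / totals[subject] for subject in totals}

def macro_average(accuracy_by_subject):
    """Average the per-subject accuracies so every subject counts equally."""
    return sum(accuracy_by_subject.values()) / len(accuracy_by_subject)

# Made-up predictions: two astronomy questions, one law question.
records = [
    ("astronomy", "A", "A"),
    ("astronomy", "B", "C"),
    ("law", "D", "D"),
]
by_subject = per_subject_accuracy(records)
overall = macro_average(by_subject)
```

Averaging per subject first keeps a subject with many questions from dominating the summary. Pooling all questions into a single accuracy figure is the other common choice, and the two can give different numbers.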

Question 2

What is HumanEval?

Accepted Answer

HumanEval measures a model's ability to write functionally correct Python code. It consists of 164 hand-written programming problems. For each one the model is given a function signature and docstring and must generate a body that passes the problem's unit tests.
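
Because a model can be sampled several times per problem, HumanEval results are commonly reported as pass@k: the probability that at least one of k samples passes the unit tests. A minimal sketch of the usual unbiased estimator, 1 - C(n-c, k) / C(n, k), follows. The function name and the example numbers are illustrative, not taken from any particular harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed every unit test,
    k: how many samples the metric is allowed to draw.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k includes a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 10 samples for one problem, 3 of which passed.
single_try = pass_at_k(n=10, c=3, k=1)
five_tries = pass_at_k(n=10, c=3, k=5)
```

The benchmark score is the mean of this value over all problems. With k = 1 it reduces to the fraction of samples that passed.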

Question 3

What is the GPQA benchmark?

Accepted Answer

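GPQA (Graduate-Level Google-Proof Q&A) tests expert-level reasoning with multiple-choice questions written by domain experts. "Google-proof" means that skilled non-experts answer poorly even with unrestricted web access, while experts in the relevant field still find the questions hard. It covers physics, biology, and chemistry at roughly PhD level.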

Benchmark Decoder