Our Methodology
We believe in full transparency. Here's exactly how we collect, verify, and update benchmark data — and what each benchmark actually measures.
✓ Editorial Independence
AI Benchmarks is an independent platform. We are not affiliated with, funded by, or in any commercial relationship with OpenAI, Anthropic, Google, Meta, xAI, or Mistral AI. No AI company can pay to improve their ranking.
Data Sources
All benchmark scores are sourced from peer-reviewed publications, official technical reports, and the LMSYS Chatbot Arena leaderboard. We do not run benchmarks ourselves — we aggregate and verify scores from primary sources.
Our primary sources include official model cards and technical reports from each AI provider, the LMSYS Chatbot Arena human preference leaderboard, arXiv preprints for independently replicated results, and community-verified benchmark repositories on GitHub.
When scores differ between sources, we use the most recently published, peer-reviewed figure. We note discrepancies in the model detail pages.
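The reconciliation rule above can be sketched as a small comparison function. This is an illustrative sketch only, not our production code; the field names (`peerReviewed`, `published`) are hypothetical.

```javascript
// Sketch of the source-reconciliation rule: peer-reviewed figures rank above
// preprints, and ties are broken by publication date, newest first.
// Field names are illustrative, not our actual schema.
function pickScore(reports) {
  return reports
    .slice() // avoid mutating the caller's array
    .sort((a, b) =>
      (b.peerReviewed - a.peerReviewed) ||              // peer-reviewed first
      (new Date(b.published) - new Date(a.published))   // then most recent
    )[0];
}

const reports = [
  { score: 86.4, peerReviewed: false, published: "2024-05-01" },
  { score: 85.9, peerReviewed: true,  published: "2024-03-10" },
  { score: 86.0, peerReviewed: true,  published: "2024-04-02" },
];
// Peer review beats the newer preprint; among peer-reviewed figures, newest wins.
console.log(pickScore(reports).score); // 86.0
```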
Benchmark Definitions & Weights
MMLU: 57 subjects across STEM, social sciences, humanities. Tests broad world knowledge and problem-solving. Higher scores indicate more comprehensive knowledge.
HumanEval: 164 hand-crafted programming challenges. Models generate Python code that is evaluated by running unit tests. The gold standard for coding ability.
MATH: 12,500 competition mathematics problems across algebra, geometry, number theory, and more. Tests deep mathematical reasoning.
GSM8K: 8,500 grade school math word problems requiring multi-step reasoning. A good proxy for everyday numerical reasoning.
GPQA: 448 expert-crafted questions in biology, chemistry, and physics. Designed to be "Google-proof" — questions that web search alone cannot easily answer.
BBH (BIG-Bench Hard): 23 challenging tasks from the BIG-Bench benchmark that resist simple few-shot prompting, requiring genuine reasoning.
Arena ELO: human preference rating from blind A/B comparisons at LMSYS Chatbot Arena. The most realistic measure of real-world usefulness.
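To illustrate how pairwise votes become a rating, here is the classic online Elo update. (LMSYS in practice fits a statistical model over all votes at once rather than updating sequentially, but the intuition — a win against a higher-rated opponent moves ratings more — is the same. The K value here is illustrative.)

```javascript
// Simplified sketch of a pairwise Elo update from one blind A/B vote.
const K = 32; // update step size; illustrative, not LMSYS's actual setting

// Probability that A beats B, given their current ratings.
function expectedScore(ratingA, ratingB) {
  return 1 / (1 + 10 ** ((ratingB - ratingA) / 400));
}

// Returns the new [ratingA, ratingB] after one vote. Rating points are
// zero-sum: whatever A gains, B loses.
function eloUpdate(ratingA, ratingB, aWon) {
  const expA = expectedScore(ratingA, ratingB);
  const delta = K * ((aWon ? 1 : 0) - expA);
  return [ratingA + delta, ratingB - delta];
}

// An upset: the lower-rated model wins and gains more than K/2 points.
const [a, b] = eloUpdate(1200, 1300, true);
```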
Update Process
Arena ELO scores are updated weekly via the LMSYS Chatbot Arena public leaderboard API. All other benchmark scores are reviewed monthly and updated when new official figures are published.
Our automated update script (scripts/fetch-arena.js) fetches the latest ELO data and opens a pull request on our GitHub repository. A human reviewer verifies the changes before merging.
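The heart of such a pipeline is the diff step: compare stored ratings against freshly fetched ones, and open a pull request only when something meaningfully changed. This is a hedged sketch under assumed names (`diffRatings`, a flat model-to-rating map), not the actual contents of scripts/fetch-arena.js.

```javascript
// Sketch of the diff step in a fetch-and-PR pipeline: report new models and
// ratings that moved by at least `threshold` points; ignore tiny drift.
function diffRatings(stored, fetched, threshold = 1) {
  const changes = [];
  for (const [model, rating] of Object.entries(fetched)) {
    const old = stored[model];
    if (old === undefined || Math.abs(rating - old) >= threshold) {
      changes.push({ model, old: old ?? null, new: rating });
    }
  }
  return changes; // an empty list means no pull request is needed
}

const stored  = { "model-a": 1250, "model-b": 1190 };
const fetched = { "model-a": 1263, "model-b": 1190.4, "model-c": 1301 };
// model-a moved 13 points and model-c is new; model-b's 0.4-point drift is noise.
console.log(diffRatings(stored, fetched));
```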
When a provider releases a new model version, we add it to our database within 72 hours of the official announcement, using scores from the official technical report.
Known Limitations
Benchmarks are not perfect. MMLU and HumanEval scores may be inflated for models trained on contaminated datasets. We flag known contamination issues when reported in peer-reviewed literature.
Performance on benchmarks does not always predict real-world usefulness. We include LMSYS Arena ELO (human preference) specifically to counterbalance this limitation.
Pricing data reflects public list prices and may not reflect negotiated enterprise discounts. Always verify pricing directly with the provider before making purchasing decisions.
Found an error? Contact us with a source link and we'll review it within 48 hours.