Humanity's Last Exam

Created 5/23/2026 at 5:20:34 PM • Edited 5/23/2026 at 5:22:18 PM

Benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. The dataset consists of 2,500 challenging questions across over a hundred subjects. We publicly release these questions, while maintaining a private test set of held-out questions to assess model overfitting.

Accuracy. All frontier models achieve low accuracy on Humanity's Last Exam, highlighting significant room for improvement in narrowing the gap between current LLMs and expert-level academic capabilities on closed-ended questions.

Hahaha, the current state of the art is Gemini 3 Pro with 38.3% accuracy, yet the paper predicted:

Given the rapid pace of AI development, it is plausible that models could exceed 50% accuracy on HLE by the end of 2025.

benchmarks evaluations llm
