openai/simple-evals

Created 5/23/2026 at 4:07:26 PM • Edited 5/23/2026 at 4:07:44 PM

July 2025: simple-evals will no longer be updated for new models or benchmark results. The repo will continue to host reference implementations for HealthBench, BrowseComp, and SimpleQA.

Evals are sensitive to prompting, and there's significant variation in the formulations used in recent publications and libraries. Some use few-shot prompts or role playing prompts ("You are an expert software programmer..."). These approaches are carryovers from evaluating base models (rather than instruction/chat-tuned models) and from models that were worse at following instructions.

For this library, we are emphasizing the zero-shot, chain-of-thought setting, with simple instructions like "Solve the following multiple choice problem". We believe that this prompting technique is a better reflection of the models' performance in realistic usage.

We will not be actively maintaining this repository and monitoring PRs and Issues. In particular, we're not accepting new evals. Here are the changes we might accept.
Bug fixes (hopefully not needed!)
Adding adapters for new models
Adding new rows to the table below with eval results, given new models and new system prompts.
This repository is NOT intended as a replacement for https://github.com/openai/evals, which is designed to be a comprehensive collection of a large number of evals.

benchmarks evaluations openai llm

Public