OpenAI Evals

A YAML-first benchmark registry where you contribute domain-specific evaluations that become part of a shared, publicly reusable test suite — no code required for most contributions.

GitHub: openai/evals · ~19k stars · MIT
Docs: github.com/openai/evals/blob/main/docs/build-eval.md

At a Glance

Dimension	Rating
Setup effort	● ● ○ ○ ○
Coding required	○ ○ ○ ○ ○ (YAML + JSONL, no code for basic evals)
RAG evaluation	● ○ ○ ○ ○
Conversational / general LLM	● ● ● ○ ○
Red teaming / adversarial	● ● ○ ○ ○
Production observability	○ ○ ○ ○ ○
EU AI Act support	● ● ○ ○ ○
Benchmark contribution	● ● ● ● ●
Cost at scale	Low (model API cost only)

What Makes This Different

OpenAI Evals is less a tool for evaluating your application and more a registry for contributing reusable evaluations that anyone can run against any model. The value proposition is:

Your eval becomes publicly available — useful for the community, visible to OpenAI, and associated with your name
No code required — a JSONL dataset + a YAML config is sufficient for most contributions
Model-graded evals included — you can specify that a judge model evaluates responses, not just string matching

This makes OpenAI Evals uniquely well-suited for product managers and domain experts who want to contribute evaluations from their area of expertise without writing Python.

Best For

Contributing domain-specific benchmark evals — if you have expertise in EU AI Act compliance, RAG quality patterns, or agentic behavior, you can codify that knowledge as a reusable eval
Testing model knowledge of specific domains — regulatory frameworks, industry standards, technical accuracy in niche areas
Model-graded quality evaluations — using a judge model to evaluate factual accuracy or reasoning quality
Teams that want to contribute to the public eval ecosystem — each merged PR to openai/evals is a publicly visible contribution with your GitHub handle attached
Simple comparison testing — does model A or model B handle this class of prompt better?

Not Great For

Application-specific quality regression — OpenAI Evals is a registry, not a test framework. DeepEval integrates better into your CI/CD pipeline for ongoing quality monitoring
Production observability — no tracing or real-time capabilities
RAG pipeline evaluation — no retrieval-specific metrics; RAGAS is the right tool
Deep adversarial testing — promptfoo and inspect_ai have more sophisticated red teaming capabilities

How It Works

The Two-File Model

A complete eval is two files:

1. Dataset (JSONL): One sample per line

{"input": [{"role": "user", "content": "Is a customer support chatbot for loan applications considered high-risk under the EU AI Act?"}], "ideal": "Yes"}
{"input": [{"role": "user", "content": "Does Article 9 of the EU AI Act require ongoing risk monitoring?"}], "ideal": "Yes"}

2. Config (YAML):

eu-ai-act-knowledge:
  id: eu-ai-act-knowledge
  description: Tests model knowledge of EU AI Act requirements for AI product teams
  metrics:
    - accuracy

eu-ai-act-knowledge/test:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: eu-ai-act-knowledge/test.jsonl

Run with: oaieval gpt-4o eu-ai-act-knowledge

That’s it. No Python required.

Model-Graded Evals

For more nuanced quality evaluation, you can use a judge model:

my-eval/test:
  class: evals.elsuite.modelgraded.classify:ModelGradedClassify
  args:
    samples_jsonl: my-eval/test.jsonl
    eval_type: cot_classify
    modelgraded_spec: fact

The judge model evaluates each response against the expected answer and produces a score. This is more expensive but handles open-ended responses where string matching fails.

Tradeoffs vs. Alternatives

vs. DeepEval Pick OpenAI Evals when you want to contribute to a public registry or need a simple YAML-based eval without Python. Pick DeepEval when you need pytest integration, CI/CD pipelines, or the full metric library (faithfulness, hallucination, etc.).

vs. promptfoo Both support YAML-first configuration. promptfoo is better for pre-production testing of your own application with assertion-based pass/fail. OpenAI Evals is better for contributing standardised benchmarks to the public ecosystem.

vs. inspect_ai OpenAI Evals asks “how does this model score on this benchmark?” inspect_ai asks “is this model safe to deploy?” They operate at different levels of rigor and institutional authority.

Integration Effort

Time to first eval: 20–45 minutes to write a dataset; ~5 minutes to run

pip install evals

oaieval gpt-4o my-eval

Works with: OpenAI models natively; other models via completion API compatibility
Language requirements: Python for setup and running; JSONL + YAML for eval authoring (no Python needed for the eval itself)

Cost at Scale

Cost is essentially the model API cost — no additional eval infrastructure fees.

Volume	Estimated cost (GPT-4o)
100 samples, match eval	~$0.01–$0.05
100 samples, model-graded	~$0.50–$2
1,000 samples, match eval	~$0.10–$0.50

Model-graded evals cost more because the judge model processes both the original response and the grading prompt.

PM-Specific Opportunity: Contributing Domain Evals

For a product manager with domain expertise, OpenAI Evals is the lowest-friction path to a meaningful open-source contribution. The contribution that lands well:

Pick a domain you know — EU AI Act requirements, RAG system design, product prioritisation, agentic workflow safety
Write 50–100 samples — questions with clear correct answers in your domain
Submit a PR — the build-eval.md guide walks through exactly what’s needed

A merged eval in openai/evals is visible on your GitHub profile, directly demonstrates domain expertise, and is cited by anyone who runs that eval against a model.

Example domains with open gaps (as of 2026-05-20):

EU AI Act Articles 9–17 practical application
Multi-agent coordination patterns (correct vs. incorrect architectures)
RAG system design decisions (when to use reranking, when not to)
AI product metrics definition accuracy

EU AI Act Relevance

Moderate. Primarily indirect.

Annex IV (Technical documentation): A published eval in the registry that tests AI Act compliance knowledge is citable evidence that the development team evaluated model knowledge of applicable regulations.
Article 17 (Quality management): Benchmark evals run as part of pre-deployment checks contribute to documented quality management processes.

Gap: No audit trail for production inferences, no adversarial safety testing, no data residency controls. For direct EU AI Act compliance work, inspect_ai or Langfuse provide more substantive evidence.

Version Tracked

Status: Active
Last verified: 2026-05-20
Context note: OpenAI acquired promptfoo in March 2026, consolidating some of their eval tooling investments. The openai/evals registry remains active and accepting community contributions.