LLM Evals Comparison

A PM-authored, vendor-neutral decision guide for choosing between LLM evaluation frameworks.

This is not a benchmarking paper or a feature matrix. It’s a selection guide — structured around real team situations, lifecycle stages, and compliance needs. Skip the 14-metric comparison tables. Answer “which framework should we use?” in under 5 minutes.

Frameworks covered: RAGAS · DeepEval · promptfoo · Langfuse · inspect_ai · OpenAI Evals

Quick Pick

Your situation	Start here
Building a RAG pipeline, need to measure retrieval quality	RAGAS
Writing LLM tests in CI/CD, want pytest-style eval assertions	DeepEval
Red teaming an LLM, testing adversarial prompts	promptfoo
Need production monitoring and observability for LLM traces	Langfuse
EU AI Act compliance, safety evaluation, government context	inspect_ai
Contributing evals to a shared benchmark registry	OpenAI Evals
Need to pick between RAGAS and DeepEval	RAGAS vs DeepEval
Need to pick between promptfoo and inspect_ai	promptfoo vs inspect_ai
Want all 6 frameworks side-by-side	Full matrix
3-person startup, where do I even start?	By Team Type
We need an eval stack, not just one tool	By Lifecycle Stage
Building a system subject to EU AI Act	EU AI Act module
Need to choose an eval framework for my org	Eval Framework RFP
Shipping an AI feature, need a pre-launch eval plan	Pre-Launch Eval Plan

Why This Exists

Existing LLM eval comparisons tend to be one or more of the following:

Vendor-authored (biased toward their product)
Engineer-authored (feature tables, no decision framing)
Incomplete (inspect_ai — used by Anthropic, DeepMind, Google for safety evals — is missing from almost every comparison)
EU AI Act-blind (no resource maps framework choice to compliance requirements, even though obligations for Annex III high-risk systems are scheduled to apply from August 2026)

This guide is written from a product team’s perspective. The question isn’t “which framework has the most metrics?” It’s “given our team, our use case, and our risk profile, which framework should we start with on Monday?”

What’s Here

Framework Profiles

Each framework has a consistent profile covering: best-for, not-great-for, setup effort, cost model, EU AI Act relevance, and how it compares to alternatives.

→ frameworks/

Decision Guides

Three structured guides for choosing:

By Use Case — RAG, conversational AI, red teaming, production monitoring, safety audit
By Team Type — startup / ML platform team / regulated enterprise
By Lifecycle Stage — the insight that you probably need multiple frameworks at different stages

→ decision-guide/

Head-to-Head Comparisons

RAGAS vs DeepEval — the most common choice question
promptfoo vs inspect_ai — both do red teaming; which one?
Full matrix (all 6 frameworks × 10 decision criteria)

→ comparisons/
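
A frameworks-by-criteria matrix becomes a ranking once each criterion has a weight. The sketch below shows one way to do that with a weighted average. The criterion names, weights, scores, and the `framework_a` / `framework_b` labels are all hypothetical placeholders, not values taken from this guide's matrix or RFP template.

```python
def rank_frameworks(scores, weights):
    """Rank frameworks by weighted average score, highest first.

    scores:  {framework: {criterion: score}}  (placeholder 1-5 scale)
    weights: {criterion: weight}              (placeholder values)
    """
    total_weight = sum(weights.values())
    ranked = []
    for framework, by_criterion in scores.items():
        weighted = sum(weights[c] * by_criterion[c] for c in weights) / total_weight
        ranked.append((framework, round(weighted, 2)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights for a small team that cares most about setup effort.
weights = {"setup_effort": 3, "ci_integration": 2, "compliance_evidence": 1}

# Hypothetical scores; replace with your own assessment of real frameworks.
scores = {
    "framework_a": {"setup_effort": 5, "ci_integration": 4, "compliance_evidence": 2},
    "framework_b": {"setup_effort": 2, "ci_integration": 3, "compliance_evidence": 5},
}

ranking = rank_frameworks(scores, weights)
for framework, score in ranking:
    print(f"{framework}: {score}")
```

Changing the weights is how the same matrix can give different answers for different team types: a regulated enterprise might put most of the weight on `compliance_evidence`, which would reverse this ranking.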

EU AI Act Module

Which frameworks support which Act requirements. Maps Articles 9, 10, 13, 15, and 17, plus Annex IV, to concrete framework choices. Includes a 5-phase pre-deployment checklist for Annex III high-risk AI systems.

EU AI Act overview — enforcement timeline, Annex III categories, articles that create eval obligations
Framework → Article mapping — master mapping table + red-flag matrix of compliance gaps
High-risk AI checklist — 5-phase pre-deployment checklist, recommended stacks by risk tier

→ eu-ai-act/

Templates

Working documents you can copy and use directly.

Eval Framework RFP — scoring template for choosing eval frameworks for your org; weighted by team type
Pre-Launch Eval Plan — defines metrics, baselines, thresholds, and sign-off gate before shipping an AI feature

→ templates/
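
To make the sign-off gate in a pre-launch eval plan concrete, here is a minimal sketch of a gate that checks each metric against both its threshold and its baseline. It is not the template itself. The `MetricGate` class, the `sign_off` function, the metric names, and all numbers are hypothetical, and it assumes higher scores are better for every metric.

```python
from dataclasses import dataclass


@dataclass
class MetricGate:
    name: str
    baseline: float   # score of the current system (placeholder value)
    threshold: float  # minimum score required to ship (placeholder value)


def sign_off(gates, results):
    """Return (approved, failures). Assumes higher scores are better."""
    failures = []
    for gate in gates:
        score = results.get(gate.name)
        if score is None:
            failures.append(f"{gate.name}: no result recorded")
        elif score < gate.threshold:
            failures.append(f"{gate.name}: {score:.2f} is below threshold {gate.threshold:.2f}")
        elif score < gate.baseline:
            failures.append(f"{gate.name}: {score:.2f} regresses from baseline {gate.baseline:.2f}")
    return (not failures, failures)


# Hypothetical plan: two metrics, each with a baseline and a ship threshold.
gates = [
    MetricGate("faithfulness", baseline=0.80, threshold=0.85),
    MetricGate("answer_relevancy", baseline=0.75, threshold=0.70),
]

# Hypothetical eval results from whichever framework the team runs.
results = {"faithfulness": 0.88, "answer_relevancy": 0.72}

approved, failures = sign_off(gates, results)
print("approved" if approved else "blocked")
for failure in failures:
    print(" -", failure)
```

Here the gate blocks the launch: `answer_relevancy` clears its threshold but falls below its baseline. Whether a regression should block or only warn is a decision to write into the plan before the eval runs, not after.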

A Note on Scope

This guide covers frameworks for evaluating LLM outputs and behaviors. It does not cover:

LLM benchmark leaderboards (MMLU, HumanEval, etc.)
Model selection or comparison
Data labeling or annotation platforms

Evals move fast. See CHANGELOG.md for framework version tracking.

Contributing

See CONTRIBUTING.md. The most useful contributions are: framework version updates, new use-case examples, and corrections to the EU AI Act mapping.

Maintained by @aadhar-build. Not affiliated with any framework vendor.