By Lifecycle Stage
The key insight most eval guides miss: you don’t pick one framework. You need different tools at different stages of development. This guide maps the right framework(s) to each stage.
The Core Insight
LLM eval is not a one-time exercise. The questions you’re answering at each stage are fundamentally different:
| Stage | The question | Right tool |
|---|---|---|
| 0 — Prototype | “Does this idea even work?” | Informal / vibes |
| 1 — Development | “Does my output quality meet the bar I defined?” | RAGAS, DeepEval |
| 2 — Pre-production | “What can an adversary do to this?” | promptfoo, inspect_ai |
| 3 — Production | “Is quality holding in the real world?” | Langfuse |
| 4 — Compliance/audit | “Can I prove this system is safe?” | inspect_ai, OpenAI Evals |
Most teams have Stage 1. Most are missing Stages 2 and 3. Almost no team without regulatory pressure has Stage 4.
Stage 0 — Prototype
Goal: Validate that your approach is directionally correct before investing in eval infrastructure.
What “eval” looks like here:
- Manual inspection of 20–50 outputs
- Spot-checking with domain experts
- A/B prompts with human preference ratings
Frameworks useful here: None formally. A notebook with a judge model (GPT-4o or Claude rating outputs 1–5) is sufficient.
When to move to Stage 1: When you’ve decided the approach works and you’re iterating on prompt design or retrieval configuration. Once you’re making changes that require you to verify “did this get better or worse?”, you need measurement — not vibes.
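For the A/B prompt comparison mentioned above, a tally is usually all the tooling Stage 0 needs. The following is a minimal sketch using only the standard library; the vote labels and the sample data are hypothetical and not taken from any framework.

```python
from collections import Counter

def preference_summary(votes):
    """Summarise pairwise A/B votes; each vote is 'A', 'B', or 'tie'."""
    counts = Counter(votes)
    decided = counts["A"] + counts["B"]
    # Win rate ignores ties; it is None when every vote was a tie.
    win_rate_a = counts["A"] / decided if decided else None
    return {
        "A": counts["A"],
        "B": counts["B"],
        "tie": counts["tie"],
        "win_rate_a": win_rate_a,
    }

# Hypothetical ratings from a reviewer comparing two prompt variants.
votes = ["A", "B", "A", "A", "tie", "B", "A"]
summary = preference_summary(votes)
print(summary)
```

With only 20–50 outputs, treat the win rate as a directional signal rather than a measurement.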
Stage 1 — Development Evals
Goal: Systematic measurement of output quality during active development. Catch regressions before they reach production.
For RAG systems → RAGAS
RAGAS is purpose-built for RAG pipelines. The metrics it provides — context precision, context recall, faithfulness, answer relevancy — directly measure whether the retrieval step is working and whether the generation is grounded in retrieved context.
pip install ragas
Typical eval loop:
- Define a golden dataset: 50–200 question/ground-truth/context triples
- Run your RAG pipeline on each sample
- Score with RAGAS metrics
- Set thresholds: e.g., faithfulness > 0.85, context precision > 0.75
- Run in CI on every retrieval or prompt change
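The threshold and CI steps in the loop above can be sketched without any framework. This is an illustrative gate, not the RAGAS API: it assumes you have already collected per-sample scores as dictionaries, and the sample values below are made up.

```python
from statistics import mean

# Thresholds taken from the example in the list above; tune them to your own bar.
THRESHOLDS = {"faithfulness": 0.85, "context_precision": 0.75}

def check_thresholds(per_sample_scores, thresholds=THRESHOLDS):
    """Return {metric: (mean_score, passed)} for each thresholded metric."""
    results = {}
    for metric, minimum in thresholds.items():
        avg = mean(sample[metric] for sample in per_sample_scores)
        results[metric] = (avg, avg > minimum)
    return results

# Hypothetical per-sample scores standing in for real metric output.
scores = [
    {"faithfulness": 0.90, "context_precision": 0.80},
    {"faithfulness": 0.95, "context_precision": 0.60},
    {"faithfulness": 0.88, "context_precision": 0.70},
]
report = check_thresholds(scores)
failed = [metric for metric, (_, ok) in report.items() if not ok]
print(report, failed)
```

In a CI job, a non-empty `failed` list would typically translate into a non-zero exit code so the change is blocked.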
Watch for: The v0.3 → v0.4 migration introduced breaking changes. If you are adopting RAGAS now, start on v0.4 directly.
Cost: ~$0.50–$5 per 100 samples with GPT-4o as judge (most metrics require an LLM judge). NonLLM-based context precision is free but limited to binary cases.
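To see where a figure in that range can come from, the arithmetic can be sketched as below. Every number in the example call is a placeholder assumption (token counts, prices, one judge call per metric); substitute your own measurements and your provider's current pricing.

```python
def judge_cost_usd(samples, metrics, input_tokens, output_tokens,
                   usd_per_m_input, usd_per_m_output, calls_per_metric=1):
    """Rough judge cost: LLM calls per sample per metric times cost per call."""
    calls = samples * metrics * calls_per_metric
    per_call = (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000
    return calls * per_call

# Illustrative inputs only: token counts and per-million-token prices are assumptions.
estimate = judge_cost_usd(
    samples=100, metrics=4, input_tokens=1500, output_tokens=200,
    usd_per_m_input=2.50, usd_per_m_output=10.00,
)
print(f"${estimate:.2f}")
```

Under these assumed inputs the estimate is about $2.30 for 100 samples; metrics that make several judge calls per sample push it higher.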
For general LLM (conversational, agents, summarization) → DeepEval
DeepEval provides pytest-style LLM unit testing. You define test cases with inputs, expected outputs (or rubrics), and thresholds. Tests fail if metrics fall below threshold — same pattern as unit tests.
pip install deepeval
Typical eval loop:
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_customer_support_response():
    # run_my_llm is your application's entry point; it is not part of DeepEval.
    question = "How do I cancel my subscription?"
    test_case = LLMTestCase(
        input=question,
        actual_output=run_my_llm(question),
        expected_output="Cancellation is available in Settings > Account > Cancel.",
    )
    # The test fails if answer relevancy scores below the 0.8 threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.8)])
What DeepEval does well: Multi-metric testing in one framework, pytest CI integration, G-Eval (custom rubric) for open-ended tasks.
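What to watch for: PostHog telemetry is on by default. DeepEval’s documentation describes opting out by setting the DEEPEVAL_TELEMETRY_OPT_OUT environment variable (for example, export DEEPEVAL_TELEMETRY_OPT_OUT=YES); confirm the accepted value for your version. Enterprise deployments should review the data privacy docs before onboarding.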
Can I use RAGAS and DeepEval together?
Yes. The common pattern:
- RAGAS for retrieval-specific metrics: context precision, context recall, faithfulness
- DeepEval as the test framework: wraps RAGAS scores into CI-runnable assertions
You can pass RAGAS metric scores into DeepEval test cases. This gives you retrieval-quality metrics (RAGAS) inside a structured test runner (DeepEval).
Stage 2 — Pre-Production Adversarial Testing
Goal: Find what an adversary — or a badly worded production prompt — can make your system do before your users find it.
This stage is almost universally skipped by teams without a security mandate. That’s a mistake: prompt injection, PII leakage, and jailbreak vulnerabilities are real and discoverable before launch.
For systematic red teaming → promptfoo
promptfoo generates adversarial test cases automatically across attack categories: prompt injection, jailbreaks, PII extraction, indirect RAG injection, OWASP LLM Top 10.
# promptfooconfig.yaml
redteam:
  purpose: "Customer support chatbot for a fintech app"
  plugins:
    - owasp:llm
    - pii:direct
    - harmful:financial-crime
  strategies:
    - jailbreak
    - prompt-injection
npx promptfoo redteam run
Time to first red team run: 30–60 minutes. No Python required.
Cost: ~$1–$10 per campaign depending on attack scope and model used for generation.
For EU-regulated, government, or high-stakes contexts → inspect_ai
promptfoo and inspect_ai both do red teaming. The difference:
- promptfoo produces a developer-facing pass/fail report
- inspect_ai produces audit-grade structured logs (.eval format) with full input/output attribution, suitable for regulatory documentation
If you’re building for a regulated industry (fintech, healthtech, government) or preparing for EU AI Act conformity assessment, the output of your red teaming needs to be documentable. inspect_ai’s log format is purpose-built for this.
pip install inspect-ai
inspect eval inspect_evals/src/inspect_evals/agentharm/agentharm.py \
--model anthropic/claude-3-5-sonnet-20241022
Should I run both?
If you have the capacity: yes. A practical split:
- promptfoo for fast iteration and broad attack surface coverage during development
- inspect_ai for a structured, documentable safety eval run before deployment
For most teams, starting with promptfoo in Stage 2 is the pragmatic choice. Add inspect_ai when compliance requirements create the need for audit-grade evidence.
Stage 3 — Production Observability
Goal: Continuous measurement of output quality on real production traffic. Catch drift, degradation, and edge cases that your pre-launch eval dataset didn’t include.
→ Langfuse
Langfuse traces every LLM inference in production: inputs, outputs, latency, token cost, model used. You can attach scores to any trace — either automatically (using an LLM judge on a sample) or via human annotation pipelines.
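# Import path for the Langfuse Python SDK v2; v3 exposes observe from the top-level langfuse package.
from langfuse.decorators import observe

@observe()
def my_llm_pipeline(user_message: str) -> str:
    # Nested @observe() functions and Langfuse-integrated LLM calls are attached to this trace
    ...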
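Scoring "a sample" of traces with an LLM judge needs some way to choose the sample. One option is deterministic hash-based sampling, so the same trace is always in or out regardless of which worker sees it. The sketch below is independent of the Langfuse SDK; the function name, the trace IDs, and the 5% rate are all assumptions for illustration.

```python
import hashlib

def should_score(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of traces for judge scoring."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Keep 53 bits so the value maps exactly onto a float in [0, 1).
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return bucket < sample_rate

# Hypothetical trace IDs; in practice these come from your tracing backend.
trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if should_score(t, 0.05)]
print(len(sampled))
```

Because the decision depends only on the ID, re-running the scoring job does not judge a different subset each time, which keeps judge cost predictable.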
What Langfuse gives you that pre-production eval doesn’t:
- Real user inputs, not synthetic datasets
- Longitudinal quality trends (is faithfulness drifting down over 30 days?)
- Cost visibility per query, per user, per session
- Input that feeds back into your Stage 1 golden dataset
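The "is faithfulness drifting down over 30 days?" question can be answered with a simple window comparison once daily scores are exported. This is a minimal sketch on synthetic data; the window size and the alert tolerance are assumptions, not recommended values.

```python
from statistics import mean

def drift(daily_scores, window=7):
    """Change in mean score between the first and last `window` days."""
    if len(daily_scores) < 2 * window:
        raise ValueError("need at least two full windows of data")
    baseline = mean(daily_scores[:window])
    recent = mean(daily_scores[-window:])
    return recent - baseline

# Synthetic 30 days of mean faithfulness, declining linearly from 0.90.
daily_faithfulness = [0.90 - 0.002 * day for day in range(30)]
delta = drift(daily_faithfulness)

# ALERT_DROP is an assumed tolerance; choose one that matches your quality bar.
ALERT_DROP = 0.03
print(round(delta, 3), delta < -ALERT_DROP)
```

A comparison of window means is deliberately crude; it is enough to trigger a closer look, at which point the flagged traces become candidates for the Stage 1 golden dataset.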
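EU data residency: Langfuse’s EU Cloud (cloud.langfuse.com) is hosted on AWS in the EU (eu-west-1, Ireland, per Langfuse’s documentation at the time of writing; verify the current region before relying on it). GDPR-compliant, DPA available. Self-hosting is also available for full data sovereignty.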
Stage 4 — Compliance and Safety Audit
Goal: Produce documented evidence that your AI system was evaluated against safety and compliance requirements. Needed for EU AI Act conformity assessment, enterprise procurement security reviews, and regulated industry certifications.
→ inspect_ai (primary)
inspect_ai’s .eval log format is designed for this purpose: structured, timestamped, model-attributed, and archivable. The UK AISI (which maintains inspect_ai) is involved in developing the international AI safety standards that will become the harmonised standards referenced in EU AI Act Article 40. Using inspect_ai keeps your evaluation methodology aligned with the emerging regulatory standard.
Recommended evals to run pre-deployment for high-risk AI systems:
| Eval | From inspect_evals | What it covers |
|---|---|---|
| AgentHarm | ✓ | 110 adversarial behaviors, 11 harm categories |
| HarmBench | ✓ | Broad harmful behavior |
| CyberSecEval | ✓ | Security risks |
| Domain-specific | Write custom Task | Your specific regulatory context |
Pin to commit: inspect_ai does not follow semantic versioning, so any update may change behavior. For audit reproducibility, pin to a specific commit in your requirements file.
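Alongside the pin, it can help to store a record of what was actually installed next to each eval log. The sketch below uses only the standard library; the function name and the package list are illustrative assumptions. It records version strings, not commit hashes, so the pinned requirements file remains the source of truth.

```python
import json
import platform
from importlib import metadata

def environment_manifest(packages):
    """Record the Python version and the installed version of each package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {"python": platform.python_version(), "packages": versions}

# Package names are examples; list whatever your eval run depends on.
manifest = environment_manifest(["inspect-ai", "inspect-evals"])
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Archiving this JSON with the .eval logs gives a reviewer one place to check which environment produced a given result.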
→ OpenAI Evals (for registry-based benchmark evidence)
If your domain has publicly available benchmark evals (or you can contribute one), running against the OpenAI Evals registry provides citable, reproducible evidence of model quality on published benchmarks.
This is particularly relevant for EU AI Act Annex IV (Technical Documentation): a published eval in the registry that tests compliance-relevant knowledge is citable evidence of pre-deployment evaluation.
Recommended Stacks by Context
Minimum viable eval stack (startup, RAG product)
Stage 1: RAGAS → retrieval quality measurement
Stage 2: promptfoo (basic) → red team before first public launch
Stage 3: Langfuse → production tracing and cost visibility
Full eval stack (growth-stage, EU market)
Stage 1: RAGAS + DeepEval → metric depth + CI regression
Stage 2: promptfoo + inspect_ai → broad adversarial + audit-grade safety
Stage 3: Langfuse → production observability (EU Cloud)
Stage 4: inspect_ai → pre-deployment safety certification
Regulated enterprise (fintech, healthtech, govtech)
Stage 1: DeepEval → pytest-integrated CI eval suite
Stage 2: inspect_ai → government-grade safety eval with audit logs
Stage 3: Langfuse (self-hosted) → full data sovereignty, GDPR-compliant tracing
Stage 4: inspect_ai + OpenAI Evals → conformity assessment documentation
Quick Diagnostic
If you’re not sure which stage you’re at:
- Can you answer “what metric determines if my LLM output is good enough?” If not, you’re at Stage 0. Don’t pick a framework yet; define the metric first.
- Do you have a golden dataset of 50+ examples with expected outputs? If not, start there before instrumenting anything.
- Have you run a single adversarial test against your system? If not, add promptfoo before shipping. A first run takes 30–60 minutes and often finds something.
- Do you know what your production outputs look like? If you’re guessing based on your test set, you need Langfuse. Real traffic surfaces inputs that synthetic datasets miss.
- Is your system classified as high-risk under EU AI Act Annex III? If yes (loan decisions, biometric identification, safety-critical infrastructure, etc.), inspect_ai Stage 4 is not optional.