promptfoo vs inspect_ai

Both do red teaming and adversarial testing. The difference is who the output is for and what it needs to prove.


The One-Line Version

promptfoo finds vulnerabilities before your users do — fast, YAML-driven, developer-facing.
inspect_ai proves a system was safely evaluated — rigorous, audit-grade, government-backed.

The question isn’t which is better. It’s which audience needs to be convinced. For a developer iteration loop: promptfoo. For a regulator or enterprise auditor: inspect_ai.


At a Glance

Dimension promptfoo inspect_ai
Maintained by Community (acquired by OpenAI, March 2026) UK AI Safety Institute (gov)
Primary audience Developers, security teams Safety researchers, compliance teams
Configuration YAML-first, no Python required Python (required)
Setup time 15–30 minutes 1–3 hours
Output format HTML/JSON report, web UI .eval structured log (audit-grade)
Adversarial coverage ● ● ● ● ● ● ● ● ● ○
Audit trail quality ● ● ○ ○ ○ ● ● ● ● ●
EU AI Act evidence value ● ● ● ○ ○ ● ● ● ● ●
Multi-turn agentic eval ● ○ ○ ○ ○ ● ● ● ● ○
Published safety benchmarks ✓ (inspect_evals)
CI/CD integration ✓ (native) Manual
Self-host capable
Institutional authority None UK AISI
GitHub stars ~21k ~2k
License MIT MIT

What promptfoo Does Better

Speed and breadth of adversarial coverage

promptfoo’s red team engine is the most comprehensive automated attack generation available in open source. From a single YAML config, it generates hundreds of adversarial test cases across:

  • Prompt injection (direct and indirect via RAG context)
  • Jailbreaks (role-play, encoding, DAN variants)
  • PII extraction (direct ask, indirect inference)
  • Harmful content (12 categories including violence, CSAM detection, financial crime)
  • OWASP LLM Top 10 — full coverage with plugin-per-category
  • Indirect injection — adversarial content embedded in retrieved documents
redteam:
  purpose: "Customer-facing loan application assistant"
  plugins:
    - owasp:llm
    - pii:direct
    - pii:indirect
    - harmful:financial-crime
    - harmful:hate
  strategies:
    - jailbreak
    - jailbreak:tree
    - prompt-injection

This generates 200–500 test cases automatically. A team with no security background can run this in 30 minutes and get a structured vulnerability report.

inspect_ai’s AgentHarm benchmark covers 110 adversarial behaviors across 11 harm categories — excellent coverage for published benchmarks, but it’s a fixed set. promptfoo’s generation is open-ended and customisable per application context.

Developer experience

promptfoo is designed to be used continuously during development, not just at release gates:

  • No Python required for basic use — YAML config and npx promptfoo redteam run
  • Web UI comparison table — immediate visual output
  • CI/CD integration — runs as a pipeline step with pass/fail exit codes
  • Multi-model comparison — run the same attack suite against GPT-4o vs Claude simultaneously

The iteration loop is fast enough to run on every PR that touches prompt logic.

Cost

promptfoo’s attack generation uses an LLM once (to create the test cases), then re-runs the same cases deterministically. A full red team campaign costs $1–10 depending on scope and judge model. Re-running against a new model version is essentially free.


What inspect_ai Does Better

Audit-grade output

This is the core difference. promptfoo produces a developer-readable report. inspect_ai produces an .eval log — a structured JSON archive containing:

  • Full configuration (model, task, parameters)
  • Every sample: input, model output, scorer output, score
  • Timestamps, model attribution
  • Reproducible: run the same .eval file against the same commit six months later
inspect view  # opens the log in a browser-based viewer
inspect list runs  # all previous eval runs

This log format is purpose-built for inclusion in regulatory documentation. An auditor can open an .eval file and reconstruct exactly what was tested, when, with what model, and what the result was. promptfoo’s JSON output is readable but not structured to this standard.

Institutional authority

inspect_ai is maintained by the UK AI Safety Institute — the government body responsible for frontier AI safety evaluations. Anthropic, Google DeepMind, and Meta AI have used inspect_ai evals as part of their pre-deployment safety testing. The UK AISI publishes evaluation reports built on inspect_ai.

This matters for a specific, important reason: EU AI Act Article 40 references harmonised standards. The AISI is involved in developing the international AI safety standards that will eventually be those harmonised standards. Using inspect_ai positions your evaluation methodology to be aligned with whatever those standards become. Using promptfoo does not.

inspect_evals: peer-reviewed published benchmarks

The companion repo inspect_evals contains evaluations that have been reviewed and published by the UK AISI:

Eval What it tests
AgentHarm 110 adversarial behaviors, 11 harm categories
HarmBench Broad harmful behavior across multiple attack vectors
CyberSecEval Cybersecurity knowledge, vulnerability generation risks
GAIA Multi-step tool-use reasoning (Level 1–3)
SWE-bench Verified Real GitHub issue resolution by agents

Running AgentHarm against your model and including the .eval log in your technical documentation is the most defensible safety evidence currently available to product teams. The benchmarks appear in Anthropic and DeepMind pre-deployment reports.

Multi-turn agentic evaluation

inspect_ai gives the model access to real tools — web browser, bash, file operations — and evaluates whether it uses them correctly across multi-turn scenarios. This is what you need for evaluating AI agents that take real-world actions.

@task
def agentic_safety_eval():
    return Task(
        dataset=csv_dataset("safety_scenarios.csv"),
        solver=[use_tools(web_browser(), bash()), generate()],
        scorer=model_graded_fact()
    )

promptfoo’s agentic evaluation is limited. Multi-turn adversarial scenarios are possible but not first-class.


The EU AI Act Decision Point

This is where the two frameworks diverge most sharply in practice.

Context Recommendation
Startup, pre-launch, no regulatory pressure promptfoo — fast, comprehensive, no Python required
Growth-stage company with enterprise customers promptfoo in dev + inspect_ai before major releases
EU-regulated industry (fintech, health, gov) inspect_ai primary — audit logs in Annex IV docs
Preparing EU AI Act conformity assessment inspect_ai mandatory — Article 9 risk management evidence
Government or defence contractor inspect_ai — institutional authority required
Publishing safety research inspect_ai — comparable to frontier lab reports

The EU AI Act’s Article 9 requires a documented risk management system. “We ran promptfoo” is evidence of adversarial testing. “We ran AgentHarm via inspect_ai and here are the .eval logs” is evidence of systematic safety evaluation using the same methodology as Anthropic and DeepMind. For a regulator, those are not equivalent.


Versioning

promptfoo: Semantic versioning. Current stable: v0.121.x. Upgrade normally.
inspect_ai: No semver. Pin to a specific commit for reproducible evals.

# requirements.txt — for audit reproducibility
git+https://github.com/UKGovernmentBEIS/inspect_ai.git@<commit-hash>

This matters: if you’re including inspect_ai eval results in Annex IV technical documentation, you need to prove the methodology didn’t change between runs. The commit hash is your proof.


Can You Use Both?

Yes, and it’s the recommended pattern for regulated teams:

promptfoo: continuous red teaming during development
           → runs on every PR, catches new vulnerabilities fast
           → developer-facing report, no regulatory weight

inspect_ai: structured safety evaluation at release gates
            → runs AgentHarm + domain-specific evals before deploy
            → .eval logs included in Annex IV documentation

This gives you fast feedback (promptfoo) and audit-grade evidence (inspect_ai) without needing to choose.


The Verdict

If your question is… Answer
“Find vulnerabilities before launch, fast” promptfoo
“Prove this system is safe to a regulator” inspect_ai
“Run OWASP LLM Top 10 coverage” promptfoo
“Run AgentHarm (110 published behaviors)” inspect_ai
“I need CI/CD adversarial gate, no Python” promptfoo
“We’re preparing an EU AI Act conformity assessment” inspect_ai
“3-person startup, first red team ever” promptfoo
“Enterprise selling into regulated EU markets” Both

promptfoo profile · inspect_ai profile · Full matrix · Back to comparisons