promptfoo
The most capable open-source tool for red teaming LLMs and running systematic adversarial tests — finds failure modes before your users do.
GitHub: promptfoo/promptfoo · ~21k stars · MIT
Docs: promptfoo.dev
Note: Acquired by OpenAI in March 2026. Remains MIT-licensed open source.
At a Glance
| Dimension | Rating |
|---|---|
| Setup effort | ● ● ○ ○ ○ |
| Coding required | ● ○ ○ ○ ○ (YAML-first) |
| RAG evaluation | ● ● ○ ○ ○ |
| Conversational / general LLM | ● ● ● ○ ○ |
| Red teaming / adversarial | ● ● ● ● ● |
| Production observability | ○ ○ ○ ○ ○ |
| EU AI Act support | ● ● ● ○ ○ |
| Cost at scale | Low–Medium (config-driven, judge model optional) |
Best For
- Red teaming LLMs — systematic adversarial testing across hundreds of attack categories: prompt injection, jailbreaks, PII extraction, OWASP LLM Top 10
- Prompt regression testing — comparing outputs across model versions, prompt variants, or configuration changes before deploying
- No-code eval setup — YAML configuration means non-engineers can write and run evals without Python
- CI/CD prompt quality gates — assertion-based pass/fail pipeline integration
- Multi-model comparison — run the same prompt suite against GPT-4o, Claude, Gemini simultaneously and compare outputs side-by-side
Not Great For
- RAG-specific metric depth — no faithfulness, context precision, or retrieval-specific metrics; use RAGAS for RAG quality
- Production monitoring — no trace collection or real-time observability; it’s a pre-production testing tool
- Teams that need Python-native testing — DeepEval’s pytest integration is more natural for Python-heavy teams
- Long-horizon agentic evaluation — tool call quality and multi-step agent behavior aren’t first-class concerns
How It Works
promptfoo is configuration-first. A basic eval is a YAML file:
# promptfooconfig.yaml
prompts:
- "Summarize this document: "
providers:
- openai:gpt-4o
- anthropic:claude-3-5-sonnet-20241022
tests:
- vars:
document: "The Eiffel Tower was built in 1889..."
assert:
- type: contains
value: "1889"
- type: llm-rubric
value: "Summary is accurate and under 100 words"
- type: not-contains
value: "I cannot"
Run with: npx promptfoo eval
The output is a web UI comparison table showing each model’s response against every assertion.
Red Teaming
promptfoo’s red teaming engine is its primary differentiator. It generates adversarial test cases automatically across:
| Attack Category | What it tests |
|---|---|
| Prompt injection | Does the model follow injected instructions in user input? |
| Jailbreaks | Does the model comply with policy-violating requests? |
| PII extraction | Does the model leak personal information it was given? |
| Harmful content | Does the model generate dangerous or illegal content? |
| Indirect injection | Does RAG context contain adversarial instructions the model follows? |
| OWASP LLM Top 10 | Full coverage of the OWASP LLM security taxonomy |
redteam:
purpose: "Customer service assistant for a bank"
plugins:
- owasp:llm
- harmful:violence
- pii:direct
strategies:
- jailbreak
- prompt-injection
This generates hundreds of adversarial test cases automatically and scores pass/fail against each.
Tradeoffs vs. Alternatives
vs. DeepEval Pick promptfoo when your primary concern is adversarial robustness and safety. Pick DeepEval when you want pytest-style quality regression tests. DeepEval has basic safety metrics; promptfoo has a dedicated red teaming engine an order of magnitude more comprehensive.
vs. inspect_ai Both do adversarial/safety testing. The key difference:
- promptfoo is developer-facing, YAML-driven, CI/CD-first, works with any LLM
- inspect_ai is government-grade, carries UK AISI institutional authority, and maps directly to EU AI Act conformity assessment requirements
For a startup or mid-size product team: promptfoo. For a regulated EU company preparing for AI Act audits: inspect_ai (or both, targeting different audiences).
vs. RAGAS Different jobs entirely. Use RAGAS for retrieval quality in RAG systems. Use promptfoo for adversarial robustness testing of any LLM-powered feature.
Integration Effort
Time to first eval: 15–30 minutes
npm install -g promptfoo
promptfoo init
promptfoo eval
Works with: Any LLM provider via provider config — OpenAI, Anthropic, Google, Azure, Bedrock, Ollama, or any HTTP endpoint
Language: YAML config + optional JavaScript/Python for custom providers and assertions
No Python required for basic use
Cost at Scale
| Mode | Cost |
|---|---|
| Assertion-only evals (contains, regex, JSON schema) | Free — no LLM calls for assertions |
| LLM-rubric assertions (llm-rubric type) | ~$0.001–$0.01 per assertion depending on judge model |
| Red team generation | ~$1–$10 per red team run depending on scope |
promptfoo is significantly cheaper than DeepEval or RAGAS at scale because many assertions are deterministic (no LLM judge needed). The red team generation step uses an LLM, but it’s a one-time cost per campaign.
EU AI Act Relevance
Good — particularly strong for Article 9 risk management.
- Article 9 (Risk management system): Red teaming directly maps to the Act’s requirement to identify and test for foreseeable risks. promptfoo’s OWASP LLM Top 10 coverage gives structured, documentable evidence of adversarial risk testing.
- Article 13 (Transparency): The YAML configuration creates a human-readable, auditable record of what was tested and what the acceptance criteria were.
- Annex IV (Technical documentation): promptfoo’s HTML/JSON output reports are structured enough to include in technical documentation.
Key limitation for EU compliance: promptfoo provides no audit trail for production inferences and has no EU data residency controls. It’s a pre-production testing tool — pair with Langfuse for production audit trails and inspect_ai for formal conformity assessment evidence.
OWASP mapping contribution: The community is actively mapping promptfoo plugins to OWASP LLM Top 10 categories (see open issue #6900). Once complete, this will make the EU AI Act → OWASP → promptfoo plugin chain fully traceable.
Version Tracked
- Current stable: v0.121.x
- Last verified: 2026-05-20
- Acquisition note: OpenAI acquired promptfoo in March 2026. The repo remains MIT-licensed and community-maintained. Watch for changes to the commercial offering post-acquisition, but the core CLI is unaffected.
→ promptfoo vs inspect_ai · Full matrix · By Use Case: Red teaming · Back to Framework Profiles