Changelog
Tracks material changes to framework profiles as the eval ecosystem evolves.
2026-05-22 — v1.0.0 (Templates, cross-linking, launch-ready)
Added
templates/eval-framework-rfp.md— weighted scoring template (8 criteria × 3 team-type weight sets), pilot protocol, dealbreaker checklist, decision output formattemplates/pre-launch-eval-plan.md— 8-section working document: feature scope, metrics, dataset, framework selection, baseline, thresholds, production monitoring, sign-off gate; Annex IV archival notetemplates/README.md— templates index
Changed
README.md— quick-pick table expanded to 14 rows covering all content; EU AI Act and Templates sections updated (removed “coming soon” caveats); full content listing for all five modules- Framework profiles — cross-links added to RAGAS, DeepEval, promptfoo, inspect_ai pointing to relevant comparison pages and decision guide use-case anchors
Repo is now complete. All 7 planned sessions delivered.
2026-05-22 — v0.5.0 (EU AI Act module)
Added
eu-ai-act/README.md— enforcement timeline, Annex III categories, four articles that create eval obligations (Art. 9/10/13/15/17), audit trail requirements per frameworkeu-ai-act/framework-mapping.md— master mapping table (6 frameworks × 7 Act requirements); article-by-article breakdown with practical guidance; red-flag matrix of what each framework cannot provideeu-ai-act/high-risk-checklist.md— 5-phase pre-deployment checklist; bias testing procedures; inspect_ai run templates; Annex IV documentation package spec; recommended stacks for Tier 1/2/3 risk systems
Key content shipped:
- “No single framework is sufficient for Annex III compliance” — minimum stack is inspect_ai + Langfuse
- Full Article 40 (harmonised standards) analysis — inspect_ai is the only framework whose maintainer is involved in developing those standards
- Red-flag matrix: what each framework cannot provide for high-risk AI — the compliance gaps most teams don’t know about
- Concrete Annex IV documentation package template — what files to archive, alongside which artifacts
2026-05-22 — v0.4.0 (Head-to-head comparisons)
Added
comparisons/ragas-vs-deepeval.md— deep comparison: retrieval metrics, CI/CD integration, cost, when to use both together; includes “what RAGAS does that DeepEval doesn’t” and vice versacomparisons/promptfoo-vs-inspect-ai.md— developer speed vs audit-grade output; EU AI Act decision point; when to use bothcomparisons/full-matrix.md— all 6 frameworks × 10 criteria; dimension deep-dives; recommended stacks; “what’s missing from all of them”comparisons/README.md— index page
Key content shipped:
- “RAGAS vs DeepEval are not competitors — most RAG teams use both” structured clearly with concrete integration pattern
- EU AI Act decision point for promptfoo vs inspect_ai:
We ran AgentHarm via inspect_aicarries regulatory weight thatWe ran promptfoodoes not - Full matrix includes “what no single framework covers” section (business metrics, multilingual depth, fine-grained attribution)
2026-05-22 — v0.3.0 (Decision guides)
Added
decision-guide/by-lifecycle-stage.md— maps framework choice to dev → pre-production → production → compliance stages; recommended stacks for startup, enterprise, regulated EUdecision-guide/by-team-type.md— team-context guides for 3-person RAG startup, enterprise ML platform, regulated EU company, research/safety teamdecision-guide/by-use-case.md— 9 use cases with primary recommendation, runner-up, and when to pick eachdecision-guide/README.md— 5-minute decision framework and guide index
Key content shipped:
- “Most teams need a stack, not a single framework” — lifecycle stage mapping makes this concrete
- Recommended stacks by context: minimum viable eval, full stack (EU market), regulated enterprise
- Quick diagnostic: 5 questions to determine what stage and what frameworks you need
2026-05-20 — v0.2.0 (Complete framework coverage)
Added
- Framework profiles: promptfoo, Langfuse, inspect_ai, OpenAI Evals
- promptfoo: OWASP LLM Top 10 red teaming coverage; OpenAI acquisition note (March 2026, remains MIT)
- Langfuse: EU Cloud data residency detail (cloud.langfuse.com, AWS eu-central-1, Germany); DPA link
- inspect_ai: UK AISI institutional authority section; inspect_evals library; no-semver pin-to-commit guidance
- OpenAI Evals: PM contribution opportunity section; YAML+JSONL two-file model
All 6 framework profiles now complete.
In progress (upcoming sessions)
- Decision guides: by-use-case, by-team-type, by-lifecycle-stage
- Head-to-head comparisons: RAGAS vs DeepEval, promptfoo vs inspect_ai
- EU AI Act module
2026-05-20 — v0.1.0 (Initial release)
Added
- Repository scaffold with full structure
- README with quick-pick decision table
- Framework profiles: RAGAS, DeepEval
- frameworks/README.md — how to read profiles
- CONTRIBUTING.md
Framework Version Reference
Last verified: 2026-05-20
| Framework | Current stable | Notes |
|---|---|---|
| RAGAS | v0.4.x | v0.3 → v0.4 migration required; breaking changes in metric API |
| DeepEval | v1.x | Confident AI cloud integration added in recent releases |
| promptfoo | v0.121.x | Acquired by OpenAI March 2026; remains MIT-licensed OSS |
| Langfuse | v3.x | Self-host option available; EU Cloud at cloud.langfuse.com |
| inspect_ai | Active | No semver; pin to a commit for reproducible evals |
| OpenAI Evals | Active | YAML-first format; no code required for basic evals |