Pre-Deployment Eval Checklist for High-Risk AI Systems
For AI systems in EU AI Act Annex III categories. Complete before first production deployment and before each significant model update.
Before Using This Checklist
Step 0: Confirm your system is Annex III high-risk.
Check the eight categories: biometric identification, critical infrastructure, education, employment, essential services, law enforcement, migration/border control, justice/democratic processes.
If you’re not sure: assume high-risk and proceed. The cost of over-compliance is a few extra eval runs; the cost of under-compliance is formal enforcement action.
This checklist is for: AI providers (companies that develop and place the system on the market).
Deployers (companies that operate a third-party AI system) have separate obligations under Chapter III, Section 3 — this checklist doesn’t cover those.
Phase 1 — Pre-Development (Before You Build)
1.1 Risk identification
- Document the intended purpose — specific use case, intended users, deployment context, geographic scope
- Identify foreseeable misuse scenarios — what could go wrong if the system is used outside its intended purpose?
- Identify foreseeable risks — list risks categorised by: probability of occurrence, severity of harm, number of people potentially affected
- Define risk acceptance criteria — what metric score constitutes “acceptable” for each risk category?
Article 9 requires this to be documented and iterative. Create a living risk register.
1.2 Eval dataset planning
- Define the evaluation dataset scope — what user types, geographic contexts, languages, and use case variants must the dataset cover?
- Document dataset construction methodology — how samples were selected, synthetic vs. real, sources used
- Assess representativeness — does the dataset reflect the actual deployment demographic and behavioural context?
- Check for bias in the evaluation set — are any demographic groups systematically underrepresented?
Article 10 requires testing data to be relevant to the intended deployment scope. Dataset methodology is a compliance artifact.
Phase 2 — Development Evals
2.1 Quality baseline (DeepEval or RAGAS)
For RAG systems:
- Faithfulness ≥ threshold defined in risk register (typically 0.85 for high-risk applications)
- Context precision ≥ threshold
- Context recall ≥ threshold
- Answer relevancy ≥ threshold
For general LLM systems:
- Answer relevancy ≥ threshold
- Hallucination rate ≤ threshold
- Bias score within acceptable range
Use DeepEval’s pytest integration so threshold violations block CI. Keep threshold values in version-controlled config.
# thresholds.py — version controlled alongside model artifacts
THRESHOLDS = {
"faithfulness": 0.85,
"context_precision": 0.80,
"answer_relevancy": 0.75,
"hallucination_max": 0.10,
}
2.2 Bias and fairness testing
- Define demographic dimensions relevant to your use case — e.g., gender, age, nationality, language, disability status
- Run DeepEval BiasMetric across demographic variants of your test cases
- Document metric scores per demographic group — not just aggregate scores
- Define acceptable disparity threshold — e.g., no demographic group may score more than 15% below the overall average
- If bias found: document the finding, the root cause analysis, and the mitigation taken
Article 10 and Article 15 both touch on bias. Aggregate metrics hide demographic disparities — always disaggregate.
2.3 Adversarial robustness (promptfoo)
- Define the adversarial threat model — who is the adversary? What are they trying to do?
- Run promptfoo OWASP LLM Top 10 suite
- Run promptfoo with harm categories relevant to your domain:
- Employment decisions →
harmful:discrimination,pii:direct - Financial services →
harmful:financial-crime,pii:indirect - Healthcare →
harmful:medical-advice-without-disclaimer,pii:medical - Government →
harmful:politics,prompt-injection
- Employment decisions →
- Prompt injection test — can user input override system instructions?
- Indirect injection test — if system uses RAG, can adversarial content in retrieved documents affect behaviour?
- Document all findings — vulnerabilities found, severity, remediation taken, re-test result
# promptfooconfig.yaml — version controlled
redteam:
purpose: "HR screening assistant for employment decisions"
plugins:
- owasp:llm
- harmful:discrimination
- pii:direct
- harmful:hate
strategies:
- jailbreak
- prompt-injection
Article 9 (risk management) and Article 15 (robustness against adversarial attacks). The YAML file is your documented test specification.
Phase 3 — Pre-Deployment Safety Evaluation
3.1 Structured safety evaluation (inspect_ai)
This is the audit-grade step. The output of this phase goes into your Annex IV technical documentation.
- Pin inspect_ai to a specific commit:
git+https://github.com/UKGovernmentBEIS/inspect_ai.git@<commit-hash> - Run AgentHarm benchmark (from inspect_evals):
inspect eval inspect_evals/src/inspect_evals/agentharm/agentharm.py \ --model <your-model> \ --log-dir ./eval-logs/pre-deploy/$(date +%Y%m%d)/ - Run domain-relevant benchmarks from inspect_evals:
- CyberSecEval for any system with internet access or code execution
- HarmBench for general safety baseline
- Write custom inspect_ai Tasks for domain-specific risks identified in Phase 1:
@task def employment_fairness_eval(): return Task( dataset=csv_dataset("employment_fairness_scenarios.csv"), solver=generate(), scorer=model_graded_fact() ) - Archive
.evallog files — these are your compliance artifacts - Document summary findings from each eval run in your risk register
Articles 9, 13, 15, 17, Annex IV. The .eval logs are the primary documentary evidence for conformity assessment.
3.2 Annex IV documentation package
Compile the following before deployment. This is the “test results” section of your Annex IV technical documentation:
-
Eval run summary table:
Eval Date Model version inspect_ai commit Score Pass/fail AgentHarm YYYY-MM-DD model-v1.2 abc1234 0.87 Pass CyberSecEval YYYY-MM-DD model-v1.2 abc1234 0.91 Pass Domain eval YYYY-MM-DD model-v1.2 abc1234 0.83 Pass - Archived
.evallog files — stored alongside model artifacts, not just in a compliance folder - promptfoo adversarial report (JSON) — link from risk register
- Bias testing results — disaggregated by demographic dimension
- Dataset documentation — scope, methodology, representativeness assessment
- Risk register — current state, including any residual risks and accepted mitigations
Phase 4 — Production Monitoring
4.1 Langfuse instrumentation
- Instrument all LLM calls with
@observe()decorator or SDK integration - Set up EU-compliant data handling:
- EU Cloud (cloud.langfuse.com, AWS eu-central-1) — sign DPA
- Self-hosted — verify infrastructure is in EU and access controls are documented
- Attach quality scores to production traces — minimum: faithfulness and answer relevancy on sampled traffic (10–20%)
- Configure alert thresholds:
# Alert when faithfulness drops below 0.75 on rolling 24-hour window # Alert when error rate exceeds 5% # Alert when latency p99 exceeds threshold - Configure data retention — minimum retention for compliance: 3 years recommended (check your sector’s specific requirements)
Article 17 (post-market monitoring) and Article 9 (continuous risk management).
4.2 Incident handling
- Define “serious incident” threshold — at what failure rate or severity does an event become an Art. 73 serious incident requiring notification?
- Create incident response procedure — who is notified, what investigation is triggered, what remediation steps
- Verify Langfuse alerting reaches the right people — on-call engineer and compliance owner
Article 73 requires providers to report serious incidents to market surveillance authorities.
Phase 5 — Ongoing (Per Model Update)
5.1 Re-evaluation triggers
Re-run the full Phase 2–3 checklist when:
- Model version changes (including fine-tuning runs)
- System prompt changes substantively
- Retrieval corpus changes by more than 20%
- Significant new use case or user type is added
- Adversarial technique is discovered that your current suite didn’t catch
- A production incident reveals a failure mode not covered by existing evals
5.2 Periodic review
- Quarterly: Review Langfuse quality trends — is performance stable or drifting?
- Every 6 months: Re-run inspect_ai safety evals against latest model version
- Annually: Review and update risk register; re-assess dataset representativeness
Recommended Stacks by Risk Tier
Tier 1 — Highest risk (employment, credit, law enforcement, biometrics)
Full stack. No shortcuts.
Development: RAGAS or DeepEval (quality baseline) + DeepEval bias metrics
Adversarial: promptfoo (broad coverage) + inspect_ai (audit-grade safety)
Production: Langfuse self-hosted (data sovereignty) with 3-year retention
Documentation: inspect_ai .eval logs + Langfuse exports in Annex IV package
Expected setup time: 2–4 weeks for first compliant deployment.
Ongoing cost (1k production queries/day): inspect_ai eval runs quarterly ($20–50/run) + Langfuse self-hosted (infra cost only).
Tier 2 — High risk (education access, essential services, critical infrastructure)
Full stack with some flexibility on tooling choices.
Development: RAGAS or DeepEval (quality baseline)
Adversarial: promptfoo (OWASP LLM Top 10 minimum) + inspect_ai AgentHarm
Production: Langfuse (EU Cloud or self-hosted)
Documentation: inspect_ai .eval logs in Annex IV package
Expected setup time: 1–2 weeks.
Tier 3 — Moderate risk (education tools, general service assistants near Annex III)
Lighter stack; escalate to Tier 2 if classification changes.
Development: DeepEval (CI/CD quality gate)
Adversarial: promptfoo (pre-launch red team, minimum 1 campaign)
Production: Langfuse (EU Cloud acceptable)
Documentation: promptfoo JSON report + DeepEval CI artifacts
Expected setup time: 3–5 days.
What This Checklist Doesn’t Cover
- Deployer obligations — if you are operating a third-party AI system (not building it), you have separate obligations under Art. 26. This checklist is for providers.
- GPAI model compliance — if you are deploying a general-purpose AI model (not a specific-purpose system), the Chapter V rules apply instead.
- Notified body conformity assessment — for some Annex III categories, a third-party conformity assessment by a notified body is required. This checklist prepares you for that assessment; it doesn’t replace it.
- Sector-specific regulation — AI in medical devices (MDR), financial services (DORA), or aviation operates under additional sector rules that interact with the AI Act.
This checklist reflects EU AI Act OJ L 2024/1689. It is practical guidance, not legal advice. Engage a notified body and legal counsel for formal conformity assessment.