Pre-Launch Eval Plan Template

Write this before your team starts instrumenting anything. The plan defines what “good enough” means before you build the scaffolding to measure it.

How to Use This Template

Fill in Sections 1–4 before writing any eval code. These sections define what you’re measuring and why — decisions that are hard to reverse once infrastructure is built.
Fill in Sections 5–6 after your first eval run. Baselines require an actual run; thresholds are set relative to baselines.
Get sign-off before shipping. Section 7 is the gate. If the plan isn’t signed off, the feature isn’t launched.
Archive this document alongside your model release artifacts. For EU AI Act Annex IV, this document is part of your technical documentation.

Document Header

Field	Value
Feature name
Product / system
Author
Date
Version	v1.0
EU AI Act scope	☐ Annex III high-risk / ☐ Out of scope / ☐ Uncertain — flag for legal review
Status	☐ Draft / ☐ In review / ☐ Approved / ☐ Active / ☐ Superseded

Section 1 — Feature Overview

What is this feature? 2–3 sentences. What does it do, who uses it, what decisions does it affect?

Example: An AI assistant that processes employee CVs and surfaces a shortlist of candidates for hiring managers. The assistant uses RAG over the company’s job description corpus to generate relevancy scores and a short narrative for each applicant. Hiring managers make the final decision; the assistant does not decide.

Intended users:

Deployment context: (geography, languages, user types, integration points)

Is a human in the loop? ☐ Yes — human makes all final decisions
☐ Partial — human reviews high-impact decisions only
☐ No — system acts autonomously

If Partial or No: what is the maximum consequence of a wrong output?

Section 2 — What We’re Measuring

List every quality dimension you care about. For each, answer: what does “good” look like for this feature?

Quality dimension	Why it matters for this feature	How we’ll measure it
Faithfulness / grounding
Answer relevancy
Retrieval precision
Retrieval recall
Hallucination rate
Bias / fairness
Adversarial robustness
Latency (P50 / P99)
Cost per query
[Domain-specific metric]

Metrics we explicitly decided NOT to measure and why:

Metric skipped	Reason

Documenting skipped metrics is as important as documenting chosen ones. An auditor will ask.

Section 3 — Evaluation Dataset

Dataset source: ☐ Synthetic (generated from corpus) — tool used: ___
☐ Real user queries (historical) — collection period: ___
☐ Expert-authored — authored by: ___
☐ Hybrid — describe: ___

Dataset size: _____ samples

Dataset coverage:

Dimension	How it’s represented
Geographic scope (if relevant)
Language(s)
User type / persona
Edge cases
Demographic diversity (if relevant)

Dataset version: v____
Storage location: ___
Access control: ___

Known gaps in this dataset:

Plan to expand or update dataset: (e.g., “add real user queries after 2 weeks in production”)

Section 4 — Framework Selection

Primary eval framework(s):

Framework	Role	Why chosen over alternatives

Frameworks considered and rejected:

Framework	Reason not chosen

Eval run location: ☐ Local (developer machine)
☐ CI/CD pipeline (framework: __)
☐ Dedicated eval infrastructure
☐ Third-party cloud (provider: __, data residency: ___)

LLM judge model: Model: ___
Provider: ___
Data sent to provider: ☐ Inputs only / ☐ Inputs + outputs / ☐ Inputs + outputs + context

If EU data residency required: confirm judge model provider is EU-compliant or use a self-hosted judge.

Section 5 — Baseline

Complete after first eval run. Do not set thresholds before you have a baseline.

Baseline run date: ___
Model version evaluated: ___
Eval framework version: ___

Metric	Baseline score	Notes

Baseline interpretation: Is this baseline acceptable? What does it reveal about the current model state?

Comparison to previous version (if applicable):

Metric	Previous	Baseline (current)	Delta

Section 6 — Pass/Fail Thresholds

Set thresholds based on baseline + acceptable tolerance. Every threshold needs an owner and a defined consequence.

Metric	Threshold (min/max)	Breach consequence
Faithfulness	≥ ___	☐ Block deploy / ☐ Alert / ☐ Log
Answer relevancy	≥ ___	☐ Block deploy / ☐ Alert / ☐ Log
Hallucination rate	≤ ___	☐ Block deploy / ☐ Alert / ☐ Log
Adversarial pass rate	≥ ___	☐ Block deploy / ☐ Alert / ☐ Log
Latency P99	≤ ___ms	☐ Block deploy / ☐ Alert / ☐ Log
[Domain metric]

Threshold review process: Who has authority to change a threshold, and what approval is required?

Threshold documentation location: (thresholds should live in version-controlled code, not just this doc)

# Example: thresholds.py — version controlled alongside eval code
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.75,
    "hallucination_rate_max": 0.10,
    "adversarial_pass_rate": 0.90,
}

Section 7 — Production Monitoring Plan

What happens after launch?

Monitoring framework: ___
Deployment: ☐ Cloud (provider: __, region: __) / ☐ Self-hosted

Sampling rate for production scoring: Scoring every inference is expensive. What % of production traffic will be scored?
☐ 100% (feasible only at low volume)
☐ ___% random sample
☐ 100% of [specific condition, e.g. “user thumbs-down signals”]

Metrics monitored in production:

Metric	Sampling strategy	Alert threshold	Alert recipient

Data retention period: ___
(For EU AI Act Annex III: recommend minimum 3 years)

Incident definition: At what failure rate or severity does an event become a reportable incident?

Re-evaluation trigger: When will you run a full eval suite again (not just production monitoring)?

☐ On every model/prompt update
☐ Monthly
☐ Quarterly
☐ When production metric drops below ___
☐ Other: ___

Section 8 — Sign-Off

All parties must sign off before the feature ships to production.

Role	Name	Decision	Date	Notes
Product Manager		☐ Approved / ☐ Rejected / ☐ Conditional
Tech Lead / Engineering		☐ Approved / ☐ Rejected / ☐ Conditional
Data / ML Lead (if applicable)		☐ Approved / ☐ Rejected / ☐ Conditional
Compliance / Legal (if EU AI Act scope)		☐ Approved / ☐ Rejected / ☐ Conditional
Security (if adversarial testing required)		☐ Approved / ☐ Rejected / ☐ Conditional

Conditions for conditional approvals:

Ship decision: ☐ Ship / ☐ Hold — reason: ___

Appendix A — Eval Run Log

Track every eval run against this plan. Update after each run.

Run date	Model version	Framework version	Key metric scores	Outcome	Notes
				☐ Pass / ☐ Fail
				☐ Pass / ☐ Fail

Appendix B — Checklist

Quick reference before sign-off meeting.

Section 1–2 (definition):

Feature described in plain language a non-ML stakeholder can understand
Every measured metric has a stated reason for inclusion
Skipped metrics are documented with rationale

Section 3 (dataset):

Dataset is version-controlled and retrievable in 12 months
Dataset coverage documented — no obvious demographic or scenario gaps
If EU AI Act scope: dataset representativeness reviewed against deployment geography

Section 4 (framework):

Framework selection is documented with alternatives considered
Data sent to LLM judge is documented — no surprise PII in context
If EU data residency required: judge model provider confirmed compliant

Section 5–6 (baseline + thresholds):

Baseline established from an actual eval run, not estimated
Every threshold has a breach consequence defined
Thresholds are in version-controlled code, not only in this doc

Section 7 (monitoring):

Production monitoring is instrumented and tested before launch
Alert recipients are confirmed and have acknowledged their role
Data retention period is set and meets any applicable regulatory requirement

Section 8 (sign-off):

All required sign-offs obtained
Conditional approvals have conditions documented and tracked

For EU AI Act Annex III high-risk AI systems: this document, along with eval run logs and .eval archive files, constitutes the “test results” section of your Annex IV technical documentation. Store alongside model artifacts.

→ EU AI Act high-risk checklist · Eval Framework RFP · Decision Guide