Traditional acceptance criteria follow a simple pattern: given a specific input, the system must produce a specific output. A login form with valid credentials must redirect to the dashboard. A payment with sufficient balance must return a confirmation. These criteria are binary and verifiable. Agentic systems break this pattern entirely. Given the same input on different runs, an agent may produce meaningfully different outputs — not because it is malfunctioning, but because probabilistic reasoning is how it works. Managers who write binary acceptance criteria for agentic systems are setting up projects to fail the moment someone tests edge cases.
Acceptance criteria for probabilistic systems must operate at the population level, not the individual output level. Instead of "the agent must produce the correct answer," the criterion becomes "the agent must produce an acceptable answer at least 85% of the time across a representative sample of inputs." This reframe has three immediate implications. First, you need an evaluation set: a defined collection of representative inputs with known acceptable outputs that can be run consistently. Second, you need to define "acceptable": in most domains, there is a range of outputs that would satisfy a competent human reviewer, not a single correct answer. Third, you need a threshold: the minimum pass rate that makes the workflow viable, derived from the economics of human review rather than from an arbitrary sense of what AI should be capable of.
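To make the population-level idea concrete, here is a minimal sketch of a pass-rate check. The record shape (`inputId`, `acceptable`) and the function names are illustrative assumptions, not part of any particular evaluation framework; the 0.85 threshold is simply the example figure used above.

```javascript
// Hypothetical evaluation results: one record per evaluation-set input.
// `acceptable` holds the reviewer's (or grader's) judgment for that output.
const results = [
  { inputId: "case-001", acceptable: true },
  { inputId: "case-002", acceptable: true },
  { inputId: "case-003", acceptable: false },
  { inputId: "case-004", acceptable: true },
];

// Fraction of outputs judged acceptable across the whole set.
function passRate(records) {
  if (records.length === 0) {
    throw new Error("Evaluation set is empty; pass rate is undefined.");
  }
  const passed = records.filter((r) => r.acceptable).length;
  return passed / records.length;
}

// Population-level criterion: the rate passes or fails, not any single output.
function meetsThreshold(records, minimumRate) {
  return passRate(records) >= minimumRate;
}

console.log(passRate(results));             // 0.75
console.log(meetsThreshold(results, 0.85)); // false
```

A four-item set is far too small to be representative; it is used here only so the arithmetic is easy to follow.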
Ethics enters acceptance criteria when the distribution of errors is not uniform. An agent that achieves 92% accuracy overall may be performing at 75% accuracy for one demographic group and 97% for another. An aggregate acceptance criterion that ignores this distribution will accept a system that is systematically unfair. Responsible acceptance criteria for agentic systems must include equity checks: performance must be measured across meaningful subgroups, and a performance gap that would be unacceptable in human decision-making is equally unacceptable in an agent. This is not a diversity initiative — it is a correctness requirement. A system that performs well on average but harms a specific population is not working correctly.
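One way to sketch an equity check is to disaggregate results by subgroup and compare the best and worst pass rates. The subgroup labels, the record shape, and both function names below are assumptions made for illustration; a real evaluation would need far more samples per subgroup for the gap to be meaningful.

```javascript
// Hypothetical results tagged with the subgroup each input belongs to.
const taggedResults = [
  { subgroup: "group_a", acceptable: true },
  { subgroup: "group_a", acceptable: true },
  { subgroup: "group_a", acceptable: true },
  { subgroup: "group_a", acceptable: false },
  { subgroup: "group_b", acceptable: true },
  { subgroup: "group_b", acceptable: false },
];

// Pass rate per subgroup, returned as { subgroupName: rate }.
function passRateBySubgroup(records) {
  const counts = new Map();
  for (const { subgroup, acceptable } of records) {
    const entry = counts.get(subgroup) ?? { passed: 0, total: 0 };
    entry.total += 1;
    if (acceptable) entry.passed += 1;
    counts.set(subgroup, entry);
  }
  const rates = {};
  for (const [name, { passed, total }] of counts) {
    rates[name] = passed / total;
  }
  return rates;
}

// Largest difference in pass rate between any two subgroups.
function maxPerformanceGap(rates) {
  const values = Object.values(rates);
  if (values.length === 0) return 0; // no subgroups, no measurable gap
  return Math.max(...values) - Math.min(...values);
}

const subgroupRates = passRateBySubgroup(taggedResults);
console.log(subgroupRates);                    // { group_a: 0.75, group_b: 0.5 }
console.log(maxPerformanceGap(subgroupRates)); // 0.25
```

The aggregate pass rate for this sample is 4 of 6, which on its own says nothing about the 25-point gap between the two groups.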
The practical format for CAMP-level acceptance criteria has four components. Performance threshold: the minimum acceptable rate across the full evaluation set. Equity threshold: the maximum acceptable performance gap across defined subgroups. Failure mode specification: the categories of error that are unacceptable at any frequency (such as incorrect denials in a medical context, or discriminatory decisions in an HR context). Review coverage: the percentage of outputs that must pass through human review and under what conditions that percentage can decrease as confidence in the agent improves. Writing criteria in this format forces the scoping conversation that most agentic projects skip — and prevents the common outcome of a system that passes acceptance testing and fails stakeholder trust.
What this means in practice
The practical question is not whether the idea is interesting but how a team turns it into a workflow that can be inspected, repeated, and improved. For this topic, the operating focus is direct: write acceptance criteria in the four-component format, and make them honest about probabilistic behavior, testable against an evaluation set, and inclusive of equity requirements.
That means the engineering work starts before the first model call. The team must decide what the agent is allowed to know, what it is allowed to do, what evidence it must produce, and which actions require a human decision. This is the difference between an impressive demo and a system that can survive real users, changing inputs, and production constraints.
A credible implementation also includes a feedback path. Every agent run should leave behind enough context for another engineer to answer four questions: what goal was attempted, what context was used, which tools were called, and why the system believed the task was complete. If those questions cannot be answered from logs, traces, or structured outputs, the agent is still operating as a black box.
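The four questions can be treated as required fields on every run record. The sketch below assumes a hypothetical record shape (`goal`, `contextSources`, `toolCalls`, `completionRationale`) and a hypothetical helper, `missingAuditFields`; real tracing tools will use their own schemas.

```javascript
// The four questions, expressed as required fields and their expected kinds.
// Field names are illustrative assumptions, not a standard schema.
const AUDIT_FIELDS = {
  goal: "string",                // what goal was attempted
  contextSources: "array",       // what context was used
  toolCalls: "array",            // which tools were called (may be empty)
  completionRationale: "string", // why the system believed it was done
};

// Returns the names of audit fields that are absent or unusable.
function missingAuditFields(runRecord) {
  return Object.entries(AUDIT_FIELDS)
    .filter(([field, kind]) => {
      const value = runRecord[field];
      if (kind === "array") return !Array.isArray(value);
      return typeof value !== "string" || value.trim() === "";
    })
    .map(([field]) => field);
}

const completeRun = {
  goal: "Summarize claim and recommend next step",
  contextSources: ["claims_db:record", "policy_handbook:section"],
  toolCalls: [{ tool: "lookup_claim", status: "ok" }],
  completionRationale: "All required summary fields populated from the retrieved record",
};
const opaqueRun = { goal: "Summarize claim", toolCalls: [] };

console.log(missingAuditFields(completeRun)); // []
console.log(missingAuditFields(opaqueRun));   // [ 'contextSources', 'completionRationale' ]
```

A run with any missing field is, in the terms used above, still a black box for that question.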
A simple architecture to reason from
Use this diagram as a starting point, not as a universal blueprint. The important move is to make the stages visible. Once stages are visible, you can assign owners, define contracts, set permissions, measure quality, and decide where human review belongs.
Performance threshold: minimum acceptable rate across the full evaluation set.
Equity threshold: maximum acceptable performance gap across defined subgroups.
Failure mode specification: categories of error unacceptable at any frequency.
Review coverage: percentage of outputs requiring human review, plus the conditions for reducing it.
Evaluation set: representative inputs with known acceptable outputs.
Subgroup check: disaggregate results by meaningful subgroup before accepting.
Acceptance decision: all four components must pass, not just aggregate accuracy.
Four-component acceptance criteria definition
The example below is intentionally small. Production agentic systems should start with compact contracts like this because small contracts are testable. Once the boundary is working, you can add richer orchestration without losing control of the core behavior.
const acceptanceCriteria = {
  // Component 1: Population-level performance threshold
  performanceThreshold: {
    metric: "acceptable_completion_rate",
    minimum: 0.87,
    evaluationSetSize: 500,
    evaluationSetDescription: "Representative sample of production input types",
  },
  // Component 2: Equity threshold
  equityThreshold: {
    maxPerformanceGap: 0.08, // < 8 percentage points between any two subgroups
    subgroups: ["geography", "demographics", "account_type"],
  },
  // Component 3: Failure mode specification (zero tolerance)
  prohibitedFailures: [
    "incorrect_denial_in_medical_context",
    "discriminatory_classification",
    "pii_exposure_in_output",
  ],
  // Component 4: Review coverage
  reviewCoverage: {
    initialRate: 0.30, // 30% human review at launch
    reductionCondition: "accuracy stable within 2pp of eval set for 30 days",
    minimumRate: 0.10, // never go below 10%
  },
};
Implementation notes
Treat these notes as the first design review checklist. They are deliberately concrete because agentic systems fail most often in the gaps between the model, the tools, the data, and the human operating process.
Write criteria at the population level — binary pass/fail per output will fail at edge cases.
Build your evaluation set before writing criteria — the set defines what the threshold means.
Equity checks are correctness requirements, not diversity initiatives — budget them accordingly.
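As a rough sketch of how these notes combine, the function below checks a measured evaluation summary against all four components of the criteria object defined earlier. The `measured` shape, its numbers, and the `evaluateAcceptance` name are assumptions made for illustration, and the criteria are repeated in compact form so the example runs on its own.

```javascript
// Compact copy of the four-component criteria, with the article's example values.
const criteria = {
  performanceThreshold: { minimum: 0.87 },
  equityThreshold: { maxPerformanceGap: 0.08 },
  prohibitedFailures: [
    "incorrect_denial_in_medical_context",
    "discriminatory_classification",
    "pii_exposure_in_output",
  ],
  reviewCoverage: { initialRate: 0.30, minimumRate: 0.10 },
};

// Hypothetical summary of one evaluation run (invented numbers).
const measured = {
  passRate: 0.91,
  subgroupPassRates: { region_north: 0.93, region_south: 0.82 },
  observedFailureCategories: ["formatting_error"],
  plannedReviewRate: 0.30,
};

// Every component is checked separately; acceptance requires all four to pass.
function evaluateAcceptance(criteria, measured) {
  const rates = Object.values(measured.subgroupPassRates);
  const gap = Math.max(...rates) - Math.min(...rates);
  const prohibitedSeen = measured.observedFailureCategories.filter((category) =>
    criteria.prohibitedFailures.includes(category)
  );
  const checks = {
    performance: measured.passRate >= criteria.performanceThreshold.minimum,
    equity: gap < criteria.equityThreshold.maxPerformanceGap,
    failureModes: prohibitedSeen.length === 0, // zero tolerance
    reviewCoverage: measured.plannedReviewRate >= criteria.reviewCoverage.minimumRate,
  };
  return { accepted: Object.values(checks).every(Boolean), checks };
}

const decision = evaluateAcceptance(criteria, measured);
console.log(decision.accepted); // false
console.log(decision.checks);
// { performance: true, equity: false, failureModes: true, reviewCoverage: true }
```

In this invented run the aggregate pass rate clears the threshold, yet the system is not accepted because the subgroup gap of roughly 11 points exceeds the 8-point limit.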
Common failure modes
The fastest way to make an article useful is to name how the pattern breaks. These are the failure modes to watch for when a team moves from reading about this idea to deploying it inside a real workflow.
Operating checklist
Before this pattern graduates from experiment to production, require a short operating checklist. The checklist should include the owner of the workflow, the allowed tools, the risk rating for each tool, the data sources the agent can use, the completion criteria, the review path, and the rollback plan. If a team cannot fill out that checklist, the workflow is not ready for higher autonomy.
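The checklist is easier to enforce when it is a structured object rather than a document. The sketch below is one possible shape under stated assumptions: the field names mirror the items listed above, while the draft values and the helper functions (`unfilledChecklistItems`, `unratedTools`) are hypothetical.

```javascript
// Checklist items taken from the paragraph above; key names are assumptions.
const CHECKLIST_ITEMS = [
  "owner", "allowedTools", "toolRiskRatings", "dataSources",
  "completionCriteria", "reviewPath", "rollbackPlan",
];

// Items that are missing or left empty.
function unfilledChecklistItems(checklist) {
  return CHECKLIST_ITEMS.filter((item) => {
    const value = checklist[item];
    if (value === undefined || value === null) return true;
    if (typeof value === "string") return value.trim() === "";
    if (Array.isArray(value)) return value.length === 0;
    if (typeof value === "object") return Object.keys(value).length === 0;
    return false;
  });
}

// Allowed tools that have no risk rating assigned.
function unratedTools(checklist) {
  const ratings = checklist.toolRiskRatings ?? {};
  return (checklist.allowedTools ?? []).filter((tool) => ratings[tool] === undefined);
}

// Hypothetical draft with two gaps and one unrated tool.
const draftChecklist = {
  owner: "claims-ops-team",
  allowedTools: ["lookup_claim", "send_email"],
  toolRiskRatings: { lookup_claim: "low" },
  dataSources: ["claims_db"],
  completionCriteria: "Summary approved or escalated",
  reviewPath: "",
};

console.log(unfilledChecklistItems(draftChecklist)); // [ 'reviewPath', 'rollbackPlan' ]
console.log(unratedTools(draftChecklist));           // [ 'send_email' ]
```

Under this sketch a workflow would be considered ready for higher autonomy only when both lists come back empty.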
The checklist should also define how the system will be evaluated after launch. Useful metrics include task success rate, human correction rate, average iterations per completed task, cost per successful run, escalation rate, and the number of blocked tool calls. These metrics turn agent quality into an engineering conversation instead of an opinion about whether the output felt good.
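These metrics can be computed directly from run logs. The sketch below assumes a hypothetical log shape and treats a "completed task" as a run marked `succeeded`; costs are kept in integer cents to avoid floating-point noise. All names and numbers are illustrative.

```javascript
// Hypothetical run log entries; the field names are assumptions.
const runs = [
  { succeeded: true,  humanCorrected: false, iterations: 3, costCents: 4,  escalated: false, blockedToolCalls: 0 },
  { succeeded: true,  humanCorrected: true,  iterations: 5, costCents: 6,  escalated: false, blockedToolCalls: 1 },
  { succeeded: false, humanCorrected: false, iterations: 8, costCents: 10, escalated: true,  blockedToolCalls: 2 },
  { succeeded: true,  humanCorrected: false, iterations: 4, costCents: 4,  escalated: false, blockedToolCalls: 0 },
];

function summarizeRuns(runs) {
  const total = runs.length;
  const successes = runs.filter((r) => r.succeeded);
  const sum = (items, pick) => items.reduce((acc, item) => acc + pick(item), 0);
  // Rates are null when there is nothing to divide by.
  const rate = (count) => (total === 0 ? null : count / total);
  const perSuccess = (value) => (successes.length === 0 ? null : value / successes.length);
  return {
    taskSuccessRate: rate(successes.length),
    humanCorrectionRate: rate(runs.filter((r) => r.humanCorrected).length),
    // Iterations are averaged over successful runs only.
    avgIterationsPerCompletedTask: perSuccess(sum(successes, (r) => r.iterations)),
    // Total spend, including failed runs, divided by the number of successes.
    costCentsPerSuccessfulRun: perSuccess(sum(runs, (r) => r.costCents)),
    escalationRate: rate(runs.filter((r) => r.escalated).length),
    blockedToolCalls: sum(runs, (r) => r.blockedToolCalls),
  };
}

const summary = summarizeRuns(runs);
console.log(summary.taskSuccessRate);           // 0.75
console.log(summary.costCentsPerSuccessfulRun); // 8
```

Charging failed runs to the cost of each success is a design choice in this sketch: it keeps wasted spend visible rather than letting it disappear from the average.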
Finally, make the learning loop explicit. When the agent fails, decide whether the fix belongs in the prompt, the retrieval layer, the tool contract, the permission model, the evaluation suite, or the human process. Mature agentic engineering is not the absence of failures. It is the ability to classify failures quickly and improve the system without expanding risk.
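A lightweight way to make that loop visible is to tag each failure with the layer where the fix belongs and then count. The layer identifiers below paraphrase the list above; the failure records and the `tallyByFixLayer` helper are hypothetical.

```javascript
// Fix layers named in the text, as identifiers (naming is an assumption).
const FIX_LAYERS = [
  "prompt", "retrieval", "tool_contract",
  "permission_model", "evaluation_suite", "human_process",
];

// Counts failures per fix layer and collects any that were never classified.
function tallyByFixLayer(failures) {
  const tally = Object.fromEntries(FIX_LAYERS.map((layer) => [layer, 0]));
  const unclassified = [];
  for (const failure of failures) {
    if (FIX_LAYERS.includes(failure.fixLayer)) {
      tally[failure.fixLayer] += 1;
    } else {
      unclassified.push(failure.id);
    }
  }
  return { tally, unclassified };
}

// Hypothetical failure log.
const failureLog = [
  { id: "f-1", fixLayer: "retrieval" },
  { id: "f-2", fixLayer: "prompt" },
  { id: "f-3", fixLayer: "retrieval" },
  { id: "f-4" }, // not yet triaged
];

const triage = tallyByFixLayer(failureLog);
console.log(triage.tally.retrieval); // 2
console.log(triage.unclassified);    // [ 'f-4' ]
```

A growing `unclassified` list is itself a signal: it suggests failures are arriving faster than the team can classify them.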
