Traditional acceptance criteria follow a simple pattern: given a specific input, the system must produce a specific output. A login form with valid credentials must redirect to the dashboard. A payment with sufficient balance must return a confirmation. These criteria are binary and verifiable. Agentic systems break this pattern entirely. Given the same input on different runs, an agent may produce meaningfully different outputs — not because it is malfunctioning, but because probabilistic reasoning is how it works. Managers who write binary acceptance criteria for agentic systems are setting up projects to fail the moment someone tests edge cases.

Acceptance criteria for probabilistic systems must operate at the population level, not the individual output level. Instead of "the agent must produce the correct answer," the criterion becomes "the agent must produce an acceptable answer at least 85% of the time across a representative sample of inputs." This reframe has three immediate implications. First, you need an evaluation set: a defined collection of representative inputs with known acceptable outputs that can be run consistently. Second, you need to define "acceptable": in most domains, there is a range of outputs that would satisfy a competent human reviewer, not a single correct answer. Third, you need a threshold: the minimum pass rate that makes the workflow viable, derived from the economics of human review rather than from an arbitrary sense of what AI should be capable of.
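To make the population-level idea concrete, here is a minimal sketch of scoring recorded agent outputs against an evaluation set. The names `EvalCase`, `scorePopulation`, and `outputsById` are hypothetical and exist only for this illustration. The refund scenario and the 0.85 threshold are assumptions that echo the example above, not recommendations.

```typescript
// Hypothetical shapes for illustration; none of these names come from a real library.
interface EvalCase {
  id: string;
  input: string;
  // Encodes the range of outputs a competent reviewer would accept.
  isAcceptable: (output: string) => boolean;
}

interface PopulationResult {
  total: number;
  passed: number;
  passRate: number;
  meetsThreshold: boolean;
}

// Scores recorded agent outputs against the evaluation set.
// A case with no recorded output counts as a failure.
function scorePopulation(
  cases: EvalCase[],
  outputsById: Record<string, string>,
  threshold: number,
): PopulationResult {
  if (cases.length === 0) {
    throw new Error("evaluation set is empty");
  }
  let passed = 0;
  for (const c of cases) {
    const output = outputsById[c.id];
    if (output !== undefined && c.isAcceptable(output)) {
      passed += 1;
    }
  }
  const passRate = passed / cases.length;
  return { total: cases.length, passed, passRate, meetsThreshold: passRate >= threshold };
}

// Usage: four refund-request cases where any reply mentioning "refund" is acceptable.
const sampleCases: EvalCase[] = ["c1", "c2", "c3", "c4"].map((id) => ({
  id,
  input: `refund request ${id}`,
  isAcceptable: (output: string) => output.toLowerCase().includes("refund"),
}));

const sampleOutputs: Record<string, string> = {
  c1: "Your refund has been issued.",
  c2: "Refund approved, expect it in 5 days.",
  c3: "Please contact support.", // not acceptable under this check
  c4: "We have processed the refund.",
};

// 3 of 4 cases pass, so passRate is 0.75 and the 0.85 threshold is not met.
const populationResult = scorePopulation(sampleCases, sampleOutputs, 0.85);
```

The acceptability check is a function rather than an expected string because the criterion is "a reviewer would accept this," not "this matches one reference answer." A real evaluation set would be far larger than four cases.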

Ethics enters acceptance criteria when the distribution of errors is not uniform. An agent that achieves 92% accuracy overall may be performing at 75% accuracy for one demographic group and 97% for another. An aggregate acceptance criterion that ignores this distribution will accept a system that is systematically unfair. Responsible acceptance criteria for agentic systems must include equity checks: performance must be measured across meaningful subgroups, and a performance gap that would be unacceptable in human decision-making is equally unacceptable in an agent. This is not a diversity initiative — it is a correctness requirement. A system that performs well on average but harms a specific population is not working correctly.
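The masking effect described above can be shown with a small disaggregation sketch. The type and function names (`LabeledResult`, `disaggregate`, `makeResults`) and the subgroup labels are hypothetical, and the counts are invented for illustration: 20 results in one group at 75% and 100 in another at 97%.

```typescript
// One scored output, tagged with the subgroup it belongs to (hypothetical shape).
interface LabeledResult {
  subgroup: string;
  acceptable: boolean;
}

interface SubgroupReport {
  overall: number;
  bySubgroup: Record<string, number>;
  maxGap: number; // largest difference in pass rate between any two subgroups
}

function disaggregate(results: LabeledResult[]): SubgroupReport {
  if (results.length === 0) {
    throw new Error("no results to analyze");
  }
  const counts: Record<string, { passed: number; total: number }> = {};
  let passedOverall = 0;
  for (const r of results) {
    const entry = counts[r.subgroup] ?? { passed: 0, total: 0 };
    entry.total += 1;
    if (r.acceptable) {
      entry.passed += 1;
      passedOverall += 1;
    }
    counts[r.subgroup] = entry;
  }

  const bySubgroup: Record<string, number> = {};
  let lowest = 1;
  let highest = 0;
  for (const name of Object.keys(counts)) {
    const entry = counts[name];
    if (!entry) continue;
    const rate = entry.passed / entry.total;
    bySubgroup[name] = rate;
    lowest = Math.min(lowest, rate);
    highest = Math.max(highest, rate);
  }
  return { overall: passedOverall / results.length, bySubgroup, maxGap: highest - lowest };
}

// Helper that builds `total` results for one subgroup, the first `passed` of them acceptable.
function makeResults(subgroup: string, passed: number, total: number): LabeledResult[] {
  const out: LabeledResult[] = [];
  for (let i = 0; i < total; i++) {
    out.push({ subgroup, acceptable: i < passed });
  }
  return out;
}

// Invented data: group_a passes 15 of 20, group_b passes 97 of 100.
const subgroupReport = disaggregate([
  ...makeResults("group_a", 15, 20),
  ...makeResults("group_b", 97, 100),
]);
// The overall rate is about 93%, yet the gap between the two groups is about 22 points.
```

The aggregate looks healthy because the larger group dominates it. Only the per-subgroup view exposes the gap, which is why the equity check has to run on disaggregated results.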

The practical format for CAMP-level acceptance criteria has four components. Performance threshold: the minimum acceptable rate across the full evaluation set. Equity threshold: the maximum acceptable performance gap across defined subgroups. Failure mode specification: the categories of error that are unacceptable at any frequency (such as incorrect denials in a medical context, or discriminatory decisions in an HR context). Review coverage: the percentage of outputs that must pass through human review and under what conditions that percentage can decrease as confidence in the agent improves. Writing criteria in this format forces the scoping conversation that most agentic projects skip — and prevents the common outcome of a system that passes acceptance testing and fails stakeholder trust.

What this means in practice

The practical question is not whether the idea is interesting but how a team turns it into a workflow that can be inspected, repeated, and improved. For this topic, the operating focus is direct: write acceptance criteria in the four-component format that are honest about probabilistic behavior, testable against an evaluation set, and inclusive of equity requirements.

That means the engineering work starts before the first model call. The team must decide what the agent is allowed to know, what it is allowed to do, what evidence it must produce, and which actions require a human decision. This is the difference between an impressive demo and a system that can survive real users, changing inputs, and production constraints.

A credible implementation also includes a feedback path. Every agent run should leave behind enough context for another engineer to answer four questions: what goal was attempted, what context was used, which tools were called, and why the system believed the task was complete. If those questions cannot be answered from logs, traces, or structured outputs, the agent is still operating as a black box.
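One way to make those four questions checkable is to give every run a structured record and flag any record that cannot answer them. The sketch below is an assumption about what such a record might look like. The field names (`goal`, `contextSources`, `toolCalls`, `completionRationale`) and the function `missingAuditFields` are hypothetical, not taken from any tracing library.

```typescript
interface ToolCall {
  tool: string;
  argsSummary: string;
  resultSummary: string;
}

// Hypothetical per-run record; each field maps to one of the four questions.
interface AgentRunRecord {
  runId: string;
  goal: string; // what goal was attempted
  contextSources: string[]; // what context was used
  toolCalls: ToolCall[]; // which tools were called
  completionRationale: string; // why the system believed the task was complete
}

// Returns the names of the audit fields a record fails to answer.
// Arrays must be present but may be empty: "no tools were called" is a valid answer.
function missingAuditFields(record: Partial<AgentRunRecord>): string[] {
  const missing: string[] = [];
  if (!record.goal || record.goal.trim() === "") missing.push("goal");
  if (record.contextSources === undefined) missing.push("contextSources");
  if (record.toolCalls === undefined) missing.push("toolCalls");
  if (!record.completionRationale || record.completionRationale.trim() === "") {
    missing.push("completionRationale");
  }
  return missing;
}

// Usage: this run logged its goal, context, and tools, but not why it stopped.
const incompleteRun: Partial<AgentRunRecord> = {
  runId: "run-001",
  goal: "Summarize the customer's open support tickets",
  contextSources: ["ticket-db"],
  toolCalls: [{ tool: "search_tickets", argsSummary: "status=open", resultSummary: "3 tickets" }],
};

const auditGaps = missingAuditFields(incompleteRun);
// auditGaps is ["completionRationale"]: the run is still partly a black box.
```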

Reference Diagram

A simple architecture to reason from

Use this diagram as a starting point, not as a universal blueprint. The important move is to make the stages visible. Once stages are visible, you can assign owners, define contracts, set permissions, measure quality, and decide where human review belongs.

Workflow map, 7 stages. Read left to right: state moves through controlled boundaries.

Performance Threshold

Minimum acceptable rate across the full evaluation set.

Equity Threshold

Maximum acceptable performance gap across defined subgroups.

Failure Mode Specification

Categories of error unacceptable at any frequency.

Review Coverage

% of outputs requiring human review + conditions for reduction.

Evaluation Set

Representative inputs with known acceptable outputs.

Subgroup Analysis

Disaggregate results by meaningful subgroup before accepting.

Pass / Fail Decision

All four components must pass — not just aggregate accuracy.

Code Example

Four-component acceptance criteria definition

The example below is intentionally small. Production agentic systems should start with compact contracts like this because small contracts are testable. Once the boundary is working, you can add richer orchestration without losing control of the core behavior.

const acceptanceCriteria = {
  // Component 1: Population-level performance threshold
  performanceThreshold: {
    metric: "acceptable_completion_rate",
    minimum: 0.87,
    evaluationSetSize: 500,
    evaluationSetDescription: "Representative sample of production input types",
  },

  // Component 2: Equity threshold
  equityThreshold: {
    maxPerformanceGap: 0.08,  // < 8 percentage points between any two subgroups
    subgroups: ["geography", "demographics", "account_type"],
  },

  // Component 3: Failure mode specification (zero tolerance)
  prohibitedFailures: [
    "incorrect_denial_in_medical_context",
    "discriminatory_classification",
    "pii_exposure_in_output",
  ],

  // Component 4: Review coverage
  reviewCoverage: {
    initialRate: 0.30,  // 30% human review at launch
    reductionCondition: "accuracy stable within 2pp of eval set for 30 days",
    minimumRate: 0.10,  // never go below 10%
  },
};

Illustrative pattern — not production-ready
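A criteria definition only becomes an acceptance test once something evaluates results against it. The sketch below shows one possible pass/fail decision in which all four components must hold. `AcceptanceCriteria` is a trimmed, hypothetical mirror of the object above, and `EvaluationSummary` and `decideAcceptance` are assumptions introduced for this illustration. The equity check treats the gap as failing when it reaches the configured maximum, following the "< 8 percentage points" comment in the definition.

```typescript
// Trimmed mirror of the criteria object, limited to the fields the decision needs.
interface AcceptanceCriteria {
  performanceThreshold: { minimum: number };
  equityThreshold: { maxPerformanceGap: number };
  prohibitedFailures: string[];
  reviewCoverage: { minimumRate: number };
}

// Hypothetical summary produced by an evaluation run.
interface EvaluationSummary {
  overallPassRate: number;
  passRateBySubgroup: Record<string, number>;
  observedFailureCategories: string[];
  plannedReviewRate: number;
}

interface AcceptanceDecision {
  accepted: boolean;
  failures: string[]; // which of the four components failed
}

function decideAcceptance(
  criteria: AcceptanceCriteria,
  summary: EvaluationSummary,
): AcceptanceDecision {
  const failures: string[] = [];

  // Component 1: population-level performance.
  if (summary.overallPassRate < criteria.performanceThreshold.minimum) {
    failures.push("performance");
  }

  // Component 2: gap between the best and worst subgroup.
  let lowest = Infinity;
  let highest = -Infinity;
  for (const name of Object.keys(summary.passRateBySubgroup)) {
    const rate = summary.passRateBySubgroup[name];
    if (rate === undefined) continue;
    lowest = Math.min(lowest, rate);
    highest = Math.max(highest, rate);
  }
  if (highest - lowest >= criteria.equityThreshold.maxPerformanceGap) {
    failures.push("equity");
  }

  // Component 3: zero tolerance for prohibited failure categories.
  const sawProhibited = summary.observedFailureCategories.some(
    (category) => criteria.prohibitedFailures.indexOf(category) !== -1,
  );
  if (sawProhibited) failures.push("prohibited_failure");

  // Component 4: human review must not drop below the floor.
  if (summary.plannedReviewRate < criteria.reviewCoverage.minimumRate) {
    failures.push("review_coverage");
  }

  return { accepted: failures.length === 0, failures };
}

// Usage with the thresholds from the definition above and invented results.
const criteriaSketch: AcceptanceCriteria = {
  performanceThreshold: { minimum: 0.87 },
  equityThreshold: { maxPerformanceGap: 0.08 },
  prohibitedFailures: ["discriminatory_classification", "pii_exposure_in_output"],
  reviewCoverage: { minimumRate: 0.10 },
};

// Strong aggregate accuracy, but one subgroup lags far behind.
const biasedDecision = decideAcceptance(criteriaSketch, {
  overallPassRate: 0.92,
  passRateBySubgroup: { group_a: 0.75, group_b: 0.97 },
  observedFailureCategories: [],
  plannedReviewRate: 0.30,
});
// biasedDecision.accepted is false and biasedDecision.failures is ["equity"].
```

Returning the list of failed components, rather than a bare boolean, gives stakeholders a reason for each rejection and keeps the aggregate number from dominating the conversation.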

Implementation notes

Treat these notes as the first design review checklist. They are deliberately concrete because agentic systems fail most often in the gaps between the model, the tools, the data, and the human operating process.

Design note 1

Write criteria at the population level — binary pass/fail per output will fail at edge cases.

Design note 2

Build your evaluation set before writing criteria — the set defines what the threshold means.

Design note 3

Equity checks are correctness requirements, not diversity initiatives — budget them accordingly.

Equity is a correctness requirement

A system that performs well on average but causes systematic harm to a specific population is not working correctly. Disaggregate your evaluation results before accepting any agentic system that makes decisions about people.

Common failure modes

The fastest way to make a pattern useful is to name how it breaks. These are the failure modes to watch for when a team moves from reading about this idea to deploying it inside a real workflow.

Binary acceptance criteria are written for a probabilistic system — the project fails when edge cases surface.

Aggregate accuracy masks a 20-point performance gap between subgroups that a bias audit would have caught.

Review coverage is set at launch and never reduced — the human review cost makes the business case negative.

Operating checklist

Before this pattern graduates from experiment to production, require a short operating checklist. The checklist should include the owner of the workflow, the allowed tools, the risk rating for each tool, the data sources the agent can use, the completion criteria, the review path, and the rollback plan. If a team cannot fill out that checklist, the workflow is not ready for higher autonomy.

The checklist should also define how the system will be evaluated after launch. Useful metrics include task success rate, human correction rate, average iterations per completed task, cost per successful run, escalation rate, and the number of blocked tool calls. These metrics turn agent quality into an engineering conversation instead of an opinion about whether the output felt good.
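The metrics listed above can be derived from per-run logs with very little code. The following is a minimal sketch under stated assumptions: the `RunLog` fields are hypothetical, "completed task" is interpreted as a run that succeeded, and cost per successful run is taken as total cost across all runs divided by the number of successes.

```typescript
// Hypothetical per-run log entry.
interface RunLog {
  succeeded: boolean;
  humanCorrected: boolean;
  escalated: boolean;
  iterations: number;
  cost: number; // in whatever unit the team tracks
  blockedToolCalls: number;
}

interface OperatingMetrics {
  taskSuccessRate: number;
  humanCorrectionRate: number;
  escalationRate: number;
  avgIterationsPerCompletedTask: number | null; // null when nothing succeeded
  costPerSuccessfulRun: number | null; // null when nothing succeeded
  blockedToolCalls: number;
}

function summarizeRuns(runs: RunLog[]): OperatingMetrics {
  if (runs.length === 0) {
    throw new Error("no runs to summarize");
  }
  let successes = 0;
  let corrections = 0;
  let escalations = 0;
  let iterationsOnSuccess = 0;
  let totalCost = 0;
  let blocked = 0;
  for (const run of runs) {
    if (run.succeeded) {
      successes += 1;
      iterationsOnSuccess += run.iterations;
    }
    if (run.humanCorrected) corrections += 1;
    if (run.escalated) escalations += 1;
    totalCost += run.cost; // failed runs still cost money
    blocked += run.blockedToolCalls;
  }
  return {
    taskSuccessRate: successes / runs.length,
    humanCorrectionRate: corrections / runs.length,
    escalationRate: escalations / runs.length,
    avgIterationsPerCompletedTask: successes > 0 ? iterationsOnSuccess / successes : null,
    costPerSuccessfulRun: successes > 0 ? totalCost / successes : null,
    blockedToolCalls: blocked,
  };
}

// Usage with four invented runs, each costing 1 unit.
const operatingMetrics = summarizeRuns([
  { succeeded: true, humanCorrected: false, escalated: false, iterations: 2, cost: 1, blockedToolCalls: 0 },
  { succeeded: true, humanCorrected: true, escalated: false, iterations: 4, cost: 1, blockedToolCalls: 1 },
  { succeeded: false, humanCorrected: false, escalated: true, iterations: 6, cost: 1, blockedToolCalls: 2 },
  { succeeded: false, humanCorrected: false, escalated: false, iterations: 3, cost: 1, blockedToolCalls: 0 },
]);
// Two of four runs succeeded, so each success effectively cost 2 units.
```

Charging failed runs to the cost of each success is a deliberate choice in this sketch: it keeps the business case honest when the success rate drops.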

Finally, make the learning loop explicit. When the agent fails, decide whether the fix belongs in the prompt, the retrieval layer, the tool contract, the permission model, the evaluation suite, or the human process. Mature agentic engineering is not the absence of failures. It is the ability to classify failures quickly and improve the system without expanding risk.

Key Takeaways

Write acceptance criteria at the population level with defined evaluation sets and explicit thresholds — not binary pass/fail per output.

Equity checks are a correctness requirement: measure performance across meaningful subgroups, not just aggregate accuracy.

Include four components: performance threshold, equity threshold, failure mode specification, and review coverage.
