Evaluation and Guardrails

Distinguish acceptable judgment from dangerous judgment via evaluation criteria and safety rules.

Context

When agents decide and act autonomously, after-the-fact quality review is not enough. As the system grows, unclear boundaries between "good execution" and "unsafe execution" widen quality variance and make incident root-cause analysis difficult.

Problem

Without evaluation criteria you cannot compare agent output quantitatively, and the impact of a module swap or a prompt change is unmeasurable. Without guardrails you cannot block, before the fact, the agent exceeding its cost budget, hitting an unauthorized external API, or leaking sensitive data.

Forces

Strict evaluation is safe but limits agent autonomy; lax evaluation raises risk.
Up-front guardrails prevent harm but can block legitimate execution; post-hoc evaluation catches problems only after the action has already run, when rollback is hard.
Automated evaluation is fast but misses subtle quality differences; human evaluation is accurate but slow.

Solution

Equip every agent and module with both quantitative evaluation criteria (success rate, response-quality score, cost efficiency) and safety guardrails (cost ceiling, authority scope, denylist of actions). Guardrails run before execution; evaluation runs after. If an evaluation score falls below its threshold, apply automated alerts or execution-halt policies.

For non-deterministic agents, distinguish pass@k (at least one of k tries succeeds) from pass^k (all k tries succeed). pass@k suits tool-style use, where one success among several tries is enough; pass^k is the key metric for customer-facing services, where consistency matters.

Split evaluation into capability evals (hard tasks where a low pass rate is expected) and regression evals (tasks that must stay at 100%, so any drop signals a regression). When a capability eval consistently passes at high rates, "graduate" it to a regression eval. This graduation is the signal that the boundary has stabilized in the OCLS loop.
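The difference between pass@k and pass^k, and the graduation step, can be sketched as follows. This is a minimal illustration, assuming that tries are independent and share a per-try success rate p. The function names, the 0.95 threshold, and the 5-run window are hypothetical choices made for the example, not values taken from the pattern.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent tries succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent tries succeed (pass^k)."""
    return p ** k


def should_graduate(recent_pass_rates: list[float],
                    threshold: float = 0.95,
                    window: int = 5) -> bool:
    """Hypothetical graduation rule: every one of the last `window`
    runs of a capability eval reached `threshold`."""
    if len(recent_pass_rates) < window:
        return False
    return all(rate >= threshold for rate in recent_pass_rates[-window:])


if __name__ == "__main__":
    p = 0.8  # assumed per-try success rate, for illustration only
    for k in (1, 3, 5):
        print(f"k={k}: pass@k={pass_at_k(p, k):.3f}, pass^k={pass_hat_k(p, k):.3f}")
```

With p = 0.8 and k = 3, pass@k is about 0.992 while pass^k is about 0.512: the same agent looks nearly perfect when one success is enough and unreliable when every try has to succeed.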

Judgment question

By what criteria do we measure quality and risk?

Application scenario

Illustrative scenario: figures and company names on this page are hypothetical, chosen to explain the pattern, and are not measured data.

The QA Agent evaluates Response Agent output. Criteria: relevance score ≥ 0.8, tone consistency ≥ 0.7, hallucination detection = false. Pre-execution guardrails: block if PII (national ID, card number) is present, require summarization if the response exceeds 500 characters, validate external URLs against an allowed-domain list. Post-execution evaluation: track daily quality distribution and alert ops if the mean falls below 0.75. With this structure, a prompt change can be A/B-tested and its quality delta measured directly.
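One way the scenario's checks could be wired together is sketched below. It is a rough sketch, not the QA Agent's actual implementation: the allowed-domain list, the regular expressions, and all class and function names are assumptions for illustration, and the card-number pattern only matches a digit shape (13 to 16 digits) rather than validating real card numbers. National-ID detection is left out because its format depends on the country.

```python
import re
from dataclasses import dataclass
from statistics import mean
from urllib.parse import urlparse

# Hypothetical configuration mirroring the illustrative scenario.
ALLOWED_DOMAINS = {"example.com", "support.example.com"}
MAX_CHARS = 500
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")  # rough shape only, not a validator
URL_RE = re.compile(r"https?://[^\s)>\]]+")


@dataclass
class GuardrailResult:
    allowed: bool          # False means the response must be blocked
    needs_summary: bool    # True means the response exceeds MAX_CHARS
    reasons: list[str]


def check_guardrails(text: str) -> GuardrailResult:
    """Pre-execution checks: PII shape, URL allowlist, length limit."""
    reasons = []
    if CARD_RE.search(text):
        reasons.append("possible card number")
    for url in URL_RE.findall(text):
        host = urlparse(url.rstrip(".,;")).hostname or ""
        if host not in ALLOWED_DOMAINS:
            reasons.append(f"domain not allowed: {host}")
    return GuardrailResult(not reasons, len(text) > MAX_CHARS, reasons)


@dataclass
class Scores:
    relevance: float
    tone: float
    hallucination: bool


def passes_evaluation(s: Scores) -> bool:
    """Post-execution criteria from the scenario."""
    return s.relevance >= 0.8 and s.tone >= 0.7 and not s.hallucination


def should_alert(daily_quality: list[float], floor: float = 0.75) -> bool:
    """Alert ops when the day's mean quality score drops below the floor."""
    return bool(daily_quality) and mean(daily_quality) < floor
```

Keeping the guardrail result separate from the evaluation scores reflects the split in the scenario: the first decides whether a response may be sent at all, the second feeds the daily quality distribution.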

How it breaks

Operating without criteria produces the recurring complaint "customer complaints are up but we don't know why." Operating without guardrails produces incidents that are discovered only after the fact, such as the agent shipping card numbers in a response or inventing a refund policy that doesn't exist. Either case quickly erodes trust in the system.

Implementation pattern bridge

Generator-Critic
Evaluator-Optimizer

A reviewer agent evaluates the generator agent's output. Up-front guardrails block before execution; post-hoc evaluation is realized through the Generator-Critic loop.
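A Generator-Critic loop of this kind might look like the sketch below. The generator and critic are passed in as plain callables, and the toy versions shown here are stand-ins for real agents. All names, the 0.8 threshold, and the round limit are assumptions for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: a generator takes (task, feedback) and returns a draft;
# a critic takes a draft and returns (score, feedback).
Generator = Callable[[str, Optional[str]], str]
Critic = Callable[[str], Tuple[float, str]]


def generator_critic_loop(
    task: str,
    generate: Generator,
    critique: Critic,
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Regenerate until the critic's score reaches `threshold`.

    Returns (accepted draft or None, history of (draft, score) pairs).
    """
    feedback: Optional[str] = None
    history: List[Tuple[str, float]] = []
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        score, feedback = critique(draft)
        history.append((draft, score))
        if score >= threshold:
            return draft, history
    return None, history  # nothing passed; the caller decides how to escalate


def toy_generate(task: str, feedback: Optional[str]) -> str:
    """Stand-in generator: adds a courteous closing once it receives feedback."""
    draft = f"Answer to: {task}"
    if feedback:
        draft += " Thank you for contacting us."
    return draft


def toy_critique(draft: str) -> Tuple[float, str]:
    """Stand-in critic: scores only on the presence of a courteous closing."""
    if "Thank you" in draft:
        return 0.9, ""
    return 0.5, "add a courteous closing"


if __name__ == "__main__":
    result, rounds = generator_critic_loop("refund status", toy_generate, toy_critique)
    print(result, len(rounds))
```

Returning None after the round limit, instead of the best draft so far, keeps a failed review visible to the caller, which can then halt or hand the case to a human.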

Academic References

Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems — arXiv 2512.12791
AI Governance by Design for Agentic Systems — Preprints.org

Related patterns

Module Contract: Declare the conditions, authority, and failure paths of every execution unit.
Human Approval: Keep high-cost, high-risk, high-impact decisions inside a human-approval flow.
Decision Traceability: Record judgment rationale, choices, and collaboration paths as structured logs.