Evaluation and Guardrails
Distinguish acceptable judgment from dangerous judgment via evaluation criteria and safety rules.
Context
When agents decide and act autonomously, after-the-fact quality review is not enough. As the system grows, unclear boundaries between "good execution" and "unsafe execution" widen quality variance and make incident root-cause analysis difficult.
Problem
Without evaluation criteria you cannot compare agent output quantitatively, so the impact of a module swap or a prompt change cannot be measured. Without guardrails you cannot stop the agent, before it acts, from exceeding its cost budget, calling an unauthorized external API, or leaking sensitive data.
Forces
- Strict evaluation is safe but limits agent autonomy; lax evaluation raises risk.
- Up-front guardrails prevent harm but can block legitimate execution; post-hoc evaluation catches problems only after the fact, when their effects are hard to roll back.
- Automated evaluation is fast but misses subtle quality differences; human evaluation is accurate but slow.
Solution
Equip every agent and module with both quantitative evaluation criteria (success rate, response-quality score, cost efficiency) and safety guardrails (cost ceiling, authority scope, denylist of actions). Guardrails run before execution; evaluation runs after. If an evaluation score falls below its threshold, trigger an automated alert or an execution-halt policy.
For non-deterministic agents, distinguish pass@k (at least one success in k tries) from pass^k (all k tries succeed). pass@k suits tool-style use, where one good result among several attempts is enough; pass^k is the key metric for customer-facing services, where consistency matters.
Split evaluation into capability evals (hard tasks on which a low pass rate is expected) and regression evals (tasks that must stay at 100%, so any failure signals a regression). When a capability eval passes consistently at a high rate, "graduate" it to a regression eval; this graduation is the signal that the boundary has stabilized in the OCLS loop.
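The pass@k and pass^k metrics and the graduation rule described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: it assumes each task was run n times with c successes and uses the standard combinatorial estimators over k-subsets of those runs. The function names, the 0.95 graduation threshold, and the 5-run window are hypothetical choices made for the example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k tries succeeds) from n recorded tries, c of them successful."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fraction of k-subsets that contain no success, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k tries succeed), i.e. pass^k, from the same n recorded tries."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fraction of k-subsets made up entirely of successes.
    return comb(c, k) / comb(n, k)


def ready_to_graduate(pass_rates: list[float],
                      threshold: float = 0.95,  # hypothetical value
                      window: int = 5) -> bool:  # hypothetical value
    """True when the last `window` capability-eval runs all met the threshold."""
    recent = pass_rates[-window:]
    return len(recent) == window and all(rate >= threshold for rate in recent)
```

With 8 successes in 10 tries and k = 3, `pass_at_k(10, 8, 3)` is 1.0 while `pass_hat_k(10, 8, 3)` is 56/120, roughly 0.47. The same agent looks perfect as a retryable tool and unreliable as a customer-facing service, which is why the two metrics need to be reported separately.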
Judgment question
By what criteria do we measure quality and risk?
Application scenario
Illustrative scenario: figures and company names on this page are hypothetical, used to explain the pattern, and are not measured data.
The QA Agent evaluates Response Agent output. Criteria: relevance score ≥ 0.8, tone consistency ≥ 0.7, hallucination detection = false. Pre-execution guardrails: block if PII (national ID, card number) is present, require summarization if the response exceeds 500 characters, validate external URLs against an allowed-domain list. Post-execution evaluation: track daily quality distribution and alert ops if the mean falls below 0.75. With this structure, a prompt change can be A/B-tested and its quality delta measured directly.
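The scenario's checks can be sketched in code as below. This is an illustration under stated assumptions, not the QA Agent's actual implementation: the card-number regex is deliberately rough, national-ID detection is omitted because its format is country-specific, the allowed domain `example.com` is a placeholder, and all names (`check_guardrails`, `Scores`, `meets_criteria`, `should_alert`) are hypothetical. It also assumes each response gets a single quality score for the daily mean.

```python
import re
from dataclasses import dataclass
from statistics import mean
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com"}                 # placeholder allow-list
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")   # rough card-number pattern (assumption)
URL_RE = re.compile(r"https?://[^\s)>\"']+")
MAX_CHARS = 500


def check_guardrails(draft: str) -> list[str]:
    """Pre-execution checks; returns the actions required before the draft may be sent."""
    actions = []
    if CARD_RE.search(draft):
        actions.append("block:pii")
    if len(draft) > MAX_CHARS:
        actions.append("summarize:too_long")
    for url in URL_RE.findall(draft):
        host = urlparse(url).hostname or ""
        if host not in ALLOWED_DOMAINS:
            actions.append("block:url")
    return actions


@dataclass
class Scores:
    relevance: float
    tone: float
    hallucination: bool


def meets_criteria(s: Scores) -> bool:
    """Post-execution check against the scenario's thresholds."""
    return s.relevance >= 0.8 and s.tone >= 0.7 and not s.hallucination


def should_alert(daily_quality: list[float], floor: float = 0.75) -> bool:
    """Alert ops when the day's mean quality score falls below the floor."""
    return bool(daily_quality) and mean(daily_quality) < floor
```

Keeping the guardrail result as a list of required actions, rather than a single boolean, lets the caller distinguish a hard block (PII, disallowed URL) from a soft requirement (summarize a long response).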
How it breaks
Operating without criteria produces the recurring complaint "customer complaints are up but we don't know why." Operating without guardrails produces incidents that are discovered only after the fact: the agent ships card numbers in a response, or invents a refund policy that doesn't exist. Either case quickly erodes trust in the system.
Implementation pattern bridge
- Generator-Critic
- Evaluator-Optimizer
A reviewer agent evaluates the generator agent's output. Up-front guardrails block before execution; post-hoc evaluation is realized through the Generator-Critic loop.
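One way to picture the loop described above is the sketch below. It is a simplified illustration, not a reference implementation of Generator-Critic: the `generate`, `critique`, and `guardrail` callables stand in for real agents and rule checks, and the 0.8 threshold and 3-round limit are hypothetical defaults.

```python
from typing import Callable, Optional


def generator_critic(
    task: str,
    generate: Callable[[str, Optional[str]], str],   # (task, feedback) -> draft
    critique: Callable[[str], tuple[float, str]],    # draft -> (score, feedback)
    guardrail: Callable[[str], bool],                # draft -> True if allowed
    threshold: float = 0.8,                          # hypothetical value
    max_rounds: int = 3,                             # hypothetical value
) -> Optional[str]:
    """Return the first draft that passes both guardrail and critic, else None."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        # Up-front guardrail: a violating draft never reaches the critic or the user.
        if not guardrail(draft):
            feedback = "guardrail violation"
            continue
        # Post-hoc evaluation: the critic scores the draft and explains the score.
        score, feedback = critique(draft)
        if score >= threshold:
            return draft
    # No acceptable draft within the budget; the caller decides whether to halt or escalate.
    return None
```

Returning `None` instead of the best failed draft keeps the halt decision explicit: a caller has to choose to escalate or stop, rather than silently shipping output that the critic rejected.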
Academic References
- Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems — arXiv 2512.12791
- AI Governance by Design for Agentic Systems — Preprints.org