Pattern Skills

reopt-evaluation-and-guardrails

Distinguish acceptable judgment from dangerous judgment via evaluation criteria and safety rules.

OCLS phase: SHARPEN

Install

mkdir -p <your-project>/.claude/skills/reopt-evaluation-and-guardrails
cp skills/patterns/reopt-evaluation-and-guardrails/SKILL.md <your-project>/.claude/skills/reopt-evaluation-and-guardrails/SKILL.md

Copy the file from this repo into your project and the skill activates in your Claude Code session immediately.

skills/patterns/reopt-evaluation-and-guardrails/SKILL.md
---
name: reopt-evaluation-and-guardrails
description: When agent output needs quality and safety validation. Deciding whether to place guardrails before or after execution. Designing an evaluation pipeline, hallucination detection, blocking forbidden actions.
---

# Evaluation and Guardrails

OCLS phase: **SHARPEN** · Distinguish acceptable judgment from dangerous judgment via evaluation criteria and safety rules.

## Core rules

- Equip every agent and module with both quantitative evaluation criteria (success rate, response-quality score, cost efficiency) and safety guardrails (cost ceiling, authority scope, denylist of actions).
- Guardrails run before execution; evaluation runs after.
- If an evaluation score falls below its threshold, trigger an automated alert or halt execution.
- For non-deterministic agents: use pass@k (at least one success in k tries) for tool-style use; use pass^k (every one of k tries succeeds) for customer-facing services that demand consistency.
- Split evaluation into capability eval (hard tasks, low initial pass rate) and regression eval (already-passing tasks that must stay at 100%). When a capability eval stabilizes at a high pass rate, "graduate" its cases into the regression set.
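
The gap between pass@k and pass^k can be made concrete with a small sketch. It assumes the k tries are independent and share one per-try success probability `p`, which is a simplification. `passAtK` and `passPowK` are hypothetical helper names, not part of any library.

```typescript
// Assumption: tries are independent, each succeeding with probability p.

// pass@k: probability that at least one of k tries succeeds.
function passAtK(p: number, k: number): number {
  return 1 - Math.pow(1 - p, k);
}

// pass^k: probability that all k tries succeed.
function passPowK(p: number, k: number): number {
  return Math.pow(p, k);
}

// With a 70% per-try success rate and k = 3, the two metrics diverge sharply.
console.log(passAtK(0.7, 3).toFixed(3));  // 0.973
console.log(passPowK(0.7, 3).toFixed(3)); // 0.343
```

The same agent looks near-perfect under pass@k and unreliable under pass^k, which is why the choice of metric should follow the consistency the use case demands.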

## Judgment question

**By what criteria do we measure quality and risk?**

## Application check

1. Are the quantitative evaluation criteria (scores, thresholds) defined?
2. Are both up-front guardrails and post-hoc evaluation in place?
3. Is there an automated action (alert, block) on threshold breach?
4. Is the boundary between capability eval and regression eval declared?
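
One way to declare the capability/regression boundary is to encode the graduation rule itself. The sketch below is an assumption rather than a prescribed policy: a case graduates once it has passed in each of the last `stableRuns` runs. `EvalCase`, `shouldGraduate`, and `graduate` are hypothetical names.

```typescript
// Hypothetical shape: one eval case with its pass/fail history, oldest run first.
interface EvalCase {
  id: string;
  history: boolean[];
}

// Assumed rule: a case is stable once it passed every one of the last `stableRuns` runs.
function shouldGraduate(c: EvalCase, stableRuns: number): boolean {
  if (stableRuns < 1 || c.history.length < stableRuns) return false;
  return c.history.slice(-stableRuns).every((passed) => passed);
}

// Move stable cases out of the capability set and into the regression set.
function graduate(
  capability: EvalCase[],
  regression: EvalCase[],
  stableRuns = 5
): { capability: EvalCase[]; regression: EvalCase[] } {
  const stable = capability.filter((c) => shouldGraduate(c, stableRuns));
  const remaining = capability.filter((c) => !shouldGraduate(c, stableRuns));
  return { capability: remaining, regression: regression.concat(stable) };
}
```

Both input arrays are left unmodified, so the result can be diffed against the previous sets and recorded before it is adopted.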

## Code example

```typescript
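// The declarations below are assumptions: detectPII, estimateCost, aggregate,
// and EvalSet stand in for implementations your project supplies.
interface RequestInput { text: string }
type GuardResult = { block: false } | { block: true; reason: string };
declare function detectPII(text: string): boolean;
declare function estimateCost(input: RequestInput): number;

// Up-front guardrail (before execution)
function preGuard(input: RequestInput): GuardResult {
  if (detectPII(input.text)) return { block: true, reason: "PII detected" };
  if (input.text.length > 5000) return { block: true, reason: "input too long" };
  if (estimateCost(input) > 0.5) return { block: true, reason: "cost cap" };
  return { block: false };
}

// Post-hoc evaluation (after execution).
// Thresholds are values, so they live in a const rather than an interface.
const evalCriteria = {
  relevance: { min: 0.8 },
  toneConsistency: { min: 0.7 },
  hallucination: { max: 0 },
} as const;

// Named AgentResponse to avoid clashing with the global fetch `Response` type.
interface AgentResponse { text: string }
interface Score { [criterion: string]: number }
interface Grader { grade(response: AgentResponse): Promise<Score> }
declare function aggregate(results: Score[]): Score;

// Run all graders in parallel and merge their scores.
async function evaluate(response: AgentResponse, graders: Grader[]): Promise<Score> {
  const results = await Promise.all(graders.map((g) => g.grade(response)));
  return aggregate(results);
}

declare class EvalSet {
  constructor(name: string, options: { target: number });
}

// capability eval: hard cases, starting at a low pass rate
const capabilityEval = new EvalSet("hard-legal-cases", { target: 0.6 });
// regression eval: passing cases, must stay at 100%
const regressionEval = new EvalSet("passed-cases", { target: 1.0 });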
```
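
The example above defines thresholds but stops short of acting on a breach. A minimal sketch of that step follows, assuming scores arrive as a plain name-to-number map. `Threshold`, `findBreaches`, and `decideAction` are hypothetical names, and halting on a hallucination breach while only alerting on the others is an illustrative policy, not a rule from this pattern.

```typescript
// Hypothetical threshold shape: a score must be >= min and/or <= max.
type Threshold = { min?: number; max?: number };

const criteria: Record<string, Threshold> = {
  relevance: { min: 0.8 },
  toneConsistency: { min: 0.7 },
  hallucination: { max: 0 },
};

// Return the names of all criteria the scores fail; a missing score counts as a breach.
function findBreaches(
  scores: Record<string, number>,
  rules: Record<string, Threshold>
): string[] {
  return Object.keys(rules).filter((name) => {
    const score = scores[name];
    const rule = rules[name];
    if (score === undefined) return true;
    const belowMin = rule.min !== undefined && score < rule.min;
    const aboveMax = rule.max !== undefined && score > rule.max;
    return belowMin || aboveMax;
  });
}

type Action = "pass" | "alert" | "halt";

// Assumed policy: any hallucination breach halts; other breaches only alert.
function decideAction(breaches: string[]): Action {
  if (breaches.length === 0) return "pass";
  return breaches.some((b) => b === "hallucination") ? "halt" : "alert";
}
```

Keeping breach detection separate from the action policy lets the thresholds and the response to a breach change independently.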

## Antipatterns

**Customer support**: operating without evaluation criteria leads to the recurring complaint that "customer complaints are up but we don't know why." Operating without guardrails leads to incidents that are discovered only after the fact, such as the agent including card numbers in a response or inventing a refund policy that doesn't exist.

**Coding agents**: repeating code generation without evaluation leaves no data to judge the impact of a model upgrade. Locking test-pass rate and security-vulnerability detection rate as regression evals detects regressions immediately.
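
As a rough sketch of such a regression gate, the comparison below flags any locked metric that drops after a model upgrade. The metric names, the `Metrics` shape, and the zero-tolerance default are assumptions for illustration.

```typescript
// Hypothetical metric snapshot from one eval run (rates between 0 and 1).
type Metrics = Record<string, number>;

// Return every metric whose value fell by more than `tolerance` versus the baseline.
// A metric missing from the candidate run is treated as a regression.
function findRegressions(baseline: Metrics, candidate: Metrics, tolerance = 0): string[] {
  return Object.keys(baseline).filter((name) => {
    const after = candidate[name];
    return after === undefined || baseline[name] - after > tolerance;
  });
}

// Illustrative numbers only, not measured results.
const beforeUpgrade: Metrics = { testPassRate: 1.0, vulnDetectionRate: 0.95 };
const afterUpgrade: Metrics = { testPassRate: 0.98, vulnDetectionRate: 0.95 };

console.log(findRegressions(beforeUpgrade, afterUpgrade)); // [ 'testPassRate' ]
```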

## Invocation example

```
"Add evaluation and guardrails to Response Agent.
Up-front guardrails: PII, cost, length checks.
Post-hoc evaluation: relevance ≥ 0.8, tone consistency ≥ 0.7, hallucination = 0.
Separate capability eval (hard cases) from regression eval (passing cases)."
```

## Related patterns

- Module Contract — the basis for contract-violation detection
- Human Approval — the escalation path when evaluation fails
- Decision Traceability — recording evaluation results
