GOVERNANCE.md

OVERVIEW

What is GOVERNANCE.md?

GOVERNANCE.md is a single-page document that sits at the project root. It declares who owns what, under what conditions judgments are made, how collaboration runs, and against what criteria improvement happens. Without this documented structure, both the OCLS loop and day-to-day execution lose direction.

SPEC

A verifiable specification

GOVERNANCE.md is both a guide for humans and a specification that the governance-reviewer skill checks per PR. Section order, required composition, and forbidden signals are expressed as rules.

Section order

Sections must appear in the order OWN → CONTRACT → LAYER → SHARPEN. Missing sections and order violations are blocked by the `missing-section` and `section-order` linter rules.


RULE

Linter rules

Each rule is checked by the governance-reviewer skill against the PR diff. Errors must be resolved before merge; warnings are left as review comments.

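| Rule | Severity | Target |
|------|----------|--------|
| `missing-section` | error | One or more of §01–§05 is empty |
| `section-order` | error | Sections violate the OWN → CONTRACT → LAYER → SHARPEN order |
| `authority-sprawl` | warning | The OWN section does not declare authority scopes |
| `contract-gap` | error | The CONTRACT section is missing input, output, or reject_when |
| `observability-gap` | warning | The LAYER section is missing the trace or log specification |
| `validation-gap` | error | The SHARPEN section is missing a metric, threshold, or review cadence |
| `broken-ref` | error | A reference in the frontmatter is not defined in the body |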
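
As a rough illustration of the first two rules, the Python sketch below checks section presence and order. It assumes each section heading carries its phase tag in parentheses, as the headings in the complete example do (for instance `(OWN)`), and it only looks at the four OCLS sections rather than all of §01–§05. The function name `lint_sections` and the message format are hypothetical; this is not the governance-reviewer skill's actual implementation.

```python
import re

REQUIRED_ORDER = ["OWN", "CONTRACT", "LAYER", "SHARPEN"]

# Assumption: a section heading ends with its phase tag, e.g. "## 1. Ownership (OWN)".
HEADING = re.compile(r"^#{1,6}\s.*\((OWN|CONTRACT|LAYER|SHARPEN)\)\s*$")


def lint_sections(markdown: str) -> list[tuple[str, str]]:
    """Return (rule, message) pairs for missing or out-of-order OCLS sections."""
    found = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match and match.group(1) not in found:
            found.append(match.group(1))

    violations = []
    for tag in REQUIRED_ORDER:
        if tag not in found:
            violations.append(("missing-section", f"{tag} section not found"))

    # Order is judged only among the sections that are actually present.
    expected = [tag for tag in REQUIRED_ORDER if tag in found]
    if found != expected:
        violations.append(("section-order", f"found {found}, expected {expected}"))
    return violations
```

Calling `lint_sections(text)` on the body of a GOVERNANCE.md returns an empty list when all four tagged headings appear in order.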

YAML

Machine-readable frontmatter

A YAML representation equivalent to the human markdown. The body targets humans; the frontmatter targets the linter and coding agents. Divergence between the two produces a broken-ref violation.

---
version: 0.1
product: Customer Support AI
maturity: 3
agents:
  intake:
    type: ai
    owner: support-team
    authority: [classify.read]
  response:
    type: ai
    owner: support-team
    authority: [reply.draft, refund.draft]
  cs-manager:
    type: human
    authority: [refund.approve, escalation.resolve]
contracts:
  intake:
    input: { message: string }
    output: { category: enum, urgency: enum }
    reject_when: ["len(message) > 5000"]
  response:
    input: { category: "ref(contracts.intake.output)", history: ref }
    output: { reply: string, reasoning: string }
    reject_when: ["contains(pii)", "mentions(competitor)"]
metrics:
  accuracy: { target_min: 0.95, review: weekly }
  cost_per_call_krw: { target_max: 500, review: daily }
  escalation_rate: { target_max: 0.05, review: weekly }
---
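
The sketch below shows one way a tool might consume this frontmatter once it has been parsed into a Python dict (the parsing step itself is left out). It applies a `contract-gap` style check to the `contracts` block and also tests whether `ref(...)` values point at an existing dotted path. The dotted-path interpretation of `ref(...)`, the `unresolved-ref` label, and the function names are assumptions made for illustration; the source defines `broken-ref` only as a mismatch between frontmatter and body.

```python
import re

REQUIRED_FIELDS = ("input", "output", "reject_when")

# Assumption: ref(a.b.c) names a dotted path inside the frontmatter itself.
REF_PATTERN = re.compile(r"^ref\(([\w.-]+)\)$")


def path_exists(path: str, root: dict) -> bool:
    """Return True when a dotted path such as 'contracts.intake.output' exists."""
    node = root
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True


def check_contracts(frontmatter: dict) -> list[tuple[str, str]]:
    """Return (label, location) pairs for incomplete contracts and dangling refs."""
    problems = []
    for name, contract in frontmatter.get("contracts", {}).items():
        # A contract needs all three fields, and none of them may be empty.
        for field in REQUIRED_FIELDS:
            if not contract.get(field):
                problems.append(("contract-gap", f"contracts.{name}: missing {field}"))
        for key, value in (contract.get("input") or {}).items():
            match = REF_PATTERN.match(value) if isinstance(value, str) else None
            if match and not path_exists(match.group(1), frontmatter):
                problems.append(
                    ("unresolved-ref", f"contracts.{name}.input.{key}: {match.group(1)}")
                )
    return problems
```

A bare value such as `history: ref` does not match the pattern and is ignored by this sketch.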

§01

Diagnostic Snapshot

Assessment Snapshot — summarize /assessment results at the top

Record the latest assessment date and the chosen preset.
List the top three priority patterns with scores — see at a glance which patterns are most urgent.
Record one or two key tensions with response guidance — the rationale for trade-off decisions.
Summarize the recommended maturity stage — where you are now and what comes next.
Keep the full assessment report at `docs/assessment-YYYY-MM-DD.md` and reference the path.

Per-stage deliverables

What the current stage requires. Compare against the recommended stage from /assessment.

1. Single-Agent Start

Core deliverables

Per-module draft input/output contracts
Baseline execution logs (call count, cost, success/failure)
List of points where roles conflict

Risks of this stage

Skipping to stage 2 without logs makes partitioning intuition-driven and produces the wrong boundaries.
Partitioning without contracts leaves agent interfaces tacit, multiplying collaboration failures.

2. Responsibility Separation

Core deliverables

Per-agent responsibility statements
Handoff rules and context-passing schemas
Per-agent independent evaluation criteria

Risks of this stage

Overly fine partitioning makes handoff overhead more expensive than the single-agent baseline.
Mis-cut boundaries split one responsibility across two agents — accountability blurs again.

3. Multi-Agent Collaboration

Core deliverables

Stabilized collaboration flow diagrams
Context-routing rules
State/memory separation policy

Risks of this stage

Scaling without governance triggers cost explosion, quality degradation, and security incidents at the same time.
Introducing governance too early imposes rigid rules on still-unstable boundaries and blocks evolution.

4. Governance by Design

Core deliverables

Automated evaluation and guardrail pipelines
Approval classification matrix and escalation rules
OCLS-based recurring review process

Risks of this stage

Governance turning into ritual produces governance theater — the metrics are managed, the quality is not.
Governance becoming a bottleneck unnecessarily constrains agent autonomy and throughput.

## Diagnostic Snapshot

> Last assessment: YYYY-MM-DD
> Preset: <preset name>
> Full report: docs/assessment-YYYY-MM-DD.md

### Top 3 priority patterns
1. **<Pattern name>** (<score>/100)
2. **<Pattern name>** (<score>/100)
3. **<Pattern name>** (<score>/100)

### Key tensions
- <tension description> — <response guidance>

### Recommended maturity
**Stage N — <stage name>**

§02

Ownership Structure

OWN — who (human or AI) owns which outcome

List the product's core outcomes and the owner of each.
Mark whether each owner is human or AI — the type of owner determines how strict the contract must be.
AI-owned: every judgment condition must be explicit, with mandatory guardrails and escalation.
Human-owned: contextual judgment is possible, but the judgment criteria must be documented so the team shares them.
Identify gray zones with unclear ownership and assign them.

# Ownership Structure

| Outcome | Owner | Type | Governance level |
|---------|-------|------|------------------|
| Inquiry classification | Intake Agent | AI | Contract-based auto execution, human escalation on misclassification |
| Response quality | Response Agent | AI | Guardrails + post-hoc QA review |
| Refund approval (> ₩500,000) | CS manager | Human | Pre-approval required |
| Cost-ceiling compliance | Ops team | Human | Daily dashboard monitoring |
| Regulatory compliance | Compliance Officer | Human | External audit response |

Risk signal: Authority Sprawl

As agents multiply, who holds which authority becomes untraceable. Agent identities, access scopes, and execution rights expand tacitly, producing security incidents and accountability gaps at the same time.
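
One way to keep authority traceable is to treat the declared scopes as an allow-list and deny everything else. The Python sketch below reuses the agent names and authority strings from the frontmatter example; the helper names `is_authorized` and `undeclared_authority` are hypothetical, and the deny-by-default policy is an assumption rather than something the rule table prescribes.

```python
# Authority scopes copied from the frontmatter example.
AGENTS = {
    "intake": {"type": "ai", "authority": ["classify.read"]},
    "response": {"type": "ai", "authority": ["reply.draft", "refund.draft"]},
    "cs-manager": {"type": "human", "authority": ["refund.approve", "escalation.resolve"]},
}


def is_authorized(agent: str, action: str, agents: dict = AGENTS) -> bool:
    """Deny by default: an action is allowed only if the agent declares it."""
    return action in agents.get(agent, {}).get("authority", [])


def undeclared_authority(agents: dict) -> list[str]:
    """Names of agents with no declared scope, the case authority-sprawl warns about."""
    return [name for name, spec in agents.items() if not spec.get("authority")]
```

With this table, `is_authorized("response", "refund.approve")` is false because only `cs-manager` declares that scope.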

§03

Judgment Contracts

CONTRACT — under what conditions which judgment is made

Declare each execution unit's input conditions, output format, and refusal conditions.
Define the classification criteria for auto-approval, post-hoc review, and pre-approval.
Specify the recovery path and fallback behavior on failure.

# Judgment Contracts

## Classification criteria
- Auto-approve: cost < ₩100,000, within existing policy
- Post-hoc review: cost ₩100,000–500,000, new cases
- Pre-approval: cost > ₩500,000, legal responsibility, customer commitments

## Refusal conditions
- Response contains PII → block
- Comparative statements about competitors → escalate

Risk signal: Contract Gap

Modules and agents run without documented input/output, refusal conditions, or failure modes. Side effects of a module swap or prompt change become unpredictable, and there is no basis to define evaluation criteria.
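
The classification criteria in the template above can be made executable so the tiers are not left to interpretation. The Python sketch below is one possible encoding: the thresholds come from the template, while the function name, the flag names, and the handling of the exact boundaries (₩100,000 and ₩500,000 both fall into post-hoc review) are assumptions.

```python
from enum import Enum


class Tier(Enum):
    AUTO = "auto-approve"
    POST_HOC = "post-hoc review"
    PRE_APPROVAL = "pre-approval"


def classify(cost_krw: int, *, within_policy: bool = True,
             legal_risk: bool = False, customer_commitment: bool = False) -> Tier:
    """Map a request to an approval tier using the template's criteria."""
    # Check the most restrictive tier first so a risk flag is never
    # downgraded just because the cost is low.
    if cost_krw > 500_000 or legal_risk or customer_commitment:
        return Tier.PRE_APPROVAL
    if cost_krw >= 100_000 or not within_policy:
        return Tier.POST_HOC
    return Tier.AUTO
```

Ordering the checks from strictest to most permissive means a new condition added later can only tighten the result, never loosen it.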

§04

Collaboration Rules

LAYER — information transfer and approval flow between actors

Define which context to include and exclude when passing information between actors.
Specify collaboration order and dependencies.
Distinguish areas that can run in parallel from those with sequential dependency.

# Collaboration Rules

## Information-transfer scope
- Classification → Response: category, sentiment, prior resolution history
- Response → QA: response draft, referenced policy, judgment rationale
- QA → Ops: quality score, violations, improvement suggestions

## Dependencies
- Classification complete → response start (sequential)
- Response + cost calculation (parallel)
- QA runs asynchronously on every response

Risk signal: Observability Gap

Reasoning paths, decision rationale, and handoff reasons go unrecorded. When something fails, the cause cannot be pinpointed and the entire reasoning process turns opaque.
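
A small Python sketch of a structured handoff record that would close this gap. The `trace_id`, `decision`, and `reasoning` fields follow the structured log named in the complete example; the `timestamp`, `source`, and `target` fields and all function names are assumptions added for illustration.

```python
import json
import uuid
from datetime import datetime, timezone


def new_trace_id() -> str:
    """Generate an opaque identifier shared by every handoff in one request."""
    return uuid.uuid4().hex


def handoff_record(trace_id: str, source: str, target: str,
                   decision: str, reasoning: str) -> dict:
    """Build one structured log entry for a handoff between two actors."""
    return {
        "trace_id": trace_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "target": target,
        "decision": decision,
        "reasoning": reasoning,
    }


def to_json_line(record: dict) -> str:
    """Serialize a record as one JSON line, keeping non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False)
```

Reusing the same `trace_id` across the Intake, Response, and QA handoffs of one request is what lets a failure be traced back through the chain.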

§05

Improvement Criteria

SHARPEN — what to measure and when to adjust the structure

Define core quality metrics and their targets.
Set cost-tracking units and budget ceilings.
Specify the signals that trigger structural adjustment (review cadence, thresholds).

# Improvement Criteria

## Core metrics
| Metric | Target | Review cadence |
|--------|--------|----------------|
| Response accuracy | > 95% | weekly |
| Token cost / request | < ₩500 | daily |
| Escalation rate | < 5% | weekly |

## Adjustment signals
- Accuracy < 90% for 3 consecutive days → revisit the ownership structure
- Cost budget at 80% → re-tune model assignments
- 10+ new refusal patterns → renew the contract

Risk signal: Validation Gap

Evaluation criteria and guardrails are missing or exist only as ritual. Quality variance widens, dangerous behavior is found only after the fact, and the feedback loop for improvement is broken.
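
The adjustment signals in the template above are simple enough to evaluate mechanically. The Python sketch below encodes the three of them; the thresholds are taken from the template, while the function names and the choice to read "3 consecutive days" as the three most recent daily readings are assumptions.

```python
def trailing_breach(values: list[float], threshold: float, days: int) -> bool:
    """True when the most recent `days` readings are all below `threshold`."""
    return len(values) >= days and all(v < threshold for v in values[-days:])


def adjustment_signals(accuracy_by_day: list[float], spend_krw: float,
                       budget_krw: float, new_refusal_patterns: int) -> list[str]:
    """Return the structural adjustments triggered by the current readings."""
    actions = []
    if trailing_breach(accuracy_by_day, 0.90, days=3):
        actions.append("revisit the ownership structure")
    if budget_krw > 0 and spend_krw >= 0.80 * budget_krw:
        actions.append("re-tune model assignments")
    if new_refusal_patterns >= 10:
        actions.append("renew the contract")
    return actions
```

A review job could call `adjustment_signals` with the latest readings and open a review item for each returned action.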

§06

Related patterns

The table below maps each governance commitment to the implementation patterns that realize it.

| Governance commitment | Implementation pattern | How they connect |
|-----------------------|------------------------|------------------|
| Responsibility Partitioning | Hierarchical Decomposition, Parallel Fan-out/Gather | A parent agent decomposes the goal into sub-responsibilities and delegates them. Parallelizable responsibilities are realized as Fan-out; sequentially dependent ones as Hierarchical. |
| Module Contract | Spec-Driven Development | The contract's input/output schema becomes an executable spec that generates code, docs, and mocks. An MCP Tool Definition is one concrete realization of a module contract. |
| Context Routing | Sequential Pipeline, Routing Pattern | At each pipeline step, rules filter and structure the context passed to the next agent. The Routing Pattern branches inquiries to the appropriate agent based on type. |
| State and Memory Control | Session Store / Vector Memory | Mostly an infrastructure-design concern. Key implementation work: separating session store (short term) from vector DB / knowledge base (long term) and controlling read/write authority. |
| Evaluation and Guardrails | Generator-Critic, Evaluator-Optimizer | A reviewer agent evaluates the generator agent's output. Up-front guardrails block before execution; post-hoc evaluation is realized through the Generator-Critic loop. |
| Human Approval | Human-in-the-Loop | Asynchronous approval queues and approve/deny callbacks are the core implementation elements. The default is an asynchronous structure that lets other work continue while approval is pending. |
| Cost Control | Token Budget Management, Model Routing | Combines route-based model selection (high risk = strong model, low risk = lightweight), per-agent token budgets, and caching/batching strategies to control the cost curve. |
| Decision Traceability | Structured Logging, Distributed Tracing | Collects per-agent decision logs in structured form (JSON, OpenTelemetry spans) and implements tracing infrastructure that captures causal relationships. |
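
To make the Evaluation and Guardrails row concrete, here is a minimal Python sketch of a Generator-Critic loop with a bounded number of regenerations followed by human escalation. The limit of two regenerations mirrors the flow in the complete example; reading it as one initial draft plus up to two retries is an assumption, and the `generate` and `review` callables are hypothetical stand-ins for the generator and reviewer agents.

```python
from typing import Callable, Optional


def generate_with_review(generate: Callable[[int], str],
                         review: Callable[[str], bool],
                         max_regenerations: int = 2) -> tuple[str, Optional[str]]:
    """Return ("delivered", draft) on approval, else ("escalated", None)."""
    # One initial attempt plus up to `max_regenerations` retries.
    for attempt in range(1 + max_regenerations):
        draft = generate(attempt)
        if review(draft):
            return "delivered", draft
    # Every attempt was blocked by the reviewer: hand off to a human.
    return "escalated", None
```

Passing the attempt index to `generate` lets the generator vary its prompt or model on a retry instead of repeating the same draft.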

§07

Review checklist

Governance-layer questions to inspect when authoring or updating GOVERNANCE.md. See /checklist for the full list.

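SHARPEN: Are the evaluation criteria defined quantitatively?
SHARPEN: Are the human-approval criteria for high-risk tasks explicit?
SHARPEN: Can every decision be traced after the fact?
SHARPEN: Are real-time detection and automatic blocking of attempted security violations implemented?
SHARPEN: Is there a system that automatically detects SLA-violation trends and raises alerts?
SHARPEN: Are all decision logs stored with a retention period and format that meet regulatory audit requirements?
SHARPEN: Does an automated evaluation pipeline continuously validate agent output?
SHARPEN: Are the quality and freshness of collected data verified periodically?
SHARPEN: Are costs tracked in real time per agent and per module, with budget ceilings set?
SHARPEN: Is there a process for regularly reviewing and adjusting the scope of the AI's autonomous judgment?
SHARPEN: Is the impact of system quality degradation on revenue metrics monitored in real time?
SHARPEN: Is the correlation between customer satisfaction and system performance tracked?
SHARPEN: When a domain rule changes, can the affected agents be identified automatically?
SHARPEN: Are there governance rules for resolving priority conflicts among stakeholders?
SHARPEN: Is decision quality tracked in connection with business outcomes?
SHARPEN: Is there a feedback loop that reflects changes in the competitive landscape in product priorities?
SHARPEN: Has the cost curve been projected as the user base grows?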


§08

Rationale — why OCLS

Four principles implemented across the four OCLS phases. The phase order follows the principle order directly.

Own Every Outcome → OWN

Design around responsibility units, not features

An agent is not a function caller — it is a continuously explainable owner of outcomes.

Contract First → CONTRACT

A module is not a callable tool — it is an execution unit with a contract

Only when a module has a contract does the structure support evaluation, replacement, authority control, and testing.

Layer, Then Scale → LAYER

Scale is possible only when governed through classification and structure

As agents multiply, structuring them into categories, layers, and boundaries is what keeps governance intact and scale sustainable.

Sharpen in Operation → SHARPEN

Architecture is a system that evolves in operation

Responsibility boundaries and policies must keep adjusting from failure logs and evaluation outcomes.

Complete example

A complete GOVERNANCE.md example for a customer-support AI. Drop this at the project root and update it on every OCLS pass.

# GOVERNANCE.md: Customer Support AI

> Product: AI-based customer support automation system
> Stage: 3. Multi-Agent Collaboration
> Last updated: 2026-04-17
> Next review: 2026-05-01

---

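## Diagnostic Snapshot

> Last assessment: 2026-04-17
> Preset: Customer-facing service
> Full report: docs/assessment-2026-04-17.md

### Top 3 priority patterns

1. **Context Routing** (87/100)
2. **Evaluation and Guardrails** (82/100)
3. **Module Contract** (79/100)

### Key tensions

- Rich customer experience versus response speed: deliver the core response immediately and supplement additional context via streaming or asynchronously.
- Careful decision-making versus fast response: verify high-impact judgments synchronously and low-impact ones asynchronously after the fact.

### Recommended maturity

**Stage 3: Multi-Agent Collaboration**

Define collaboration rules and an information-transfer scheme to stabilize the product flow.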

---

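## 0. reopt architecture context

This document was written following the reopt architecture methodology.
Coding agents read this section and use it as the basis for structural judgments.

### OCLS loop
Every structural judgment follows a four-step loop:
- **OWN**: Who owns this outcome? A human or an AI?
- **CONTRACT**: Are the judgment conditions, refusal conditions, and failure paths explicit?
- **LAYER**: Are information transfer and collaboration rules between actors defined?
- **SHARPEN**: Are there criteria for improving the structure based on operational results?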

---

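## 1. Ownership Structure (OWN)

The table below declares the owner and authority scope of each outcome. The governance level determines whether automatic execution is allowed.

| Outcome | Owner | Type | Authority and governance level |
|---------|-------|------|--------------------------------|
| Inquiry classification | Intake Agent | AI | classify.read authority. Human review when the misclassification rate exceeds 10% |
| Response generation | Response Agent | AI | reply.draft authority + guardrails + post-hoc review by the QA Agent |
| Response quality verification | QA Agent | AI | Full inspection. On violation: automatic block + human notification |
| Refund decision (₩100,000 or less) | Response Agent | AI | refund.draft authority + auto-approval + post-hoc review by daily sampling |
| Refund decision (over ₩100,000) | CS manager | Human | refund.approve authority. Pre-approval required |
| Cost ceiling | Ops team | Human | Daily dashboard. Alert at 80% of budget |
| Regulatory compliance | Compliance Officer | Human | Monthly audit. Immediate response on PII issues |

### Gray zones
- Customer mentions legal action → immediate escalation to the CS manager
- New request types not covered by existing policy → owner assigned at the weekly review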

---

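## 2. Judgment Contracts (CONTRACT)

### Intake Agent
- Input: customer message (text)
- Output: { category, sentiment, urgency, language }
- Refusal: message length > 5,000 characters → respond "Please split your message into smaller parts"
- Failure: classification confidence < 0.7 → classify as "Other" + flag for human review

### Response Agent
- Input: { category, sentiment, customerHistory, relevantPolicies }
- Output: { response, referencedPolicies, reasoning }
- Refusal: request containing PII → block. Competitor comparison → escalate
- Model: general inquiries = lightweight model, complex inquiries = high-performance model

### QA Agent
- Input: { response, reasoning, category }
- Output: { score, violations[], approved }
- Criteria: accuracy ≥ 0.9, tone consistency ≥ 0.85, policy compliance = 100%
- On violation: score < 0.8 → automatic block + Response Agent regenerates

### Approval classification
| Tier | Criteria | Handling |
|------|----------|----------|
| Auto | Within existing policy + cost < ₩100,000 | Execute immediately |
| Post-hoc review | Cost ₩100,000–500,000 or a new case | Review within 24 hours after execution |
| Pre-approval | Cost > ₩500,000, legal mention, customer commitment | Execute after CS manager approval |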

---

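## 3. Collaboration Rules (LAYER)

Every handoff is recorded as a structured log (trace_id, decision, reasoning).

### Information-transfer scope
| From | To | Include | Exclude |
|------|----|---------|---------|
| Intake | Response | category, sentiment, prior resolution history | Full original message (summary only) |
| Response | QA | response draft, referenced policies, judgment rationale | Customer personal information |
| QA | Ops dashboard | quality score, violations | Individual response content |

### Flow
```
Customer message → Intake (classify) → Response (generate reply)
                                               ↓
                                         QA (quality check)
                                               ↓
                                       [pass] → deliver to customer
                                       [blocked] → Response regenerates (max 2 times)
                                       [2 failures] → escalate to a human
```

### Cost allocation
- Intake: fixed lightweight model (~₩50 per request)
- Response: model selected by complexity (₩100–800 per request)
- QA: fixed lightweight model (~₩30 per request)
- Daily budget ceiling set per agent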

---

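## 4. Improvement Criteria (SHARPEN)

### Core metrics
| Metric | Current | Target | Review cadence |
|--------|---------|--------|----------------|
| Response accuracy | 93% | > 95% | weekly |
| Misclassification rate | 8% | < 5% | weekly |
| Average response time | 4.2 s | < 3 s | daily |
| Token cost / request | ₩420 | < ₩500 | daily |
| Escalation rate | 7% | < 5% | weekly |
| QA block rate | 12% | < 8% | weekly |
| Customer satisfaction (CSAT) | 4.1 | > 4.3 | monthly |

### Adjustment signals
| Signal | Threshold | Action |
|--------|-----------|--------|
| Accuracy drop | < 90% for 3 consecutive days | Revisit the ownership structure. Check the Response Agent prompt and model |
| Cost spike | Daily budget reaches 80% | Re-tune model assignments. Check the caching strategy |
| New refusal patterns | 10 or more per week | Renew the contract. Add new refusal conditions |
| Escalation spike | > 10% for 2 consecutive days | Revisit the approval classification criteria |
| CSAT drop | < 3.8 | Full review. Re-check response tone and policy |

### Review schedule
- Daily: check the cost dashboard and error logs (Ops)
- Weekly: review quality metrics and decide whether to renew contracts (whole team)
- Monthly: full GOVERNANCE.md review + re-run the Assessment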