Cost Control

Manage token budget, model selection, and call frequency structurally to control the cost curve.

Context

Agentic AI systems incur token cost in proportion to usage. Unlike the largely fixed infrastructure cost of traditional software, an agent's cost can vary by an order of magnitude depending on reasoning frequency, context length, and model choice. Without cost optimization, the expected return on investment does not materialize at scale.

Problem

Without a cost-control structure, agents call expensive models for cheap tasks, re-reason over identical input, or forward excessive context, and cost climbs much faster than the workload justifies. Without per-agent and per-module attribution, you cannot see where the cost accumulates or what to optimize first.
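Per-agent and per-module attribution can be sketched as a small ledger. This is a minimal illustration, not a reference implementation: the `Usage` and `CostLedger` names, the model tiers, and the prices are all hypothetical, and real rates depend on the provider.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens (assumption, not real rates).
PRICE_PER_MTOK = {
    "strong": {"input": 10.0, "output": 30.0},
    "light": {"input": 1.0, "output": 3.0},
}


@dataclass(frozen=True)
class Usage:
    agent: str
    module: str
    model: str
    input_tokens: int
    output_tokens: int


def call_cost(u: Usage) -> float:
    """Cost of one call in USD under the hypothetical price table."""
    price = PRICE_PER_MTOK[u.model]
    return (u.input_tokens * price["input"]
            + u.output_tokens * price["output"]) / 1_000_000


class CostLedger:
    """Accumulates cost keyed by (agent, module) so spend can be attributed."""

    def __init__(self) -> None:
        self._by_key = defaultdict(float)

    def record(self, u: Usage) -> float:
        cost = call_cost(u)
        self._by_key[(u.agent, u.module)] += cost
        return cost

    def total(self) -> float:
        return sum(self._by_key.values())

    def by_agent(self) -> dict:
        out = defaultdict(float)
        for (agent, _module), cost in self._by_key.items():
            out[agent] += cost
        return dict(out)

    def top(self, n: int = 3) -> list:
        """The n most expensive (agent, module) pairs, highest first."""
        return sorted(self._by_key.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Calling `ledger.record(...)` after every model call and reading `ledger.top()` gives a first answer to "where does the cost live"; in production the same keys would typically be emitted as metrics rather than held in memory.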

Forces

High-quality models are accurate but expensive; lightweight models are cheap but risk quality degradation.
Aggressive caching cuts cost but loses freshness; always-recompute keeps freshness but burns tokens.
Strict per-agent budgets control cost but can block important reasoning.
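The caching-versus-freshness force can be made concrete with a time-to-live (TTL) cache, where one number sets how stale an answer may get before it is recomputed. The sketch below is an assumption-laden illustration: `TTLCache`, its `clock` parameter, and the `compute` callback (standing in for a model call) are hypothetical names.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Reuses a cached answer until it is older than ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._store: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: str, compute: Callable[[], str]) -> str:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            self.hits += 1       # no tokens spent, answer may be stale
            return entry[1]
        self.misses += 1         # tokens spent, answer is fresh
        value = compute()
        self._store[key] = (now, value)
        return value
```

A long TTL pushes toward the "cuts cost but loses freshness" end of the trade-off and a TTL of zero toward "always recompute"; the hit and miss counters make the chosen point on that line measurable.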

Solution

Structure cost control in three layers.

First, route-based model selection: high-stakes decision paths get the best model; classification and filtering get a lightweight model.

Second, per-agent budget allocation: each agent and module gets a token budget, usage is tracked in real time, and when the budget is exhausted the agent falls back to a lightweight model or escalates to a human. For models that use extended thinking, track thinking tokens as a separate budget line so the cost of "thinking harder" is visible.

Third, cache and batching strategy: pin frequently reused system prompts, tool definitions, and long reference documents with prompt caching. Caching works by prefix match, so place cacheable blocks stably at the front of the prompt and put variable input after them. Depending on the provider's pricing, a cache hit can drop the cost of those input tokens to roughly one-tenth of the base rate, and it cuts latency too. Batch work that can run asynchronously to reduce call count.

Cost data is a primary input to the OCLS SHARPEN loop: cost anomalies trigger re-tuning of model assignments and budget boundaries.
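The first two layers can be sketched together as a routing function plus a budget check. Everything named here is hypothetical: the model names, the `HIGH_STAKES` task set, and the fallback policy (high-stakes tasks escalate to a human, low-stakes tasks stay on the lightweight model) are assumptions chosen for illustration. Thinking tokens are counted on their own line, separately from ordinary tokens.

```python
from dataclasses import dataclass

# Hypothetical model tiers and task types (assumptions for illustration).
STRONG, LIGHT = "strong-model", "light-model"
HIGH_STAKES = {"refund_decision", "contract_review"}


def route(task_type: str) -> str:
    """High-stakes paths get the strong model; everything else the light one."""
    return STRONG if task_type in HIGH_STAKES else LIGHT


@dataclass
class AgentBudget:
    token_limit: int       # ordinary input + output tokens
    thinking_limit: int    # extended-thinking tokens, tracked separately
    tokens_used: int = 0
    thinking_used: int = 0

    def record(self, tokens: int, thinking_tokens: int = 0) -> None:
        self.tokens_used += tokens
        self.thinking_used += thinking_tokens

    @property
    def exhausted(self) -> bool:
        return (self.tokens_used >= self.token_limit
                or self.thinking_used >= self.thinking_limit)


def select_model(task_type: str, budget: AgentBudget) -> str:
    """Return a model name, or "escalate" when a high-stakes task has no budget left."""
    model = route(task_type)
    if not budget.exhausted:
        return model
    # Assumed policy: do not silently downgrade a high-stakes decision.
    return "escalate" if model == STRONG else LIGHT
```

Keeping `route` and the budget check separate means a cost anomaly can be answered by changing either the task-to-model mapping or the limits without touching the other.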

Judgment question

Does this reasoning step really need this model?

Application scenario

Illustrative scenario — figures and company names in this page are hypothetical for explaining the pattern, not measured data.

A customer-support system initially used the same high-end model across all agents. When inquiries grew 10×, monthly cost ran 3× over budget. Analysis showed that the Intake Agent (classification) consumed 40% of tokens yet kept 95% accuracy on a lightweight model, so it was moved to one. The Response Agent kept the strong model but applied cached templates for FAQs. The QA Agent moved from full inspection to sampling. Total cost fell 60% while quality metrics held.

How it breaks

Operating without cost tracking makes "why did our spend double this month" unanswerable. Using the same model for every agent burns expensive reasoning on simple classification; no caching strategy means identical questions consume fresh tokens every time. Putting a timestamp or random ID near the front of the prompt invalidates the prompt-cache prefix on every call and throws the caching benefit away entirely. Without a budget ceiling, traffic spikes turn into unbounded cost.
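The prefix-invalidation failure can be illustrated by comparing how much of two consecutive prompts is shared from the start. This is a rough proxy only: real prompt caches match on tokens and provider-specific boundaries, not characters, and the prompt text and function names below are hypothetical.

```python
from os.path import commonprefix

# Hypothetical stable blocks that should be cacheable across calls.
SYSTEM_PROMPT = "You are a support agent. Follow the policy below.\n"
TOOL_DEFS = "tools: lookup_order, issue_refund\n"


def build_prompt(user_input: str, request_id: str) -> str:
    """Stable blocks first, per-request fields last."""
    return (SYSTEM_PROMPT + TOOL_DEFS
            + f"request_id: {request_id}\nuser: {user_input}\n")


def build_prompt_bad(user_input: str, request_id: str) -> str:
    """Per-request ID at the front: the shared prefix ends almost immediately."""
    return (f"request_id: {request_id}\n" + SYSTEM_PROMPT + TOOL_DEFS
            + f"user: {user_input}\n")


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(commonprefix([a, b]))


stable = len(SYSTEM_PROMPT + TOOL_DEFS)
good = shared_prefix_len(build_prompt("hi", "a1"), build_prompt("bye", "b2"))
bad = shared_prefix_len(build_prompt_bad("hi", "a1"), build_prompt_bad("bye", "b2"))
print(f"stable block: {stable} chars, good layout shares {good}, bad layout shares {bad}")
```

With the stable blocks first, the whole system prompt and tool definitions fall inside the shared prefix; with the request ID first, the prompts diverge before any of the reusable content is reached.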

Implementation pattern bridge

Token Budget Management
Model Routing

Combines route-based model selection (high risk = strong model, low risk = lightweight), per-agent token budgets, and caching/batching strategies to control the cost curve.

Academic References

Practices for Governing Agentic AI Systems — OpenAI
The Rise of Agentic AI: Architectures, Taxonomies, and Evaluation Metrics — Future Internet (MDPI)

Related patterns

Module Contract: Declare the conditions, authority, and failure paths of every execution unit.
Evaluation and Guardrails: Distinguish acceptable judgment from dangerous judgment via evaluation criteria and safety rules.