Use prompt caching for repeated system prompts
Long, repeated system or tool-definition blocks can cost 90% less when cached. Restructure prompts so static content comes first.
A practical, opinionated set of techniques for reducing AI API costs without sacrificing quality.
Long, repeated system or tool-definition blocks can cost 90% less when cached. Restructure prompts so static content comes first.
Most providers offer 50% off when results can return within 24 hours. Ideal for evals, backfills, and enrichment.
Classify intent with a small model, then escalate. A simple router can cut spend 60%+ on mixed workloads.
Output tokens are typically 4–5x the price of input. Set max_tokens, request concise schemas, and prefer structured outputs.
Drop turn-history older than what the model needs. Summarize older state into a short memory block.
Native structured outputs reduce retries and wasted tokens from malformed JSON.
Sub-$0.30 models handle most extraction and routing tasks at near-frontier accuracy.
Tag every call with feature, route, and model. You cannot optimize what you cannot attribute.
Run a 1000-call simulation against representative traffic. Project monthly spend before launch.