Cost control patterns for Azure OpenAI in production
Token spend on LLM features can go non-linear quickly. Here are the patterns I use to keep Azure OpenAI bills predictable without compromising the experience.
Every team that ships an LLM feature gets the same surprise sometime in month two: the Azure OpenAI bill is much, much bigger than the proof-of-concept suggested. Token consumption is sneaky.
Here are the cost control patterns I deploy on production Azure OpenAI workloads.
1. Cap context length explicitly, don't let it drift
Models accept enormous context windows now. That doesn't mean you should fill them.
A request with 100k input tokens and 1k output tokens costs roughly the same as 100 requests of 1k input + 1k output. Most chat sessions don't need the full conversation history; trim aggressively.
Pattern: keep last N user/assistant turns plus a summary of older turns. Cap total input tokens at a hard maximum (I usually start at 8k for chat, 32k for analysis tasks). Reject or summarize if exceeded.
2. Pick the cheapest model that meets your quality bar
Most teams reach for the most powerful model by default. Then they run a quality eval and discover that gpt-4o-mini does the job 90% as well at one-tenth the cost.
The discipline is: define your quality threshold (with an eval set), then run every candidate model through it, pick the cheapest one that passes.
3. Cache common responses
A surprising portion of LLM traffic in production is the same question, asked many times. Cache aggressively.
Two layers:
- Exact-match cache: straight key-value (Redis), TTL of a few hours. Trivial to implement, hits 5–15% of traffic in most apps.
- Semantic cache: store query embeddings, look up by similarity. Hits more queries (10–30%) but requires careful tuning to avoid serving "close enough" responses.
In production, exact-match alone has paid for the engineering time on every project I've shipped.
4. Batch where you can
For background processing tasks (summarization of overnight data, classification of new tickets), use the Batch API. It's significantly cheaper, with a 24-hour SLA. Real-time chat can't use it; batch jobs absolutely should.
Don't batch real-time user requests by buffering them, you will add latency that costs more in user trust than the tokens saved.
5. Stream responses (yes, this saves money indirectly)
Streaming doesn't reduce token costs, but it lets users interrupt early if the response isn't useful. An interrupted response stops generation, which stops billing. Across millions of requests, this matters.
It also reduces the temptation users have to retry — a non-streamed slow response often gets retried, doubling the cost.
6. Prompt compression
Two things help:
- Use compact instructions. "You are a helpful assistant for our HR system. Answer based on the context provided." beats a 2000-token persona description.
- Compress retrieved context for RAG. Use a smaller model to summarize retrieved chunks before passing them to the answer model. Costs a small extra call but slashes the main call.
7. Use Provisioned Throughput Units (PTUs) once you're predictable
PTUs are pre-purchased capacity at a flat monthly rate, regardless of token volume. They're cheaper than pay-as-you-go if your load is consistent. The trick is knowing when you've hit that point.
Rule of thumb: if you're spending more than $1500/month on a single model with steady traffic, run the PTU calculator. The break-even is usually somewhere there.
For spiky workloads (occasional bursts), pay-as-you-go is still cheaper.
8. Region pricing arbitrage
Azure OpenAI pricing varies slightly by region. For workloads where data residency allows it, deploy to a cheaper region. For those that don't, this isn't relevant — but worth checking.
9. Set Azure budgets and alerts on the resource
Cost Management → Budgets at the resource level. Alert at 50%, 90%, and 100%. The alert at 100% should not automatically disable the resource. You don't want production outages from a cost spike but it should page someone.
10. Track tokens per request as a custom metric
In Application Insights, log:
{ requestId, model, promptTokens, completionTokens, totalCost }
Once you have this in App Insights, you can build dashboards showing:
- Token cost per user.
- Token cost per feature.
- Outliers (requests with >5x the median tokens).
That last one is gold. It's how you find the bug that's accidentally including the entire database in every prompt.
What I deliberately don't do
- Aggressive output token capping: It produces truncated responses and worse user experience. Save tokens upstream, not at the very end.
- Forcing low-temperature for everything: Temperature affects cost very weakly. Tune it for quality, not for budget.
- Switching to non-Azure models for cost: The operational cost of multi-vendor exceeds the savings, in my experience.
The summary
LLM cost control is mostly about input tokens: cap them, cache them, compress them. Output tokens come along for the ride. Get the input side right and your bills become predictable.