How to reduce LLM token costs
Token bills add up fastest where you least expect them — in the structured data you paste into prompts. Here are seven techniques that actually move the number, ordered from highest to lowest leverage for most applications.
1. Compress structured data with TOON
If your prompts include JSON — API responses, database rows, tool outputs, RAG context — this is usually the biggest single win. Converting that JSON to TOON removes the braces, brackets, commas, and repeated field names that JSON spends tokens on, typically cutting 30-60% off the structured portion of the prompt with no loss of information.
2. Trim the context you actually send
Most prompts carry data the model never needs. Drop unused fields before sending, summarize long documents instead of pasting them whole, and select only the top-k retrieved chunks in a RAG pipeline rather than everything that matched.
3. Use prompt caching
OpenAI, Anthropic, and Google all support caching a stable prompt prefix. Put your fixed system instructions and schema at the top so repeated calls read them from cache at a fraction of the price, and vary only the part that changes per request.
4. Right-size the model
Route easy, high-volume calls to a smaller, cheaper model and reserve the frontier model for the requests that genuinely need it. A tiered setup often cuts cost more than any prompt tweak, because the bulk of traffic moves to a cheaper tier.
5. Batch where latency allows
Most providers offer a batch API at roughly half price for work that does not need an instant response — overnight enrichment, evals, bulk classification. If it is not user-facing, batch it.
6. Cap output length
Output tokens cost more than input tokens. Set a sensible max_tokens, ask for concise answers, and request structured output so the model does not pad the response with prose you will throw away.
7. Measure before and after
You cannot optimize what you do not count. Log token usage per request type, find the few endpoints that dominate the bill, and apply the techniques above there first.
Start with the easy win: paste a representative payload into the converter to see your exact TOON savings on your real data before you change anything in your prompts.