Real AI Costs in Production 2026: Optimisation from API to GPU

07. 02. 2026 · Updated: 24. 03. 2026 · 7 min read

“AI is cheap,” say the vendor slides. Reality: an enterprise company with 50,000 queries per day on a GPT-4 class model pays $15,000–$45,000 per month for inference alone. And that does not include embeddings, fine-tuning or infrastructure. This is a guide to the real costs — and strategies that reduce them by 50–80%.

Pricing Landscape at the Start of 2026

The LLM API market has gone through a massive price war over the past year. Prices have dropped 60–90% compared to early 2024. But beware — the price per token is only part of the story. Real costs depend on how many tokens you generate, and output tokens are 3–5× more expensive than input.

| Model (Q1 2026) | Input / 1M tokens | Output / 1M tokens | Typical use case |
|---|---|---|---|
| GPT-4.1 | $2.00 | $8.00 | General purpose, coding |
| GPT-4.1 mini | $0.40 | $1.60 | Cost-efficient tasks |
| Claude Sonnet 4 | $3.00 | $15.00 | Complex reasoning, coding |
| Claude Haiku 3.5 | $0.80 | $4.00 | Fast responses, classification |
| Claude Opus 4 | $15.00 | $75.00 | Frontier reasoning |
| Gemini 2.5 Pro | $1.25 | $10.00 | Multimodal, long context |
| Gemini 2.5 Flash | $0.15 | $0.60 | High-volume, low-cost |
| DeepSeek V3 | $0.28 | $0.42 | Budget reasoning |
| Llama 3.3 70B (self-hosted) | ~$0.20* | ~$0.20* | On-premise, data sovereignty |

* Self-hosted price is approximate — depends on GPU hardware, utilisation and amortisation. Includes A100/H100 hosting + electricity.

What One Query Costs: Cost per Query Breakdown

A typical enterprise query (RAG pipeline with context) averages 2,000 input tokens (prompt + retrieved context) and 500 output tokens (response). Based on this:

| Model | Cost per query | Per day (50K queries) | Per month |
|---|---|---|---|
| GPT-4.1 | $0.008 | $400 | $12,000 |
| GPT-4.1 mini | $0.0016 | $80 | $2,400 |
| Claude Sonnet 4 | $0.0135 | $675 | $20,250 |
| Claude Haiku 3.5 | $0.0036 | $180 | $5,400 |
| Gemini 2.5 Flash | $0.0006 | $30 | $900 |
| DeepSeek V3 | $0.00077 | $38.50 | $1,155 |

The difference between the most expensive and cheapest option is 22×. And we are talking about a simple RAG query. For agentic systems where a single user request generates 5–15 LLM calls, costs multiply accordingly.
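
The per-query arithmetic is easy to reproduce and wire into your own dashboards. A minimal sketch in Python, using the prices and the 2,000-in / 500-out token profile from the tables above (adjust both to your own traffic):

```python
# Per-query and monthly cost estimate for a typical RAG query:
# 2,000 input tokens (prompt + retrieved context) and 500 output tokens.
# Prices are USD per 1M tokens, taken from the Q1 2026 pricing table above.
PRICES = {
    "gpt-4.1":          (2.00, 8.00),
    "gpt-4.1-mini":     (0.40, 1.60),
    "claude-sonnet-4":  (3.00, 15.00),
    "claude-haiku-3.5": (0.80, 4.00),
    "gemini-2.5-flash": (0.15, 0.60),
    "deepseek-v3":      (0.28, 0.42),
}

def cost_per_query(model: str, input_tokens: int = 2_000, output_tokens: int = 500) -> float:
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

for model in PRICES:
    q = cost_per_query(model)
    print(f"{model:18} ${q:.4f}/query   ${q * 50_000 * 30:>9,.0f}/month at 50K queries/day")
```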

Hidden Costs Vendors Do Not Mention

API pricing is the tip of the iceberg. The full TCO includes:

  • Embedding generation — every document in the knowledge base must go through an embedding model. For 100K documents that is a one-off $50–200, but re-indexing after every update is an ongoing cost
  • Vector database hosting — Pinecone $70+/month, managed Qdrant $100+/month; self-hosting needs RAM (1M vectors ≈ 4–8 GB)
  • Prompt engineering and evals — 20–40% of engineering time goes into prompts, testing and iterations. This is usually the single largest cost item
  • Observability — LangSmith, Langfuse, custom — $200–2,000/month for production monitoring
  • Guardrails and safety — content filtering, PII detection, compliance checks — additional latency and costs
  • Retry and error handling — rate limits, 5xx errors, timeout retries = 10–20% extra calls

Real-World Example: Enterprise Chatbot

A company with 2,000 employees, internal knowledge base chatbot. 50,000 queries/day, RAG pipeline with Claude Sonnet.

API inference: $20,250/month · Embeddings + vector DB: $500/month · Observability: $500/month · Engineering (0.5 FTE): $5,000/month

Total: ~$26,250/month = $315,000/year

Strategy #1: Semantic Caching

The simplest and most effective optimisation. 30–60% of queries in enterprise chatbots are repeated (or semantically similar). Instead of a new LLM call, you return a cached response.

  • How it works: Query → embedding → similarity search in cache → if similarity > 0.95, return cached response
  • Tools: GPTCache, Redis + vector search, custom implementation with pgvector
  • Typical savings: 30–50% of API calls, latency from 2–5s to <100ms for cache hits
  • Watch out for: Cache invalidation on knowledge base changes, TTL policy, cache poisoning
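
A minimal sketch of the flow described above, using a plain in-memory cache and cosine similarity over normalised embeddings. The `embed()` function is a placeholder for whichever embedding model you already run; production setups typically back this with Redis, GPTCache or pgvector:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.95
_cache: list[tuple[np.ndarray, str]] = []   # (query embedding, cached response)

def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding model and return a unit-length vector."""
    raise NotImplementedError

def answer(query: str, call_llm) -> str:
    q = embed(query)
    # Cosine similarity reduces to a dot product for unit-length vectors.
    for cached_vec, cached_response in _cache:
        if float(np.dot(q, cached_vec)) > SIMILARITY_THRESHOLD:
            return cached_response            # cache hit: no API call, <100 ms
    response = call_llm(query)                # cache miss: pay for the LLM call
    _cache.append((q, response))
    return response
```

Remember the caveats from the list above: invalidate the cache when the knowledge base changes and set a sensible TTL.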

Strategy #2: Model Routing (Smart Cascading)

Not every query needs a frontier model. “How many employees do we have?” can be handled by a model at $0.0006/query. “Analyse this contract and identify risks” needs a model at $0.013/query.

  • Principle: A classifier (small model or rule-based) evaluates query complexity and routes to the appropriate model
  • Architecture: Input → Complexity classifier → Router → [Small model | Medium model | Large model]
  • Typical split: 60% small model, 30% medium, 10% large = average cost drops by 60–70%
  • Tools: Martian, Portkey, Unify.ai, or a custom router with embeddings-based classification
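
A minimal sketch of such a cascade. The complexity classifier here is a deliberately naive rule-based stand-in, and the model names and `call` wrapper are placeholders; in production you would replace the classifier with an embeddings-based or small-model one as mentioned above:

```python
# Route each query to the cheapest model tier that is likely good enough.
TIERS = {
    "small":  "gemini-2.5-flash",   # ~$0.0006 per query
    "medium": "gpt-4.1-mini",       # ~$0.0016 per query
    "large":  "claude-sonnet-4",    # ~$0.0135 per query
}

def classify(query: str) -> str:
    """Toy rule-based complexity classifier; swap in a trained classifier in production."""
    analytical = any(kw in query.lower() for kw in ("analyse", "analyze", "contract", "risk", "compare"))
    if analytical or len(query) > 400:
        return "large"
    if len(query) > 120:
        return "medium"
    return "small"

def route(query: str, call) -> str:
    """`call` is your own LLM client wrapper, e.g. call(model=..., prompt=...)."""
    return call(model=TIERS[classify(query)], prompt=query)
```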

Routing in Practice: 84% Savings

Without routing: 50,000 queries/day × Claude Sonnet 4 = $20,250/month

With routing: 30,000 × Gemini 2.5 Flash ($540) + 15,000 × GPT-4.1 mini ($720) + 5,000 × Claude Sonnet 4 ($2,025) = $3,285/month

Savings: $16,965/month (≈84%)

Strategy #3: Prompt Optimisation

Every unnecessary token costs money. And most prompts are 2–3× longer than they need to be.

  • System prompt audit: Shorten system prompts. 500 tokens of instructions → 150 tokens with the same result = 70% savings on system prompt overhead
  • Context window management: Do not send the entire conversation history. Summarise, trim or use a sliding window
  • Retrieved context pruning: RAG often returns 5–10 chunks. A reranker (Cohere Rerank, BGE Reranker) selects the top 2–3 and discards the rest
  • Output length control: Set max_tokens. Without a limit, the model generates until it decides to stop — and output tokens are 3–5× more expensive
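
A minimal sketch of two of these levers: a sliding window that trims conversation history to a fixed token budget, and max_tokens to cap the expensive output. Token counting uses tiktoken; the model name, budget and example conversation are illustrative:

```python
import tiktoken
from openai import OpenAI

enc = tiktoken.get_encoding("cl100k_base")
client = OpenAI()

def trim_history(messages: list[dict], budget: int = 1_500) -> list[dict]:
    """Keep the system prompt plus the newest messages that fit into the token budget."""
    system, rest = messages[0], messages[1:]
    kept, used = [], 0
    for msg in reversed(rest):                    # walk from newest to oldest
        tokens = len(enc.encode(msg["content"]))
        if used + tokens > budget:
            break
        kept.insert(0, msg)
        used += tokens
    return [system] + kept

conversation = [
    {"role": "system", "content": "You are the internal helpdesk assistant. Answer briefly."},
    {"role": "user", "content": "How many vacation days do I have left this year?"},
]

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=trim_history(conversation),   # sliding window over the history
    max_tokens=300,                        # cap the 3-5x more expensive output tokens
)
print(response.choices[0].message.content)
```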

Strategy #4: Knowledge Distillation

Have a frontier model that handles your use case perfectly? Distil its knowledge into a smaller model. Result: 90% of the quality at 10% of the cost.

  • Process: Large model generates training data → Fine-tune a small model on that data → Deploy the small model
  • Example: GPT-4 generates 10,000 examples for ticket classification → Fine-tune Llama 3.1 8B → Deploy on your own GPU at $0.0002/query
  • When it works: Tasks with clearly defined scope (classification, extraction, summarisation). Does not work for open-ended reasoning
  • Tools: OpenAI fine-tuning API, Anyscale, Modal, custom training pipeline with PEFT/LoRA
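
A minimal sketch of the first step of that process: letting the expensive teacher model label historical tickets once, offline, and writing the results out as chat-style JSONL (the format OpenAI's fine-tuning API expects; other stacks use very similar layouts). The model name, labels and prompt are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()
LABELS = ["billing", "access", "hardware", "other"]
tickets = ["I cannot log into the VPN", "My March invoice is wrong"]  # replace with real historical tickets

def label_ticket(ticket: str) -> str:
    """One-off teacher call: expensive per example, but paid only while building the dataset."""
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": f"Classify the ticket into one of: {', '.join(LABELS)}. Reply with the label only."},
            {"role": "user", "content": ticket},
        ],
        max_tokens=5,
    )
    return resp.choices[0].message.content.strip()

with open("distillation_train.jsonl", "w") as f:
    for ticket in tickets:
        example = {"messages": [
            {"role": "user", "content": ticket},
            {"role": "assistant", "content": label_ticket(ticket)},
        ]}
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

The resulting file is what you feed into the fine-tuning job for the small student model.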

Strategy #5: Self-Hosting for High Volume

Above a certain volume, self-hosting is cheaper than API. The break-even point depends on the model and utilisation:

| Setup | Monthly cost | Break-even vs API |
|---|---|---|
| Llama 3.3 70B on 2× A100 (cloud) | ~$4,500 | ~150K queries/day vs GPT-4.1 |
| Llama 3.1 8B on 1× L40S (cloud) | ~$800 | ~25K queries/day vs GPT-4.1 mini |
| Mistral 7B on-premise (1× A100) | ~$200 (electricity) | Immediate, but CapEx $15K–25K |

Self-hosting makes sense when: (a) volume exceeds break-even, (b) data must not leave your infrastructure (regulation, compliance), or (c) you need a custom model and fine-tuning is simpler locally.
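
If you do self-host, a minimal offline-inference sketch with vLLM looks roughly like this (it assumes 2× A100 and access to the Llama 3.3 70B Instruct weights; for a production API you would more typically run vLLM's OpenAI-compatible server instead):

```python
from vllm import LLM, SamplingParams

# Tensor parallelism splits the 70B model across both A100s.
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=2)

params = SamplingParams(temperature=0.2, max_tokens=500)
prompts = ["Summarise our travel expense policy in three bullet points."]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```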

Bonus: Prompt Caching from Providers

Both Anthropic and OpenAI offer prompt caching at the API level — repeated prefixes (system prompt, conversation context) are cached and charged at a discount:

  • Anthropic: Cached input at 10% of the standard price (90% discount). Cache write at 125% of the standard price. TTL 5 minutes
  • OpenAI: Automatic caching for repeated prefixes. Cached input at 50% of the standard price
  • Impact: For a RAG pipeline with 1,500 tokens of system prompt and 500 tokens of context — a cache hit saves 50–90% of input costs
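
A minimal sketch of Anthropic-style prompt caching: the long, static system prompt is marked with cache_control, so repeated requests read it from the cache at the discounted rate. The model ID and prompt are illustrative; check the current Anthropic documentation for exact minimum cacheable sizes and TTL options:

```python
from anthropic import Anthropic

client = Anthropic()
LONG_SYSTEM_PROMPT = "You are the internal knowledge-base assistant for ..."  # ~1,500 tokens in practice

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=500,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},   # cached prefix: writes cost 125%, reads 10%
    }],
    messages=[{"role": "user", "content": "What is our per diem for business trips to Germany?"}],
)
print(response.content[0].text)
```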

Optimisation Roadmap: From Day 1 to Month 6

  1. Week 1–2: Instrumentation — Add metrics: cost per request, tokens in/out, latency, model. You cannot optimise what you do not measure
  2. Week 3–4: Prompt optimisation — Shorten prompts, add a reranker, set max_tokens. Savings: 20–30%
  3. Month 2: Semantic caching — Implement caching for repeated queries. Savings: another 20–40%
  4. Month 3: Model routing — Classifier + multi-model setup. Savings: another 30–50%
  5. Month 4–6: Distillation/self-hosting — For high-volume, well-defined tasks. Savings: another 50–80% on those tasks
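
For the instrumentation step, a minimal sketch of per-request cost tracking on top of an OpenAI-style client: the API already returns token counts, so cost, latency and model can be logged for every call. The prices dictionary and the logging backend are placeholders for your own setup:

```python
import logging
import time
from openai import OpenAI

client = OpenAI()
log = logging.getLogger("llm_costs")
PRICES = {"gpt-4.1": (2.00, 8.00), "gpt-4.1-mini": (0.40, 1.60)}   # USD per 1M tokens

def tracked_completion(model: str, messages: list[dict], **kwargs):
    start = time.monotonic()
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    latency = time.monotonic() - start

    usage = response.usage
    in_price, out_price = PRICES[model]
    cost = usage.prompt_tokens / 1e6 * in_price + usage.completion_tokens / 1e6 * out_price
    log.info("model=%s tokens_in=%d tokens_out=%d latency=%.2fs cost=$%.5f",
             model, usage.prompt_tokens, usage.completion_tokens, latency, cost)
    return response
```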

Conclusion

AI in production does not have to cost hundreds of thousands. But without optimisation, it will. Key takeaways:

  • Price per token is only part of TCO — engineering time, observability and infrastructure are often more expensive than the API
  • Model routing is the single biggest win — 60–80% savings with minimal quality loss
  • Semantic caching is a quick win with ROI within 2 weeks
  • Self-hosting makes sense from 100K+ queries/day or when compliance requires it
  • Start with instrumentation — you cannot optimise what you do not measure