Real AI Costs in Production 2026: Optimisation from API to GPU

07. 02. 2026 · Updated: 24. 03. 2026 · 7 min read

“AI is cheap,” say the vendor slides. Reality: an enterprise company with 50,000 queries per day on a GPT-4 class model pays $15,000–$45,000 per month for inference alone. And that does not include embeddings, fine-tuning or infrastructure. This is a guide to the real costs — and strategies that reduce them by 50–80%.

Pricing Landscape at the Start of 2026

The LLM API market has gone through a massive price war over the past year. Prices have dropped 60–90% compared to early 2024. But beware — the price per token is only part of the story. Real costs depend on how many tokens you generate, and output tokens are 3–5× more expensive than input.

| Model (Q1 2026) | Input / 1M tokens | Output / 1M tokens | Typical use case |
|---|---|---|---|
| GPT-4.1 | $2.00 | $8.00 | General purpose, coding |
| GPT-4.1 mini | $0.40 | $1.60 | Cost-efficient tasks |
| Claude Sonnet 4 | $3.00 | $15.00 | Complex reasoning, coding |
| Claude Haiku 3.5 | $0.80 | $4.00 | Fast responses, classification |
| Claude Opus 4 | $15.00 | $75.00 | Frontier reasoning |
| Gemini 2.5 Pro | $1.25 | $10.00 | Multimodal, long context |
| Gemini 2.5 Flash | $0.15 | $0.60 | High-volume, low-cost |
| DeepSeek V3 | $0.28 | $0.42 | Budget reasoning |
| Llama 3.3 70B (self-hosted) | ~$0.20* | ~$0.20* | On-premise, data sovereignty |

* Self-hosted price is approximate — depends on GPU hardware, utilisation and amortisation. Includes A100/H100 hosting + electricity.

What One Query Costs: Cost per Query Breakdown

A typical enterprise query (RAG pipeline with context) averages 2,000 input tokens (prompt + retrieved context) and 500 output tokens (response). Based on this:

| Model | Cost per query | Per day (50K queries) | Per month |
|---|---|---|---|
| GPT-4.1 | $0.008 | $400 | $12,000 |
| GPT-4.1 mini | $0.0016 | $80 | $2,400 |
| Claude Sonnet 4 | $0.0135 | $675 | $20,250 |
| Claude Haiku 3.5 | $0.0036 | $180 | $5,400 |
| Gemini 2.5 Flash | $0.0006 | $30 | $900 |
| DeepSeek V3 | $0.00077 | $38.50 | $1,155 |

The difference between the most expensive and cheapest option is 22×. And we are talking about a simple RAG query. For agentic systems where a single user request generates 5–15 LLM calls, costs multiply accordingly.
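
The per-query arithmetic is easy to reproduce and wire into your own dashboards. A minimal sketch in Python, using the prices and the 2,000-in / 500-out token profile from the tables above (adjust both to your own traffic):

```python
# Per-query and monthly cost estimate for a typical RAG query:
# 2,000 input tokens (prompt + retrieved context) and 500 output tokens.
# Prices are USD per 1M tokens, taken from the Q1 2026 pricing table above.
PRICES = {
    "gpt-4.1":          (2.00, 8.00),
    "gpt-4.1-mini":     (0.40, 1.60),
    "claude-sonnet-4":  (3.00, 15.00),
    "claude-haiku-3.5": (0.80, 4.00),
    "gemini-2.5-flash": (0.15, 0.60),
    "deepseek-v3":      (0.28, 0.42),
}

def cost_per_query(model: str, input_tokens: int = 2_000, output_tokens: int = 500) -> float:
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

for model in PRICES:
    q = cost_per_query(model)
    print(f"{model:18} ${q:.4f}/query   ${q * 50_000 * 30:>9,.0f}/month at 50K queries/day")
```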

Hidden Costs Vendors Do Not Mention

API pricing is the tip of the iceberg. The full TCO includes:

  • Embedding generation — every document in the knowledge base must go through an embedding model. For 100K documents that is a one-off $50–200, but re-indexing after every update is an ongoing cost
  • Vector database hosting — Pinecone $70+/month, managed Qdrant $100+/month; self-hosting needs RAM (1M vectors ≈ 4–8 GB)
  • Prompt engineering and evals — 20–40% of engineering time goes into prompts, testing and iterations. This is usually the single largest cost item
  • Observability — LangSmith, Langfuse, custom — $200–2,000/month for production monitoring
  • Guardrails and safety — content filtering, PII detection, compliance checks — additional latency and costs
  • Retry and error handling — rate limits, 5xx errors, timeout retries = 10–20% extra calls

Real-World Example: Enterprise Chatbot

A company with 2,000 employees, internal knowledge base chatbot. 50,000 queries/day, RAG pipeline with Claude Sonnet.

API inference: $20,250/month · Embeddings + vector DB: $500/month · Observability: $500/month · Engineering (0.5 FTE): $5,000/month

Total: ~$26,250/month = $315,000/year

Strategy #1: Semantic Caching

The simplest and most effective optimisation. 30–60% of queries in enterprise chatbots are repeated (or semantically similar). Instead of a new LLM call, you return a cached response.

  • How it works: Query → embedding → similarity search in cache → if similarity > 0.95, return cached response
  • Tools: GPTCache, Redis + vector search, custom implementation with pgvector
  • Typical savings: 30–50% of API calls, latency from 2–5s to <100ms for cache hits
  • Watch out for: Cache invalidation on knowledge base changes, TTL policy, cache poisoning
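
A minimal sketch of the flow described above, using a plain in-memory cache and cosine similarity over normalised embeddings. The `embed()` function is a placeholder for whichever embedding model you already run; production setups typically back this with Redis, GPTCache or pgvector:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.95
_cache: list[tuple[np.ndarray, str]] = []   # (query embedding, cached response)

def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding model and return a unit-length vector."""
    raise NotImplementedError

def answer(query: str, call_llm) -> str:
    q = embed(query)
    # Cosine similarity reduces to a dot product for unit-length vectors.
    for cached_vec, cached_response in _cache:
        if float(np.dot(q, cached_vec)) > SIMILARITY_THRESHOLD:
            return cached_response            # cache hit: no API call, <100 ms
    response = call_llm(query)                # cache miss: pay for the LLM call
    _cache.append((q, response))
    return response
```

Remember the caveats from the list above: invalidate the cache when the knowledge base changes and set a sensible TTL.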

Strategy #2: Model Routing (Smart Cascading)

Not every query needs a frontier model. “How many employees do we have?” can be handled by a model at $0.0006/query. “Analyse this contract and identify risks” needs a model at $0.013/query.

  • Principle: A classifier (small model or rule-based) evaluates query complexity and routes to the appropriate model
  • Architecture: Input → Complexity classifier → Router → [Small model | Medium model | Large model]
  • Typical split: 60% small model, 30% medium, 10% large = average cost drops by 60–70%
  • Tools: Martian, Portkey, Unify.ai, or a custom router with embeddings-based classification
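
A minimal sketch of such a cascade. The complexity classifier here is a deliberately naive rule-based stand-in, and the model names and `call` wrapper are placeholders; in production you would replace the classifier with an embeddings-based or small-model one as mentioned above:

```python
# Route each query to the cheapest model tier that is likely good enough.
TIERS = {
    "small":  "gemini-2.5-flash",   # ~$0.0006 per query
    "medium": "gpt-4.1-mini",       # ~$0.0016 per query
    "large":  "claude-sonnet-4",    # ~$0.0135 per query
}

def classify(query: str) -> str:
    """Toy rule-based complexity classifier; swap in a trained classifier in production."""
    analytical = any(kw in query.lower() for kw in ("analyse", "analyze", "contract", "risk", "compare"))
    if analytical or len(query) > 400:
        return "large"
    if len(query) > 120:
        return "medium"
    return "small"

def route(query: str, call) -> str:
    """`call` is your own LLM client wrapper, e.g. call(model=..., prompt=...)."""
    return call(model=TIERS[classify(query)], prompt=query)
```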

Routing in Practice: 84% Savings

Without routing: 50,000 queries/day × Claude Sonnet 4 = $20,250/month

With routing: 30,000 × Gemini 2.5 Flash ($540) + 15,000 × GPT-4.1 mini ($720) + 5,000 × Claude Sonnet 4 ($2,025) = $3,285/month

Savings: $16,965/month (≈84%)

Strategy #3: Prompt Optimisation

Every unnecessary token costs money. And most prompts are 2–3× longer than they need to be.

  • System prompt audit: Shorten system prompts. 500 tokens of instructions → 150 tokens with the same result = 70% savings on system prompt overhead
  • Context window management: Do not send the entire conversation history. Summarise, trim or use a sliding window
  • Retrieved context pruning: RAG often returns 5–10 chunks. A reranker (Cohere Rerank, BGE Reranker) selects the top 2–3 and discards the rest
  • Output length control: Set max_tokens. Without a limit, the model generates until it decides to stop — and output tokens are 3–5× more expensive
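
A minimal sketch of two of these levers: a sliding window that trims conversation history to a fixed token budget, and max_tokens to cap the expensive output. Token counting uses tiktoken; the model name, budget and example conversation are illustrative:

```python
import tiktoken
from openai import OpenAI

enc = tiktoken.get_encoding("cl100k_base")
client = OpenAI()

def trim_history(messages: list[dict], budget: int = 1_500) -> list[dict]:
    """Keep the system prompt plus the newest messages that fit into the token budget."""
    system, rest = messages[0], messages[1:]
    kept, used = [], 0
    for msg in reversed(rest):                    # walk from newest to oldest
        tokens = len(enc.encode(msg["content"]))
        if used + tokens > budget:
            break
        kept.insert(0, msg)
        used += tokens
    return [system] + kept

conversation = [
    {"role": "system", "content": "You are the internal helpdesk assistant. Answer briefly."},
    {"role": "user", "content": "How many vacation days do I have left this year?"},
]

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=trim_history(conversation),   # sliding window over the history
    max_tokens=300,                        # cap the 3-5x more expensive output tokens
)
print(response.choices[0].message.content)
```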

Strategy #4: Knowledge Distillation

Have a frontier model that handles your use case perfectly? Distil its knowledge into a smaller model. Result: 90% of the quality at 10% of the cost.

  • Process: Large model generates training data → Fine-tune a small model on that data → Deploy the small model
  • Example: GPT-4 generates 10,000 examples for ticket classification → Fine-tune Llama 3.1 8B → Deploy on your own GPU at $0.0002/query
  • When it works: Tasks with clearly defined scope (classification, extraction, summarisation). Does not work for open-ended reasoning
  • Tools: OpenAI fine-tuning API, Anyscale, Modal, custom training pipeline with PEFT/LoRA
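
A minimal sketch of the first step of that process: letting the expensive teacher model label historical tickets once, offline, and writing the results out as chat-style JSONL (the format OpenAI's fine-tuning API expects; other stacks use very similar layouts). The model name, labels and prompt are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()
LABELS = ["billing", "access", "hardware", "other"]
tickets = ["I cannot log into the VPN", "My March invoice is wrong"]  # replace with real historical tickets

def label_ticket(ticket: str) -> str:
    """One-off teacher call: expensive per example, but paid only while building the dataset."""
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": f"Classify the ticket into one of: {', '.join(LABELS)}. Reply with the label only."},
            {"role": "user", "content": ticket},
        ],
        max_tokens=5,
    )
    return resp.choices[0].message.content.strip()

with open("distillation_train.jsonl", "w") as f:
    for ticket in tickets:
        example = {"messages": [
            {"role": "user", "content": ticket},
            {"role": "assistant", "content": label_ticket(ticket)},
        ]}
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

The resulting file is what you feed into the fine-tuning job for the small student model.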

Strategy #5: Self-Hosting for High Volume

Above a certain volume, self-hosting is cheaper than API. The break-even point depends on the model and utilisation:

| Setup | Monthly cost | Break-even vs API |
|---|---|---|
| Llama 3.3 70B on 2× A100 (cloud) | ~$4,500 | ~150K queries/day vs GPT-4.1 |
| Llama 3.1 8B on 1× L40S (cloud) | ~$800 | ~25K queries/day vs GPT-4.1 mini |
| Mistral 7B on-premise (1× A100) | ~$200 (electricity) | Immediate, but CapEx $15K–25K |

Self-hosting makes sense when: (a) volume exceeds break-even, (b) data must not leave your infrastructure (regulation, compliance), or (c) you need a custom model and fine-tuning is simpler locally.
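
If you do self-host, a minimal offline-inference sketch with vLLM looks roughly like this (it assumes 2× A100 and access to the Llama 3.3 70B Instruct weights; for a production API you would more typically run vLLM's OpenAI-compatible server instead):

```python
from vllm import LLM, SamplingParams

# Tensor parallelism splits the 70B model across both A100s.
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=2)

params = SamplingParams(temperature=0.2, max_tokens=500)
prompts = ["Summarise our travel expense policy in three bullet points."]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```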

Bonus: Prompt Caching from Providers

Both Anthropic and OpenAI offer prompt caching at the API level — repeated prefixes (system prompt, conversation context) are cached and charged at a discount:

  • Anthropic: Cached input at 10% of the standard price (90% discount). Cache write at 125% of the standard price. TTL 5 minutes
  • OpenAI: Automatic caching for repeated prefixes. Cached input at 50% of the standard price
  • Impact: For a RAG pipeline with 1,500 tokens of system prompt and 500 tokens of context — a cache hit saves 50–90% of input costs
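
A minimal sketch of Anthropic-style prompt caching: the long, static system prompt is marked with cache_control, so repeated requests read it from the cache at the discounted rate. The model ID and prompt are illustrative; check the current Anthropic documentation for exact minimum cacheable sizes and TTL options:

```python
from anthropic import Anthropic

client = Anthropic()
LONG_SYSTEM_PROMPT = "You are the internal knowledge-base assistant for ..."  # ~1,500 tokens in practice

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=500,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},   # cached prefix: writes cost 125%, reads 10%
    }],
    messages=[{"role": "user", "content": "What is our per diem for business trips to Germany?"}],
)
print(response.content[0].text)
```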

Optimisation Roadmap: From Day 1 to Month 6

  1. Week 1–2: Instrumentation — Add metrics: cost per request, tokens in/out, latency, model. You cannot optimise what you do not measure
  2. Week 3–4: Prompt optimisation — Shorten prompts, add a reranker, set max_tokens. Savings: 20–30%
  3. Month 2: Semantic caching — Implement caching for repeated queries. Savings: another 20–40%
  4. Month 3: Model routing — Classifier + multi-model setup. Savings: another 30–50%
  5. Month 4–6: Distillation/self-hosting — For high-volume, well-defined tasks. Savings: another 50–80% on those tasks
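
For the instrumentation step, a minimal sketch of per-request cost tracking on top of an OpenAI-style client: the API already returns token counts, so cost, latency and model can be logged for every call. The prices dictionary and the logging backend are placeholders for your own setup:

```python
import logging
import time
from openai import OpenAI

client = OpenAI()
log = logging.getLogger("llm_costs")
PRICES = {"gpt-4.1": (2.00, 8.00), "gpt-4.1-mini": (0.40, 1.60)}   # USD per 1M tokens

def tracked_completion(model: str, messages: list[dict], **kwargs):
    start = time.monotonic()
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    latency = time.monotonic() - start

    usage = response.usage
    in_price, out_price = PRICES[model]
    cost = usage.prompt_tokens / 1e6 * in_price + usage.completion_tokens / 1e6 * out_price
    log.info("model=%s tokens_in=%d tokens_out=%d latency=%.2fs cost=$%.5f",
             model, usage.prompt_tokens, usage.completion_tokens, latency, cost)
    return response
```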

Conclusion

AI in production does not have to cost hundreds of thousands. But without optimisation, it will. Key takeaways:

  • Price per token is only part of TCO — engineering time, observability and infrastructure are often more expensive than the API
  • Model routing is the single biggest win — 60–80% savings with minimal quality loss
  • Semantic caching is a quick win with ROI within 2 weeks
  • Self-hosting makes sense from 100K+ queries/day or when compliance requires it
  • Start with instrumentation — you cannot optimise what you do not measure