Wavelength

February 8, 2026

Software Strategy
ai, llm, architecture, cost-optimization, engineering

Your AI Feature Will Cost 10x What You Think: An Engineer's Guide to LLM Cost Architecture

Token costs, caching strategies, model routing, and the hidden expenses that blow up AI feature budgets. How to architect for cost before it becomes a crisis.

Your prototype AI feature costs $14/day in API calls. Your CTO signs off. You ship it. Three weeks later, the invoice hits $9,200 for the month and someone sends a Slack message that just says "???"

This happens constantly. Not because teams are careless, but because LLM cost behavior is genuinely unintuitive. The economics of traditional software don't apply. Compute that scales linearly with users, cheap storage, predictable egress — none of that maps to token-based pricing. Token economics create non-linear cost curves that punish you for exactly the things production traffic does: longer inputs, retries, edge cases, monitoring. The prototype-to-production cost multiplier for LLM features typically lands between 5x and 15x, though we've seen 50x spikes in data-dense workflows.

The gap between a demo and a deployment comes down to five compounding costs. This guide walks through each one, then lays out a playbook for keeping them under control.

Why Prototypes Lie About Cost

A prototype AI feature and its production version are running the same model, the same API, the same basic prompt. So why does one cost an order of magnitude more? Five compounding factors.

1. Input length explosion. Your prototype was tested on clean, short examples. A 200-word customer support ticket. A 500-word product description. Production traffic includes the 4,000-word email thread someone forwarded in its entirety, complete with five layers of quoted replies and an HTML signature with a base64-encoded logo. It includes the support ticket that pastes in an entire error log. Input tokens are the single largest cost driver for most LLM features, and production inputs are 3-10x longer than test inputs on average.

2. Retries and error handling. Your prototype doesn't retry. It either works or you tweak the prompt and try again manually. In production, you need retry logic for rate limits (HTTP 429), timeouts, malformed responses, and the occasional model hiccup. A retry strategy of 3 attempts with exponential backoff means your worst-case cost per request is 3x your base cost. If 15% of requests need at least one retry and every attempt is billed, that adds roughly 15-30% to your bill (15% if each retry succeeds on the second attempt, 30% if each needs all three) before you've added a single feature.
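The retry overhead can be sketched as an expected number of billed attempts per request. This is a minimal model, not a measurement: the function name is hypothetical, and it assumes every attempt is billed in full and that retries fail independently at a fixed rate.

```python
def expected_attempts(first_failure_rate: float,
                      repeat_failure_rate: float,
                      max_attempts: int = 3) -> float:
    """Expected billed attempts per request.

    Assumptions (illustrative only): the first attempt fails with
    `first_failure_rate`, each later attempt fails with
    `repeat_failure_rate`, and every attempt costs the same.
    """
    expected = 1.0
    p_reach = first_failure_rate  # probability that attempt 2 is needed
    for _ in range(max_attempts - 1):
        expected += p_reach
        p_reach *= repeat_failure_rate
    return expected


# 15% of requests need a retry; vary how often the retry itself fails.
best_case = expected_attempts(0.15, 0.0)   # every retry succeeds
worst_case = expected_attempts(0.15, 1.0)  # every retry is exhausted
print(f"{best_case:.2f}x to {worst_case:.2f}x base cost")
```

Under these assumptions the overhead sits between 1.15x and 1.30x, which is where the 1.25x retry factor used later in this section comes from.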

3. Edge cases that balloon token counts. The long tail of user behavior produces inputs your prototype never saw. Multi-language content that needs extra context in the system prompt. Malformed data that triggers your fallback prompt (which is longer because it needs to be more explicit). Users who discover they can paste entire documents into a text field designed for a single paragraph. Every edge case you handle in your prompt adds tokens, and the edge cases that don't get handled generate bad outputs that you then need to reprocess.

4. Eval and monitoring overhead. Once your feature is live, you need to know if it's working. That means running evals: sending a representative sample of production inputs through your pipeline and scoring the outputs, either with a second LLM call (LLM-as-judge) or a human review process. Many teams also log full prompt/response pairs for debugging, which means storing and occasionally reprocessing gigabytes of token data. A typical eval suite that runs hourly on 100 production samples costs $50-200/day depending on the model. Nobody budgets for this during the prototype phase.

5. Feature creep at the prompt level. Your original prompt was 400 tokens. Then someone asked for the output in a specific JSON schema (+80 tokens of formatting instructions). Then you added few-shot examples to handle a category the model kept getting wrong (+300 tokens). Then you added safety guardrails (+150 tokens). Then someone wanted the model to explain its reasoning for audit purposes (+100 tokens of chain-of-thought instruction). Your system prompt is now 1,030 tokens, and it ships with every single request. At 100,000 requests/month on a mid-tier frontier model, that prompt bloat alone costs an extra few hundred dollars a month in input tokens.

Multiply these factors together. 5x longer inputs, 1.25x for retries, 2x system prompt growth, plus eval overhead. You're at 12-15x your prototype cost without any increase in user volume. Add volume growth and you understand why that $14/day becomes $9,200/month.
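The multiplication can be written out directly. The sketch below uses only the factors quoted in the text and assumes a 30-day month; eval overhead is left out because it is quoted as a dollar range rather than a ratio, so the final figure is a floor, not a forecast.

```python
from math import prod

# Multipliers quoted in the text; treating them as independent is a simplification.
factors = {
    "longer_inputs": 5.0,
    "retries": 1.25,
    "system_prompt_growth": 2.0,
}

multiplier = prod(factors.values())
prototype_monthly = 14 * 30                      # $14/day, assuming a 30-day month
production_monthly = prototype_monthly * multiplier

# Whatever remains between this figure and the invoice is volume growth
# plus eval and monitoring overhead.
remaining_gap = 9200 / production_monthly

print(f"multiplier: {multiplier}x")
print(f"production estimate at flat volume: ${production_monthly:,.0f}/month")
print(f"gap left to explain the $9,200 invoice: {remaining_gap:.2f}x")
```

At flat volume the three factors already take $420/month to $5,250/month; a further 1.75x from volume and eval overhead accounts for the rest of the invoice.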

The Cost Architecture Playbook

The fix isn't "use a cheaper model." That's like optimizing a database by deleting rows. The fix is designing your LLM integration with cost as a first-class architectural constraint. Six strategies, in order of impact.

1. Prompt caching

Most LLM providers now support prompt caching: if the first N tokens of your request match a recently cached prompt prefix, you pay a fraction of the input token cost for those tokens. The discount is typically around 90%, though exact pricing, minimum prefix lengths, and cache lifetimes vary by provider and change frequently. This is close to free money if you structure your prompts correctly.

The key: put your static content first. System instructions, few-shot examples, schema definitions, and safety guardrails should all appear at the beginning of the prompt, before any user-specific content. This maximizes the cacheable prefix. If your system prompt is 1,000 tokens and you're making 100,000 requests/month, prompt caching can save you hundreds of dollars a month on input costs alone. Structure matters.
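A minimal sketch of the static-first layout, with a rough savings estimate. The prompt text, the function names, and the $3 per million input tokens price are all assumptions for illustration; the savings figure is an upper bound that assumes every request hits the cache and ignores any cache-write surcharge your provider may apply.

```python
import os

# Static content: identical for every request, so it forms the cacheable prefix.
STATIC_PREFIX = "\n\n".join([
    "SYSTEM: You classify support tickets into one category.",
    "EXAMPLE: 'Where is my order?' -> shipping",
    'SCHEMA: {"category": "string"}',
])


def build_prompt(user_text: str) -> str:
    """Static content first, per-request content last."""
    return f"{STATIC_PREFIX}\n\nTICKET: {user_text}"


def monthly_cache_savings(prefix_tokens: int, requests: int,
                          price_per_mtok: float, discount: float = 0.90) -> float:
    """Upper-bound savings: assumes every request hits the cache."""
    return prefix_tokens * requests / 1_000_000 * price_per_mtok * discount


a = build_prompt("My package never arrived.")
b = build_prompt("I was charged twice.")

# Character-level proxy for the shared token prefix between two requests.
shared = os.path.commonprefix([a, b])
print(len(shared) >= len(STATIC_PREFIX))

# 1,000-token prefix, 100,000 requests/month, hypothetical $3 per million tokens.
print(monthly_cache_savings(1_000, 100_000, price_per_mtok=3.0))
```

If the ticket text came before the instructions, the shared prefix would shrink to almost nothing and the cache would rarely hit, which is why ordering matters more than prompt length here.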

2. Model routing

Not every request needs your most capable (and expensive) model. A classification task ("is this email a complaint, a question, or a compliment?") doesn't need a frontier reasoning model. A content moderation check doesn't need your most expensive option. Model routing means sending each request to the cheapest model that can handle it reliably.

A practical architecture: use a small, fast model as a classifier/router. Small models are now 50-100x cheaper per token than their frontier counterparts. The router reads the input and decides which model should handle the actual work. Simple classification and extraction tasks stay on the small model. Complex reasoning, long-form generation, and ambiguous cases get routed to a more capable model. In our experience, 60-70% of production traffic for most features can be handled by a small model. That cuts your per-request model cost by 40-50% on average.

The router adds negligible cost but introduces a minor latency hit — usually a worthwhile trade for 50% savings. The trick is building a good eval suite so you can verify the small model's accuracy on each task category. Route aggressively, but measure what matters.
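The routing decision itself can be quite small. In this sketch the tier names, the task categories, and the confidence threshold are hypothetical, and the classifier is passed in as a function so a real small-model call can replace the keyword stub used for the demo.

```python
from typing import Callable, Tuple

SMALL, LARGE = "small-model", "frontier-model"  # hypothetical tier names

# Task categories the small model has passed evals on (illustrative).
SMALL_MODEL_TASKS = {"classification", "extraction", "moderation"}


def route(text: str,
          classify_task: Callable[[str], Tuple[str, float]],
          min_confidence: float = 0.8) -> str:
    """Pick the cheapest tier that is trusted for this request."""
    task, confidence = classify_task(text)
    if task in SMALL_MODEL_TASKS and confidence >= min_confidence:
        return SMALL
    # Ambiguous or complex work falls through to the capable model.
    return LARGE


def keyword_classifier(text: str) -> Tuple[str, float]:
    """Stand-in for the small router model; a real router would call an LLM."""
    if text.lower().startswith("classify"):
        return "classification", 0.95
    return "generation", 0.60


print(route("Classify: is this email a complaint?", keyword_classifier))
print(route("Draft a detailed reply to this customer.", keyword_classifier))
```

Defaulting to the larger model when the router is unsure trades some savings for safety; the eval suite is what tells you when a category can be moved into `SMALL_MODEL_TASKS`.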

3. Response caching

If two users ask the same question (or close enough), you don't need to call the LLM twice. Semantic caching matches requests by meaning rather than by exact text, and in our deployments it catches 15-40% of requests for features like FAQ bots, documentation search, and customer support classification.

Implementation: embed each incoming request, compare against your cache of recent request/response pairs using cosine similarity, and return the cached response if similarity exceeds your threshold (typically 0.92-0.95). The embedding call costs roughly 100x less than the LLM generation call. For high-volume features with repetitive inputs, this is the single highest-impact optimization.

Two caveats. First, response caching works poorly for personalized outputs or tasks where slight input differences should produce meaningfully different results. Don't cache a summarization feature where the input is always different. Do cache a classification feature where 200 variations of "I want to cancel my subscription" should all return the same category. Second, be cautious with semantic caching in high-stakes domains like legal, medical, or financial compliance — a false cache hit on a subtly different query can produce dangerously wrong answers. Tighten your similarity threshold or skip caching entirely for these use cases.
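The embed, compare, and threshold steps can be sketched in a few lines. This is an in-memory linear scan for illustration only: the class name is hypothetical, the embedding function is injected, and the bag-of-words `toy_embed` stands in for a real embedding model. A production cache would also need eviction and a vector index.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.93):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []

    def get(self, request: str) -> Optional[str]:
        """Return the cached response for the closest request, if close enough."""
        query = self.embed(request)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, request: str, response: str) -> None:
        self.entries.append((self.embed(request), response))


VOCAB = ["cancel", "subscription", "refund", "order", "where"]


def toy_embed(text: str) -> List[float]:
    """Toy bag-of-words embedding, used here only to make the demo runnable."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


cache = SemanticCache(toy_embed)
cache.put("I want to cancel my subscription", "category: cancellation")
print(cache.get("please cancel subscription"))  # hit
print(cache.get("where is my order"))           # miss
```

Raising `threshold` is the lever for high-stakes domains: it lowers the hit rate but makes a false hit on a subtly different query less likely.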

4. Max token guardrails

Set explicit max_tokens on every API call. Without this, a single malformed request can generate a 4,000-token response when you expected 200 tokens. Multiply that by a few thousand requests and you've got an invoice surprise.

More importantly, set input token limits in your application layer. Truncate or chunk inputs that exceed your expected range. If your feature is designed to classify customer support tickets, and the average ticket is 300 tokens, set a hard limit at 1,500 tokens. Anything longer gets truncated (with a note to the user) rather than sent to the model at full length. This single guardrail would have prevented the most expensive production incidents we've seen.
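An application-layer input limit might look like the sketch below. It estimates tokens at roughly four characters each, which is an assumption that only loosely holds for English text; in practice you would count with your provider's tokenizer. The function name and defaults are illustrative.

```python
from typing import Tuple


def truncate_input(text: str, max_tokens: int = 1500,
                   chars_per_token: int = 4) -> Tuple[str, bool]:
    """Return (possibly truncated text, was_truncated).

    Uses a rough characters-per-token estimate instead of a real tokenizer.
    """
    budget = max_tokens * chars_per_token
    if len(text) <= budget:
        return text, False
    cut = text[:budget]
    # Prefer to cut at a whitespace boundary so a word is not split in half.
    head, _, _ = cut.rpartition(" ")
    return (head or cut), True


ticket = "error log line " * 2000          # a pasted log, far over the limit
safe_text, was_truncated = truncate_input(ticket)
if was_truncated:
    print("Note to user: your message was shortened before processing.")
print(len(safe_text) <= 1500 * 4)
```

Returning the flag alongside the text lets the caller show the truncation note to the user and log how often the limit is hit.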

5. Batching

If your feature processes items that don't need real-time responses (nightly report generation, bulk classification, content moderation queues), use batch APIs. Both Anthropic and OpenAI offer batch endpoints that process requests asynchronously at 50% of the standard per-token price. The trade-off is latency: batch jobs complete within 24 hours rather than seconds. For any offline processing pipeline, this is a straightforward 50% cost reduction.

Even for near-real-time features, micro-batching can help. If your system receives 10 classification requests within a 2-second window, you can combine them into a single prompt ("Classify each of the following 10 items:") rather than making 10 separate API calls. This reduces per-request overhead and, with prompt caching, dramatically cuts your input token costs.
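Micro-batching needs two pieces: a prompt that numbers the items, and a parser that maps the reply back to them. The reply format (`<number>: <label>`) and both function names are assumptions for this sketch; the parser fails loudly if the model skips an item, so a bad batch can be retried or split.

```python
from typing import Dict, List


def build_batch_prompt(instructions: str, items: List[str]) -> str:
    """Combine several items into one request that shares a single system prompt."""
    numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))
    return (
        f"{instructions}\n\n"
        f"Classify each of the following {len(items)} items. "
        f"Reply with one line per item in the form '<number>: <label>'.\n\n"
        f"{numbered}"
    )


def parse_batch_reply(reply: str, expected: int) -> List[str]:
    """Map a numbered reply back to item order; raise if any item is missing."""
    labels: Dict[int, str] = {}
    for line in reply.splitlines():
        number, sep, label = line.partition(":")
        if sep and number.strip().isdigit():
            labels[int(number)] = label.strip()
    if sorted(labels) != list(range(1, expected + 1)):
        raise ValueError("reply does not cover every item")
    return [labels[i] for i in range(1, expected + 1)]


items = ["Where is my order?", "Your product is great!"]
prompt = build_batch_prompt("You label support messages.", items)
print(prompt)
print(parse_batch_reply("1: question\n2: compliment", expected=len(items)))
```

The saving comes from sending the instructions once per batch instead of once per item; the cost is that one malformed reply now affects several requests, so keep batches small.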

6. Structured output constraints

When you need JSON or a specific format, use the provider's structured output features (function calling, tool use, JSON mode) rather than asking the model in natural language to format its response. Structured output modes tend to be more token-efficient because the model skips the preamble, code fences, and explanatory filler that often surround free-form output, and they all but eliminate the retry costs from malformed responses. A JSON schema constraint typically reduces output tokens by 15-25% compared to a natural language formatting instruction, and drops the malformed-response rate from 3-5% to near zero.

A Real Cost Architecture

Here's what this looks like in practice. The example below is a composite of support-platform projects we've delivered: AI-powered ticket categorization and suggested-response drafting for a platform handling around 50,000 tickets/month.

The prototype used a frontier model for everything. Estimated monthly cost: ~$800. We built the cost model before writing a line of production code and estimated actual production cost at $6,500/month with naive deployment.

The architecture we shipped in a 100-hour sprint:

  • Router layer (small model): classifies ticket type, urgency, and language. Handles 65% of tickets that only need classification. Cost: ~$40/month.
  • Draft generation (mid-tier frontier model): generates suggested responses for the 35% of tickets that need them. Cost: ~$1,200/month.
  • Prompt caching: system prompt (850 tokens of instructions and examples) cached across all requests. Savings: ~$180/month.
  • Response cache: semantic deduplication for the 22% of tickets that are near-identical ("where is my order?" variations). Savings: ~$400/month.
  • Input guardrails: tickets truncated at 2,000 tokens. The 8% of tickets that exceeded this were mostly forwarded email threads where the relevant content was in the first few paragraphs anyway.

Final monthly cost: ~$1,100. That's 83% less than naive deployment and only 37% more than the (wildly inaccurate) prototype estimate. The difference between "AI feature we can afford to run" and "AI feature that gets killed in Q3 budget review."
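The headline percentages follow directly from the three monthly figures quoted above. This short check uses only those numbers plus the 50,000 tickets/month volume.

```python
naive, prototype, shipped = 6500, 800, 1100   # monthly costs quoted in the text
tickets_per_month = 50_000

reduction_vs_naive = 1 - shipped / naive           # share of naive cost avoided
increase_vs_prototype = shipped / prototype - 1    # overshoot vs prototype estimate
cost_per_ticket = shipped / tickets_per_month

print(f"vs naive deployment: {reduction_vs_naive:.0%} lower")
print(f"vs prototype estimate: {increase_vs_prototype:.1%} higher")
print(f"cost per ticket: ${cost_per_ticket:.3f}")
```

Cost per ticket is the number worth tracking month to month, since it stays comparable as volume changes.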

When to Do This Work

The worst time to think about LLM costs is after your first production invoice. The second worst time is during the build phase, when you're making architectural decisions that lock in cost structures.

The right time is during discovery. Before you write code, before you pick a model, you should have a cost model that answers: what does this feature cost per request, per user, per month at 10x current volume? What are the input length distributions in real production data (not your test set)? Where are the caching opportunities? Which subtasks can run on a smaller model?

We spec cost architecture during the discovery phase of every AI feature we build. The cost model is a deliverable alongside the technical spec. It includes per-request cost estimates at three traffic levels, a caching strategy, a model routing plan, and specific guardrails with dollar-value justifications. This takes about 4-6 hours of work, and it has saved clients from shipping features that would have been killed within two months of launch.
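A first-pass cost model can start as small as the sketch below. Every number in the example is a placeholder: the token counts, the per-million-token prices, and the three traffic levels are assumptions to be replaced with measured input distributions and your provider's current pricing. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    input_tokens: int              # average input tokens per request (measure this)
    output_tokens: int             # average output tokens per request
    input_price_per_mtok: float    # dollars per million input tokens
    output_price_per_mtok: float   # dollars per million output tokens
    retry_overhead: float = 1.25   # expected billed attempts per request

    def per_request(self) -> float:
        base = (self.input_tokens * self.input_price_per_mtok
                + self.output_tokens * self.output_price_per_mtok) / 1_000_000
        return base * self.retry_overhead

    def monthly(self, requests: int) -> float:
        return self.per_request() * requests


# Placeholder inputs: 2,000 input tokens, 300 output tokens, $3 and $15 per million.
model = CostModel(input_tokens=2000, output_tokens=300,
                  input_price_per_mtok=3.0, output_price_per_mtok=15.0)

for label, requests in [("current", 50_000), ("3x", 150_000), ("10x", 500_000)]:
    print(f"{label:>8}: ${model.monthly(requests):,.2f}/month")
```

Caching hit rates and routing shares can be added as further fields once the baseline is agreed; the point of starting this small is that the model exists before the first architectural decision is made.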

Cost Is Architecture

LLM costs are not like server costs. They don't scale linearly, they're sensitive to input characteristics you don't control, and they compound across retry logic, monitoring, and prompt evolution in ways that are invisible until the invoice arrives. The teams that run AI features profitably in production are the ones that treat cost as an architectural constraint from day one, not a line item they optimize later. Build the cost model first. Then build the feature.
