Geokaun Mountain, Valentia Island, Co. Kerry — K. Mitch Hodge / Unsplash
In the last post, we looked at data architecture — why getting the data layer right is not preparation for your AI project, but the project itself. Assuming you have done that work, or at least started it, the next question is where your AI actually runs, and what it costs when real users start using it.
This is where a lot of otherwise well-designed systems quietly fall apart.
A RAG-based search tool launches successfully. The demo worked, the rollout was smooth, the first users are happy. Then a few weeks later the cloud bill arrives — nearly four times the projection. Latency is creeping past three seconds during peak hours.
No egregious errors were made. The team just hadn’t designed their infrastructure for the realities of scaling an AI workload.
The problem is structural. Development environments are forgiving. You run a few hundred requests, costs are negligible, latency is fine, everything looks good. Production is different in ways that are non-linear. Inference costs scale with usage. Embedding pipelines run constantly rather than on demand. Vector stores grow and their query costs grow with them. Caching behaviour that was irrelevant at low volume becomes critical at scale. None of these problems are hard to solve — but they are hard to retrofit once the system is already live and users are already complaining.
The answer is to treat infrastructure and cost as first-class architectural concerns from the start, not optimisation tasks you will get to later.
AI workloads have a different cost profile from traditional software. Understanding that profile before you build is the difference between a system that scales commercially and one that becomes a liability as it grows.
Inference. The most visible cost, and for most systems the most controllable. Model selection alone can change your cost structure by an order of magnitude — choosing a smaller model for tasks that do not require frontier capability is not a compromise, it is good architecture.
Embedding pipelines. Running embedding operations at ingestion time — and potentially at query time — is often treated as a one-time setup cost. For systems where data changes frequently, it is an ongoing operational cost that compounds quietly.
Vector storage and retrieval. Not free, and not linear with data volume. Query costs in particular can behave unexpectedly as indexes grow — worth stress-testing before you are in production.
Reranking and post-processing. A second model pass that re-scores retrieval candidates. It meaningfully improves output quality, but it adds latency and cost to every query. Usually worth it, but worth deciding deliberately rather than discovering in production.
Human review and intervention. The one that almost never appears in infrastructure cost models, and almost always should. AI systems operating in high-stakes domains (customer decisions, clinical data, financial advice, legal documents) require human review loops. Those loops have a real operational cost that grows with usage. Design them to be efficient from the start, rather than bolting them on as an afterthought when something goes wrong.
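To make the cost components above concrete, here is a minimal per-request sketch. Every number in it is an illustrative placeholder rather than real vendor pricing, and the class and field names are hypothetical. Ingestion-time embedding is a batch cost rather than a per-request one, so only the query-time embedding appears here.

```python
from dataclasses import dataclass


@dataclass
class RequestCostProfile:
    """Per-request cost drivers. Every figure passed in is a placeholder."""
    input_tokens: int
    output_tokens: int
    input_price_per_1k: float   # USD per 1,000 input tokens (hypothetical)
    output_price_per_1k: float  # USD per 1,000 output tokens (hypothetical)
    embedding_cost: float       # USD to embed the query
    vector_query_cost: float    # USD per vector store lookup
    rerank_cost: float          # USD per reranking pass
    review_rate: float          # fraction of requests escalated to a human
    review_cost: float          # USD of reviewer time per escalated request

    def breakdown(self) -> dict:
        """Return the expected cost of one request, split by component."""
        inference = (
            self.input_tokens / 1000 * self.input_price_per_1k
            + self.output_tokens / 1000 * self.output_price_per_1k
        )
        return {
            "inference": inference,
            "embedding": self.embedding_cost,
            "vector_query": self.vector_query_cost,
            "rerank": self.rerank_cost,
            # Expected value: most requests cost nothing here, a few cost a lot.
            "human_review": self.review_rate * self.review_cost,
        }

    def total(self) -> float:
        return sum(self.breakdown().values())

    def largest_component(self) -> str:
        parts = self.breakdown()
        return max(parts, key=parts.get)


# Illustrative inputs only: 2% of requests reviewed at 1.50 USD each.
example = RequestCostProfile(
    input_tokens=2000, output_tokens=500,
    input_price_per_1k=0.003, output_price_per_1k=0.015,
    embedding_cost=0.0001, vector_query_cost=0.0002, rerank_cost=0.001,
    review_rate=0.02, review_cost=1.50,
)
```

With these made-up inputs, `example.largest_component()` is `"human_review"`: even a 2% review rate outweighs inference, which is why that line belongs in the cost model.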
Most infrastructure decisions are reversible. These are not, or at least not easily. Getting them right early saves significant rework later.
Managed APIs vs. Self-Hosted Models. Managed endpoints like OpenAI’s API or Anthropic’s Claude are the fastest way to get off the ground. At scale, though, they are typically the most expensive route per token. If you’re handling sensitive healthcare or financial data, you might be forced into self-hosting open-weight models (like Llama 3 or Mistral) just to meet data residency requirements. Self-hosting demands much higher upfront engineering effort, but it gives you far more direct control over cost and latency. The key is making this decision deliberately, rather than defaulting to an API just because it’s easy.
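One way to make the managed versus self-hosted decision deliberate is a rough break-even calculation. The sketch below assumes a simple cost shape: a managed API charges purely per token, while self-hosting has a fixed monthly cost (hardware plus engineering time) and a smaller marginal cost per token. The function names and all figures are hypothetical.

```python
from typing import Optional


def managed_monthly_cost(tokens_per_month: float, price_per_1k: float) -> float:
    """Pure pay-per-token cost of a managed API."""
    return tokens_per_month / 1000 * price_per_1k


def self_hosted_monthly_cost(
    fixed_monthly: float, tokens_per_month: float, marginal_per_1k: float
) -> float:
    """Fixed infrastructure and staffing cost plus a marginal per-token cost."""
    return fixed_monthly + tokens_per_month / 1000 * marginal_per_1k


def break_even_tokens(
    price_per_1k: float, fixed_monthly: float, marginal_per_1k: float
) -> Optional[float]:
    """Monthly token volume above which self-hosting is cheaper.

    Returns None when self-hosting never catches up, i.e. when its marginal
    cost per token is not lower than the managed price.
    """
    saving_per_1k = price_per_1k - marginal_per_1k
    if saving_per_1k <= 0:
        return None
    return fixed_monthly / saving_per_1k * 1000


# Placeholder figures: 0.01 USD per 1k tokens managed, 8,000 USD per month
# fixed and 0.002 USD per 1k tokens marginal when self-hosted.
volume = break_even_tokens(0.01, 8000.0, 0.002)
```

With these placeholder figures the break-even sits at roughly one billion tokens per month. Below that volume the managed API is cheaper on cost alone, although data residency requirements can override the arithmetic entirely.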
Caching architecture. At low volume, caching feels like a premature optimisation. At production scale, it is one of the highest-leverage cost controls available. Semantic caching, where sufficiently similar queries return a cached result instead of triggering a new inference call, can cut inference costs significantly for systems with predictable query patterns. Designing for it from the start is far easier than retrofitting it into a live system.
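A minimal sketch of the semantic caching idea follows. It is an illustration under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector index, and the 0.8 similarity threshold is arbitrary. A production cache would also need eviction and invalidation, which are omitted here.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: lower-cased word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # arbitrary; tune against real traffic
        self.entries: List[Tuple[Counter, str]] = []
        self.hits = 0
        self.misses = 0

    def lookup(self, query: str) -> Optional[str]:
        """Return the cached answer for the most similar stored query, if any."""
        q = toy_embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:  # linear scan; fine for a sketch
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            return best_answer
        return None

    def get_or_compute(self, query: str, compute: Callable[[str], str]) -> str:
        """Serve from cache when a similar query exists, else call the model."""
        cached = self.lookup(query)
        if cached is not None:
            self.hits += 1
            return cached
        self.misses += 1
        answer = compute(query)  # the expensive inference call
        self.entries.append((toy_embed(query), answer))
        return answer
```

The threshold is the design decision that matters: set it too low and users receive answers to questions they did not ask, set it too high and the cache rarely hits. The `hits` and `misses` counters give the cache hit rate needed for cost estimates.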
Observability infrastructure. You cannot optimise what you cannot see. Production AI systems need to log inputs, outputs, latency, costs, and quality signals at the request level — not just aggregate metrics. This is the infrastructure that lets you answer “why did that response cost three times as much as expected?” and “why did quality drop last Tuesday?” Build it before you need it, because you will need it.
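Request-level logging can start as small as the sketch below. It assumes a hypothetical `call_model` callable that returns the response text plus input and output token counts, and placeholder per-token prices. It also assumes that storing raw prompts and responses is acceptable, which may not hold in regulated domains where redaction is required first.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Callable, Tuple


@dataclass
class RequestRecord:
    request_id: str
    model: str
    prompt: str
    response: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    quality: dict = field(default_factory=dict)  # e.g. user rating, eval score


def traced_call(
    model: str,
    prompt: str,
    call_model: Callable[[str], Tuple[str, int, int]],  # hypothetical client
    price_in_per_1k: float,   # placeholder price, USD per 1,000 input tokens
    price_out_per_1k: float,  # placeholder price, USD per 1,000 output tokens
    sink: Callable[[str], None],
) -> RequestRecord:
    """Run one model call and emit one JSON log line describing it."""
    start = time.perf_counter()
    response, in_tok, out_tok = call_model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    record = RequestRecord(
        request_id=str(uuid.uuid4()),
        model=model,
        prompt=prompt,
        response=response,
        input_tokens=in_tok,
        output_tokens=out_tok,
        latency_ms=latency_ms,
        cost_usd=in_tok / 1000 * price_in_per_1k + out_tok / 1000 * price_out_per_1k,
    )
    sink(json.dumps(asdict(record)))  # one line per request, not an aggregate
    return record
```

Because each line carries its own token counts, latency and cost, a question like "why did that response cost three times as much as expected?" becomes a lookup by `request_id` rather than an investigation.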
Cost often gets the most attention in infrastructure conversations. Latency deserves equal weight, because latency is not just an engineering metric. It is a product experience.
The tolerance for latency varies enormously by use case. A background document summarisation job that runs overnight has almost unlimited latency budget. A customer-facing chat interface where users are waiting for a response has a budget measured in seconds, and the drop-off in engagement beyond three or four seconds is steep. An agent completing a multi-step task autonomously sits somewhere between those extremes, with the added complexity that it may need to chain multiple model calls.
Understanding your latency requirements before you choose your model and pipeline architecture is not a product management concern that gets handed over the fence. It is an architecture input. Frontier models are more capable and slower. Smaller models are faster and cheaper and sufficient for many tasks. A pipeline that chains three model calls in sequence has a minimum latency equal to the sum of those calls, before any network or orchestration overhead. These are architectural choices with direct product consequences.
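The arithmetic behind a latency budget is simple enough to sketch. The model below assumes stages run one after another, calls inside a stage run concurrently, and overhead between stages is ignored. The per-stage timings are invented, and the 3,000 ms budget is taken from the three-second drop-off mentioned earlier.

```python
from typing import List

# A pipeline is a list of stages. Each stage is a list of call latencies in
# milliseconds for calls that run concurrently within that stage.
Pipeline = List[List[int]]


def pipeline_latency_ms(stages: Pipeline) -> int:
    """Lower bound on end-to-end latency.

    Sequential stages add up; within a stage the slowest concurrent call
    determines how long the stage takes. Each stage must be non-empty.
    """
    return sum(max(stage) for stage in stages)


def within_budget(stages: Pipeline, budget_ms: int) -> bool:
    return pipeline_latency_ms(stages) <= budget_ms


# Invented timings: retrieval, rerank, generation, all in sequence.
rag_pipeline = [[200], [300], [1800]]

# Same pipeline with a second sequential model call added.
with_extra_call = rag_pipeline + [[1200]]

# Two retrievals issued concurrently cost only the slower of the two.
parallel_retrieval = [[200, 350], [300], [1800]]
```

Here `rag_pipeline` comes to 2,300 ms and fits a 3,000 ms budget, while `with_extra_call` reaches 3,500 ms and does not. Running the two retrievals concurrently costs 2,450 ms rather than 2,850 ms, which is the kind of saving that has to be designed in rather than patched in.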
The goal is not precision; it is order-of-magnitude awareness. Before committing to an architecture, build a simple model that answers one question: what does this system cost per month at each step up in usage?
If the numbers look uncomfortable at 10,000, the architecture needs revisiting before it is built, not after. The inputs are not complicated: average tokens per request, model pricing, embedding pipeline volume, vector store query costs, and an estimate of cache hit rate. Rough numbers are sufficient. The point is to find the inflection point — the usage level at which the system becomes commercially unviable — and make sure that point is well beyond your realistic growth trajectory.
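The inputs listed above fit in a few lines of code. This is a rough sketch under stated assumptions: usage is measured in requests per day (the unit is an assumption), a cache hit skips both inference and retrieval, costs are otherwise linear, and "commercially unviable" is approximated by a fixed monthly budget ceiling. All figures are placeholders.

```python
from typing import List, Optional


def monthly_cost(
    requests_per_day: int,
    avg_tokens_per_request: int,
    price_per_1k_tokens: float,     # placeholder model pricing, USD
    cache_hit_rate: float,          # fraction of requests served from cache
    vector_query_cost: float,       # USD per vector store query
    docs_embedded_per_month: int,   # embedding pipeline volume
    embed_cost_per_doc: float,      # USD per embedded document
    days_per_month: int = 30,
) -> float:
    """Order-of-magnitude monthly cost at a given usage level."""
    requests = requests_per_day * days_per_month
    uncached = requests * (1 - cache_hit_rate)  # only these reach the model
    inference = uncached * avg_tokens_per_request / 1000 * price_per_1k_tokens
    retrieval = uncached * vector_query_cost
    embedding = docs_embedded_per_month * embed_cost_per_doc
    return inference + retrieval + embedding


def first_unviable_level(
    levels: List[int], monthly_budget: float, **inputs
) -> Optional[int]:
    """Smallest usage level whose monthly cost exceeds the budget, if any."""
    for level in sorted(levels):
        if monthly_cost(level, **inputs) > monthly_budget:
            return level
    return None


# Placeholder inputs; replace with your own estimates.
inputs = dict(
    avg_tokens_per_request=3000,
    price_per_1k_tokens=0.01,
    cache_hit_rate=0.3,
    vector_query_cost=0.0005,
    docs_embedded_per_month=50_000,
    embed_cost_per_doc=0.0002,
)
levels = [100, 1_000, 10_000, 100_000]
```

With these placeholder inputs, 10,000 requests per day costs about 6,415 USD per month, so `first_unviable_level(levels, 5000.0, **inputs)` returns 10,000. Changing `cache_hit_rate` or `price_per_1k_tokens` and rerunning shows which lever moves the inflection point furthest.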
Most teams that end up with cost problems skipped this exercise. Not because it is hard, but because the enthusiasm of early development makes it feel premature.
Get the infrastructure right and you gain three things that are genuinely hard to retrofit once you are in production.
Commercial predictability. You know what the system costs at different usage levels, and you have levers to control that cost as it grows. Inference optimisation, caching, model routing — managed choices rather than emergent surprises.
The ability to improve. Observable systems get better. Systems built without observability get optimised through guesswork and production incidents.
Optionality. Infrastructure designed for portability — where the model provider is an abstraction layer rather than a hard dependency — gives you the ability to switch providers as the model landscape evolves, or to move workloads to self-hosted infrastructure as scale justifies the investment. That optionality has real commercial value in a market that is still changing as fast as this one.
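The abstraction-layer idea, together with the model routing mentioned above, can be sketched with a structural interface. Everything here is hypothetical: `TextModel`, the two adapter classes and `ModelRouter` are illustrative names, and the adapters return canned strings where real ones would call a vendor SDK or a self-hosted endpoint.

```python
from typing import Dict, Protocol


class TextModel(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class ManagedApiModel:
    """Placeholder adapter; a real one would wrap a vendor SDK call."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    def complete(self, prompt: str) -> str:
        return f"[managed:{self.model_name}] {prompt}"


class SelfHostedModel:
    """Placeholder adapter; a real one would call your own inference server."""

    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint

    def complete(self, prompt: str) -> str:
        return f"[self-hosted:{self.endpoint}] {prompt}"


class ModelRouter:
    """Send each task type to a configured model, with a default fallback."""

    def __init__(self, routes: Dict[str, TextModel], default: str) -> None:
        if default not in routes:
            raise ValueError(f"default route {default!r} is not configured")
        self.routes = routes
        self.default = default

    def complete(self, task: str, prompt: str) -> str:
        model = self.routes.get(task, self.routes[self.default])
        return model.complete(prompt)


# Routing is configuration, so swapping a provider touches one line.
router = ModelRouter(
    routes={
        "classification": SelfHostedModel("small-model.internal"),
        "reasoning": ManagedApiModel("frontier-model"),
    },
    default="reasoning",
)
```

Application code calls `router.complete(task, prompt)` and never imports a vendor client directly. Moving a workload from a managed API to self-hosted infrastructure then means writing one adapter and changing one route, not rewriting call sites.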
The complexity is usually not the issue. The issue is making these decisions deliberately rather than defaulting into them — and keeping future optionality in mind even when the immediate requirement is simple.
The next post in this series covers evaluation and reliability — how to know your AI is working correctly, and how to build the feedback loops that keep it that way.
If the infrastructure and cost layer is something your team is working through — whether you are designing from scratch or trying to understand why a live system is behaving unexpectedly — get in touch. This is the kind of architectural review that pays for itself quickly.