Hook Head, Co. Wexford — Braden Collum / Unsplash
In the last post, we looked at infrastructure and cost — how to build AI systems that remain commercially viable as they scale, and why observability needs to be designed in from the start rather than bolted on when something goes wrong.
This post is about what you observe. Specifically: how do you know your AI system is working correctly? And how do you build the feedback loops that keep it working as your data, your users, and the models themselves change over time?
These are harder questions than they first appear. They are also the questions that most teams defer longest — and pay the highest price for deferring.
If you write a standard Python function, you can write a unit test for it. The code is deterministic. It either returns the expected integer or it doesn’t. You know immediately if a deployment broke something.
Working with Large Language Models strips away that safety net. Because the output is probabilistic, you aren’t testing for true/false anymore—you’re testing along a messy spectrum of quality. A model can return a response that is grammatically perfect, contextually plausible, completely hallucinated, and confidently expressed. Standard CI/CD checks won’t catch this. A unit test that verifies your function returns a string will pass every single time, telling you absolutely nothing about whether that string is actually useful.
This is not a criticism of AI systems — it is a structural property of probabilistic systems that requires a different approach to quality assurance. The engineering discipline of evaluating AI outputs is still maturing, but the core principles are well enough established that teams have no excuse for shipping production AI without some version of them in place.
The cost of getting this wrong is not abstract. An AI system that produces subtly wrong outputs with high confidence erodes user trust faster than a system that is obviously broken. Users will tolerate a system that occasionally says it does not know. They will not tolerate one that confidently misleads them — and once that trust is gone, it is very difficult to rebuild.
Think of AI evaluation as operating at three levels, each catching different categories of failure.
Automated functional tests are the floor. These are deterministic checks that verify the system is behaving as expected at a mechanical level: does the retrieval pipeline return results? Does the response stay within expected length bounds? Does the output match a known correct answer for a fixed set of test cases? They are fast and cheap, and they should run on every deployment. They catch regressions and infrastructure failures, but beyond those fixed cases they say nothing about output quality.
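As a rough illustration, the three mechanical questions above can be sketched as a single check. The `PipelineResult` shape, the length bounds, and the substring match are all assumptions made for this example, not a prescribed interface.

```python
from dataclasses import dataclass


@dataclass
class PipelineResult:
    """Hypothetical shape of one pipeline run: retrieved documents plus the final answer."""
    retrieved_docs: list[str]
    answer: str


def functional_check(result: PipelineResult, expected_substring: str,
                     min_chars: int = 20, max_chars: int = 1200) -> list[str]:
    """Return a list of failure descriptions; an empty list means every check passed."""
    failures = []
    # 1. Did retrieval return anything at all?
    if not result.retrieved_docs:
        failures.append("retrieval returned no documents")
    # 2. Is the answer within the expected length bounds?
    if not min_chars <= len(result.answer) <= max_chars:
        failures.append(f"answer length {len(result.answer)} outside [{min_chars}, {max_chars}]")
    # 3. Does the answer contain the known correct fact for this fixed test case?
    if expected_substring.lower() not in result.answer.lower():
        failures.append(f"answer does not mention {expected_substring!r}")
    return failures


# Illustrative golden case (made-up data).
result = PipelineResult(
    retrieved_docs=["Refunds are accepted within 30 days of purchase."],
    answer="You can request a refund within 30 days of purchase.",
)
print(functional_check(result, expected_substring="30 days"))  # prints []
```

Returning a list of failures rather than a single boolean makes it easier to see which part of the plumbing broke when a deployment check fails.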
LLM-as-judge evaluation handles the nuance that basic scripts can’t. You can use a more capable model to grade the outputs of your production model against a specific rubric. You’re effectively asking the judge model: “Did this response hallucinate? Was the tone appropriate?” It’s not a silver bullet, because judge models bring their own biases, but it scales far better than manual QA.
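A minimal sketch of that pattern follows. It assumes nothing about a particular provider: `call_judge` is a placeholder for whatever client function sends a prompt to your judge model and returns its text reply, and the rubric wording and JSON shape are illustrative choices.

```python
import json
from typing import Callable

RUBRIC_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Reply with JSON only: {{"hallucinated": true or false, "tone_ok": true or false, "reason": "<one sentence>"}}"""


def judge_response(call_judge: Callable[[str], str], question: str, context: str, answer: str) -> dict:
    """Ask the judge model to grade one answer; unparseable replies are flagged, not guessed at."""
    prompt = RUBRIC_PROMPT.format(question=question, context=context, answer=answer)
    raw = call_judge(prompt)
    try:
        verdict = json.loads(raw)
        return {
            "hallucinated": bool(verdict["hallucinated"]),
            "tone_ok": bool(verdict["tone_ok"]),
            "reason": str(verdict.get("reason", "")),
            "parse_error": False,
        }
    except (json.JSONDecodeError, KeyError, TypeError):
        # The judge did not follow the output format; keep the raw text for inspection.
        return {"hallucinated": None, "tone_ok": None, "reason": raw, "parse_error": True}


# Stand-in judge for demonstration; a real one would call your model provider.
def stub_judge(prompt: str) -> str:
    return '{"hallucinated": true, "tone_ok": true, "reason": "States a 60-day window; context says 30."}'


verdict = judge_response(
    stub_judge,
    question="How long do I have to return an item?",
    context="Refunds are accepted within 30 days of purchase.",
    answer="You have 60 days to return it.",
)
print(verdict["hallucinated"])  # prints True
```

Treating a malformed judge reply as its own outcome, rather than as a pass or a fail, keeps judge failures from silently skewing your quality numbers.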
Human review loops remain essential for high-stakes domains like clinical decisions or legal advice. The goal isn’t to eliminate human reviewers, but to respect their time: route the highly ambiguous cases to them, and use their corrections to refine your automated evaluation criteria over time.
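One way to decide which cases count as ambiguous is to run the judge several times (or use several judges) and send disagreements to a person. The sketch below assumes that setup; the route names and the rule that high-stakes failures always go to a human are illustrative, not a standard.

```python
def route(judge_votes: list[bool], high_stakes: bool) -> str:
    """Decide what happens to one output given pass/fail votes from repeated judge runs.

    Returns "auto_pass", "auto_fail", or "human" (hypothetical route names).
    """
    # No automated signal, or judges disagree: this is the ambiguous case.
    if not judge_votes or len(set(judge_votes)) > 1:
        return "human"
    unanimous_pass = judge_votes[0]
    if unanimous_pass:
        return "auto_pass"
    # Unanimous failure: in high-stakes domains a person still confirms it.
    return "human" if high_stakes else "auto_fail"


print(route([True, True, True], high_stakes=False))   # prints auto_pass
print(route([True, False, True], high_stakes=False))  # prints human
print(route([False, False], high_stakes=True))        # prints human
```

Logging each human decision alongside the judge votes that triggered it gives you the raw material for tightening the rubric later.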
A common trap is treating these three approaches as an à la carte menu. In reality, you need all of them. Basic scripts verify the plumbing. LLM-as-judge gives you scalable quality checks. Human review catches the edge cases that machines are blind to. They have to work together.
Before you can evaluate anything, you need to define what you are evaluating against. This is where many evaluation efforts stall — teams build the tooling and then discover they have not agreed on what a good output actually is.
The answer is not a vague aspiration. “Accurate and helpful” is not an evaluation criterion. You need something specific enough that two independent reviewers — human or model — would reach the same conclusion about any given output.
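One way to test whether a criterion is specific enough is to have two reviewers label the same outputs and measure how often they agree beyond chance. The sketch below computes Cohen's kappa, a standard agreement statistic, swapped in here as one possible measure; the labels are made-up data.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Agreement between two reviewers on the same outputs, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty set of outputs")
    n = len(labels_a)
    # Fraction of outputs where the two reviewers gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each reviewer labelled at random with their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels from two reviewers on the same eight outputs.
reviewer_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
reviewer_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(reviewer_1, reviewer_2), 3))  # prints 0.467
```

A low score usually points at the wording of the criterion rather than at the reviewers, which makes it a useful prompt to rewrite the rubric before building tooling around it.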
Factual correctness. Is the information in the output accurate? This requires ground truth — a set of known correct answers against which outputs can be checked. For retrieval-augmented systems, it also means verifying that claims in the output are supported by retrieved sources rather than generated from model weights.
Relevance. Does the output actually address what was asked? A response that is technically accurate but misses the point is a quality failure. Relevance is more subjective than factual correctness, so it benefits from explicit examples of relevant and irrelevant responses in your evaluation prompt.
Groundedness. For RAG systems specifically: are the claims in the output traceable to the retrieved context? Hallucination in a RAG system is typically a grounding failure — the model has drifted from the source material and started generating from its own knowledge. Measuring groundedness catches this before users do. It is the evaluation dimension that most teams add too late.
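As a very rough first signal, groundedness can be approximated by checking how much of each answer sentence's vocabulary appears in the retrieved context. This is a crude lexical proxy chosen for illustration, not how groundedness is usually measured in mature systems (an entailment model or an LLM judge is more typical); the stopword list and the 0.6 threshold are arbitrary assumptions.

```python
import re

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "or", "for",
             "on", "with", "that", "this", "it", "be", "can", "you", "your"}


def content_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def ungrounded_sentences(answer: str, context: str, min_overlap: float = 0.6) -> list[str]:
    """Return answer sentences whose content words are mostly absent from the context."""
    context_vocab = content_words(context)
    flagged = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if not words:
            continue
        overlap = len(words & context_vocab) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


context = "Refunds are accepted within 30 days of purchase. Items must be unused."
answer = "Refunds are accepted within 30 days of purchase. Shipping is free worldwide for premium members."
print(ungrounded_sentences(answer, context))
# prints ['Shipping is free worldwide for premium members.']
```

A check like this misses paraphrased hallucinations and flags legitimate rewording, so it is best treated as a cheap tripwire that decides which outputs deserve a closer look.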
Tone and policy compliance. Does the output match the voice and constraints of your product? This covers everything from factual brand claims to safety guardrails to the subtler question of whether the response feels right for your context and your users.
Expect these criteria to change. The failure modes that matter most in production are rarely the ones you anticipated before launch.
Even a system that evaluates well on day one will degrade over time. This is one of the properties of AI systems that surprises teams most consistently: you ship something that works, you move on to other things, and some weeks or months later you notice it is behaving differently. Not broken. Just worse. The degradation usually comes from one of three sources.
Data drift. The distribution of inputs your system receives changes as your user base grows and evolves. Queries that were rare in your test set become common in production. Edge cases become mainstream. A retrieval system built for a narrow set of use cases starts receiving queries it was never designed to handle well.
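If incoming queries are tagged with a category, a shift in that distribution can be quantified. The sketch below uses the population stability index, one common drift statistic; how queries get their category (a classifier, keyword rules, manual tags) is left as an assumption, and the sample data is made up.

```python
import math
from collections import Counter


def population_stability_index(baseline: list[str], current: list[str], eps: float = 1e-6) -> float:
    """PSI between two samples of category labels; 0.0 means identical distributions."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    base_counts, cur_counts = Counter(baseline), Counter(current)
    psi = 0.0
    for category in set(base_counts) | set(cur_counts):
        # Floor each proportion at eps so unseen categories do not cause log(0).
        p_base = max(base_counts[category] / len(baseline), eps)
        p_cur = max(cur_counts[category] / len(current), eps)
        psi += (p_cur - p_base) * math.log(p_cur / p_base)
    return psi


# Query categories in the original test set versus a recent production window.
test_set = ["billing"] * 80 + ["login"] * 20
last_week = ["billing"] * 40 + ["login"] * 40 + ["api"] * 20
print(population_stability_index(test_set, test_set))  # prints 0.0
print(population_stability_index(test_set, last_week) > 0.25)  # prints True
```

A commonly cited rule of thumb treats values above roughly 0.25 as a significant shift, but the right threshold depends on your traffic and is worth calibrating against your own history. Note that a category absent from the baseline, like `api` here, dominates the score, which is usually the behaviour you want.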
Model drift. If you are using a hosted API, the model you are calling today is not necessarily the model you will be calling in six months. Providers update models, sometimes with breaking changes to behaviour that are not documented as breaking changes. Your outputs can change without any action on your part. Pin model versions where possible and monitor for output distribution changes.
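A simple way to monitor for output distribution changes is to run a fixed set of probe prompts on a schedule and compare a numeric signal, such as response length or judge score, against a recorded baseline. The sketch below is one such comparison; `PINNED_MODEL` is a hypothetical identifier, and the choice of signal and alert threshold are assumptions.

```python
from statistics import mean, stdev

# Hypothetical pinned identifier; use whatever dated or versioned name your provider exposes.
PINNED_MODEL = "example-model-2024-06-01"


def output_shift(baseline: list[float], current: list[float]) -> float:
    """How far the current mean has moved from the baseline mean, in baseline standard deviations."""
    if len(baseline) < 2 or not current:
        raise ValueError("need at least two baseline values and one current value")
    spread = stdev(baseline)
    if spread == 0:
        # Baseline had no variation at all: any change in the mean is a shift.
        return 0.0 if mean(current) == mean(baseline) else float("inf")
    return abs(mean(current) - mean(baseline)) / spread


# Response lengths (in tokens) for the same probe prompts, recorded at two points in time.
at_launch = [100, 110, 90, 105, 95]
this_week = [140, 150, 145, 155, 160]
if output_shift(at_launch, this_week) > 3:
    print(f"Outputs from {PINNED_MODEL} have shifted; review before users notice.")
```

Because the probe prompts are fixed, a large shift here points at the model or the prompt pipeline rather than at a change in what users are asking.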
Data freshness drift. For systems that reason over your own data, the data changes and your retrieval index needs to keep up. A support assistant whose knowledge base was last indexed three months ago will confidently answer questions based on policies that no longer exist. Keeping the index current is an operational discipline, not a one-time setup task.
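A freshness check can be as simple as comparing when each source document last changed with when it was last indexed. The sketch below assumes you can produce two mappings from document ID to timestamp; the document IDs and dates are invented for the example.

```python
from datetime import datetime, timezone


def index_freshness(source_updated: dict[str, datetime],
                    indexed_at: dict[str, datetime]) -> dict[str, list[str]]:
    """Compare source timestamps with index timestamps, both keyed by document ID.

    "stale": documents missing from the index or changed since they were indexed.
    "orphaned": documents still in the index that no longer exist at the source.
    """
    stale = sorted(doc for doc, updated in source_updated.items()
                   if doc not in indexed_at or indexed_at[doc] < updated)
    orphaned = sorted(doc for doc in indexed_at if doc not in source_updated)
    return {"stale": stale, "orphaned": orphaned}


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


source = {"refund-policy": utc(2024, 5, 1), "shipping-policy": utc(2024, 1, 10)}
index = {"refund-policy": utc(2024, 2, 1), "shipping-policy": utc(2024, 2, 1),
         "loyalty-scheme": utc(2024, 2, 1)}
print(index_freshness(source, index))
# prints {'stale': ['refund-policy'], 'orphaned': ['loyalty-scheme']}
```

The orphaned list matters as much as the stale one: a retired policy that still lives in the index is exactly how an assistant ends up answering from rules that no longer exist.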
None of these drift sources announces itself. Catching them requires monitoring output quality signals over time — not just system health metrics like latency and error rates. This is why the observability infrastructure discussed in Part 3 is not optional: without it, drift is invisible until a user complains or a stakeholder notices.
Technical evaluation matters enormously. It is also insufficient on its own, because the trust that determines whether an AI initiative succeeds or fails is not only technical — it is organisational.
Engineering teams can be confident in a system that business stakeholders, customer-facing teams, and end users do not trust. That gap kills AI projects. It produces the situation where a technically sound system sits underused because the people who are supposed to benefit from it have not been given the evidence they need to rely on it.
Closing that gap requires translating technical quality metrics into terms that are meaningful to non-technical stakeholders. Precision and recall scores mean something to an ML engineer. They mean nothing to a customer service manager deciding whether to let the AI handle tier-one queries. What that manager needs is: out of a hundred responses, how many were good enough to send without review? How often does it get it wrong, and what kind of wrong? Those answers come from the same evaluation work — they just need to be presented differently.
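The translation step can be mechanical. The sketch below turns a batch of review records into the sentences that manager is asking for; the record fields (`sendable`, `error_type`) and the error labels are hypothetical and would come from your own review process.

```python
from collections import Counter


def stakeholder_summary(reviews: list[dict]) -> str:
    """Summarise review records as plain-language statements.

    Each record is assumed to look like {"sendable": bool, "error_type": str or None}.
    """
    if not reviews:
        raise ValueError("no reviews to summarise")
    total = len(reviews)
    sendable = sum(1 for r in reviews if r["sendable"])
    per_hundred = round(100 * sendable / total)
    lines = [f"About {per_hundred} in 100 responses were good enough to send without review."]
    # Break the failures down by kind, most frequent first.
    errors = Counter(r["error_type"] for r in reviews if not r["sendable"])
    for kind, count in errors.most_common():
        lines.append(f"- {kind}: {count} of {total} responses")
    return "\n".join(lines)


# Made-up batch of 20 reviewed responses.
reviews = (
    [{"sendable": True, "error_type": None}] * 17
    + [{"sendable": False, "error_type": "outdated policy"}] * 2
    + [{"sendable": False, "error_type": "wrong tone"}]
)
print(stakeholder_summary(reviews))
```

Reporting the kinds of error alongside the headline rate answers the "what kind of wrong" question directly, which is often what decides whether a team is comfortable handing over a category of queries.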
It also requires honesty about limitations. The fastest way to destroy stakeholder trust in an AI system is to oversell its capabilities and let the gaps surface in production. Setting accurate expectations upfront — here is what the system does well, here is where it struggles, here is how we know — builds a more durable foundation than an optimistic launch followed by a retreat.
The teams that get this right never really finish evaluating. Every production query is a signal. Every human review decision is training data for better evaluation criteria. Every drift event that gets caught and corrected makes the system more robust.
This compounding is the competitive advantage. A system that has been running in production for a year with good evaluation infrastructure is substantially more reliable than one that was well-built at launch but never systematically monitored. The gap widens over time. Teams that invested in evaluation infrastructure early keep improving. Teams that shipped and moved on find themselves rebuilding trust after incidents that could have been caught.
The question worth asking before you go live is not just “does this work now?” It is “do we have the infrastructure to know when it stops working?” That infrastructure — the test suite, the evaluation rubric, the monitoring, the human review loop — is what makes an AI system something you can stand behind in production, rather than something you hope works and deal with when it does not.
The next post in this series covers what to do when something goes wrong — incident response for AI systems, and how to build the runbooks and culture that make a bad day recoverable.
If you are building an AI system and are uncertain about how to structure your evaluation approach — or are already in production and suspect you have drift you cannot see — get in touch. Getting this right before an incident is substantially easier than rebuilding confidence after one.