Kymeca
AI Strategy

AI Readiness: Why Data Architecture Comes Before AI Architecture

By Mark · 8 min read · Part 2 of 5: AI Readiness

Peace Bridge, Derry, Northern Ireland — Adam Montgomery / Unsplash

In the first post in this series, we mapped the four challenges that determine whether an AI initiative succeeds or stalls. Data readiness was first on the list — and typically the longest lead-time item in any AI initiative.

The Sequence Problem

If you’ve ever tried to move an AI application from dummy data to your production database, you’ve probably hit “the wall.” The API calls are clean and the model behaves perfectly in isolation. But the moment you point it at your actual customer records, or the messy Zendesk knowledge base you’ve been building for three years, the magic fades. It doesn’t throw a 500 error. It just starts returning subtly wrong outputs. The retrieval misses obvious context, and the summaries start contradicting each other.

When this happens, engineers often blame the LLM. But in most cases, the model is doing exactly what it was asked to do.

AI systems amplify whatever they are given. A model retrieves, generates, and reasons over its inputs — it does not audit them. If the underlying data is inconsistent, incomplete, or undocumented, the model’s outputs inherit those problems, often in ways that are harder to detect than a direct query returning a null result. A bad SQL query fails loudly. A model reasoning over bad data fails quietly, and with confidence.

This is why the sequence matters — not as a process preference or an architectural formality, but as a technical reality. The decisions that define your AI system — which retrieval pattern to use, how to structure context, which model to deploy — are all downstream of the state of your data. You cannot architect the AI layer well without first knowing what you are building it on.

What Data Readiness Actually Means in an AI Context

The phrase gets used loosely. In a BI or analytics context, data readiness typically means clean dimensions, consistent date grains, and reliable aggregations. That is a meaningful bar. In an AI context, the bar is different — and in some ways higher.

Schema consistency. Do the same real-world concepts appear the same way across your data sources? A customer who appears as customer_id, client_id, and account_number in three different systems is a single entity to a human who knows the context. To a retrieval system, they may be three unrelated records. The model will not reconcile them unless you have.
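As a rough illustration of what "reconciling them yourself" can look like, here is a minimal Python sketch. The source system names (billing, crm, support), the customer_key field, and both helper functions are hypothetical. It also assumes the three key fields hold the same underlying identifier value; where they do not, you would need a crosswalk table instead of a simple rename.

```python
# Hypothetical mapping from each source system to the field it uses as the customer key.
KEY_ALIASES = {
    "billing": "customer_id",
    "crm": "client_id",
    "support": "account_number",
}


def to_canonical(source: str, record: dict) -> dict:
    """Return a copy of the record with its source-specific key renamed to 'customer_key'."""
    alias = KEY_ALIASES[source]
    if alias not in record:
        raise KeyError(f"{source} record is missing its key field {alias!r}")
    canonical = {k: v for k, v in record.items() if k != alias}
    # Normalise to a trimmed string so 42, "42" and " 42 " compare equal.
    canonical["customer_key"] = str(record[alias]).strip()
    canonical["source_system"] = source
    return canonical


def group_by_customer(records):
    """Group (source, record) pairs under one canonical customer key."""
    grouped = {}
    for source, record in records:
        row = to_canonical(source, record)
        grouped.setdefault(row["customer_key"], []).append(row)
    return grouped
```

Run before indexing, a step like this means the retrieval layer sees one customer with three views rather than three unrelated records.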

Provenance and lineage. Do you know where each record came from, and when it was last updated? In a reporting context, this matters for auditability. In an AI context, it matters for correctness — a model answering a question about current state using a record that was last updated eighteen months ago will give a confident, plausible, and wrong answer.
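One way to act on this is to check record age before anything reaches the model's context. The sketch below is an assumption-laden example, not a prescription: the updated_at field name, the 90-day threshold, and the split_by_freshness helper are all invented for illustration, and timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=90)  # assumed freshness threshold; tune per entity


def split_by_freshness(records, now=None, max_age=MAX_AGE):
    """Split records into (fresh, stale) using their 'updated_at' timestamp.

    Records with no timestamp are treated as stale, because their age is unknown.
    """
    now = now or datetime.now(timezone.utc)
    fresh, stale = [], []
    for record in records:
        updated_at = record.get("updated_at")
        if updated_at is not None and now - updated_at <= max_age:
            fresh.append(record)
        else:
            stale.append(record)
    return fresh, stale
```

What you do with the stale set is a design decision: drop it, or pass it through with its age stated explicitly so the answer can be qualified.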

Business rule documentation. This is the one that catches teams most off guard. Every data set carries implicit rules — the conditions under which one record supersedes another, the edge cases that were handled by a custom script someone wrote in 2019, the field that looks like a boolean but actually has five meaningful states. These rules exist. In most organisations, they live in the heads of two or three engineers. A model has no access to institutional memory that was never written down.

Completeness and coverage. Not just the presence of data — the right data. A ten-year record set with a three-year gap in it will produce a model that is confident in exactly the wrong places.
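Gaps like that are cheap to find before you build. A minimal sketch, assuming each record carries a date and that a silence of more than 31 days is suspicious for your data (both the threshold and the find_coverage_gaps name are assumptions):

```python
from datetime import date


def find_coverage_gaps(record_dates, max_gap_days=31):
    """Return (last_seen, next_seen) pairs where consecutive records are too far apart."""
    ordered = sorted(set(record_dates))
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if (later - earlier).days > max_gap_days:
            gaps.append((earlier, later))
    return gaps
```

The output is a list of date ranges you can write down as known limitations, or exclude from what the system is allowed to answer about.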

None of these are new problems. What is new is that AI systems surface them in unfamiliar ways — not as query failures or broken dashboards, but as subtly unreliable outputs that are difficult to trace and even harder to explain to stakeholders.

Three Failure Patterns

These aren’t hypothetical. Some version of each turns up reliably in production.

Fragmented Schemas. Take a retrieval system built over a legacy product catalogue. It works flawlessly in development, where the engineers know the quirks of the data well enough to accidentally sidestep them. When it goes live, real users ask questions that cross the boundary between two source systems that represent the same products differently. The retrieval returns results that are technically present but contextually useless. The fix isn’t a better prompt; it requires rebuilding the data model, which takes weeks.

Hidden Business Logic. An AI assistant hooked up to a CRM to answer account queries keeps giving customers confident, totally incorrect answers about their subscription status. The underlying data is fine, but account status is governed by a rule that combines three fields in a non-obvious way, a workaround written into the codebase five years ago that nobody thought to document. Because that rule lives in application code and an engineer’s head, not in the data layer, the model has no chance of getting it right.
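To make the idea concrete, here is what pulling a rule like that into the open might look like. The three fields (is_active, paid_through, cancel_requested) and the rule itself are invented for illustration; the point is only that the derived status is computed once, in a documented place, and handed to the model as a plain value.

```python
from datetime import date
from typing import Optional


def subscription_status(is_active: bool, paid_through: Optional[date],
                        cancel_requested: bool, today: date) -> str:
    """Derive the customer-facing status from three raw fields (hypothetical rule).

    - account not active                     -> "closed"
    - paid_through missing or in the past    -> "lapsed"
    - cancellation requested, still paid up  -> "cancelling"
    - otherwise                              -> "active"
    """
    if not is_active:
        return "closed"
    if paid_through is None or paid_through < today:
        return "lapsed"
    if cancel_requested:
        return "cancelling"
    return "active"
```

With something like this in the data layer, the assistant is given status = "cancelling" rather than being left to infer it from three raw columns.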

Gaps in Historical Coverage. Another common trap is building an AI assistant over historical ticket data to help new support agents. It handles common, recent issues well. But for anything involving older customers or legacy products that predate a major database migration, it either returns nothing or hallucinates. The model doesn’t know what it doesn’t know.

What Good Looks Like Before You Build

The goal is not perfect data — that bar does not exist, and waiting for it guarantees you never ship anything. The goal is data whose limitations are known, documented, and accounted for in your system’s design.

A canonical data model for the entities your AI will reason about. Not necessarily a full data warehouse. A clear, agreed definition of the core objects in your domain — customer, product, order, case, whatever is central to your use case — their attributes, and the relationships between them. This is the map the model navigates. Without it, you are asking a model to reason over terrain it cannot see clearly.
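A canonical model can start very small. The sketch below shows two entities and one relationship as Python dataclasses; every name in it (Customer, Order, customer_key, total_pence, orders_for) is an assumption chosen for the example, not a recommended schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class Customer:
    customer_key: str                      # the one agreed identifier for a customer
    name: str
    source_systems: Tuple[str, ...] = ()   # which systems contributed to this record


@dataclass(frozen=True)
class Order:
    order_key: str
    customer_key: str                      # refers to Customer.customer_key
    placed_on: date
    total_pence: int                       # integer minor units avoid float rounding


def orders_for(customer: Customer, orders: List[Order]) -> List[Order]:
    """Return the orders that belong to the given customer."""
    return [o for o in orders if o.customer_key == customer.customer_key]
```

The value is less in the code than in the agreement it forces: one name per concept, one identifier per entity, and relationships stated rather than implied.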

Observable pipelines. Logged, traceable, auditable. If a model produces an output you cannot explain, you need to be able to follow the chain back to the source — and that requires data engineering work that most teams treat as optional until it becomes urgent.
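At its simplest, following the chain back means recording which records fed each answer. A minimal sketch using Python's standard logging and json modules; the chunk fields (source, record_id, updated_at, text) and the build_context helper are hypothetical, and updated_at is assumed to already be a string.

```python
import json
import logging

logger = logging.getLogger("context_trace")


def build_context(question, chunks):
    """Join chunk text into one context string and log which sources fed it."""
    trace = [
        {
            "source": c["source"],
            "record_id": c["record_id"],
            "updated_at": c["updated_at"],
        }
        for c in chunks
    ]
    # One structured log line per question, so an output can be traced to its inputs.
    logger.info(json.dumps({"question": question, "sources": trace}))
    context = "\n\n".join(c["text"] for c in chunks)
    return context, trace
```

Returning the trace alongside the context also lets you store it with the model's answer, which is usually where the "why did it say that?" question starts.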

Documented semantics. The business rules, edge cases, and contextual knowledge that live outside your database schema but are essential for correct interpretation. This does not have to be a formal data dictionary — it can start as a shared document that the relevant engineers maintain. The requirement is that it exists outside anyone’s head and is accessible to the systems reasoning over your data.
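"Accessible to the systems reasoning over your data" can be as modest as a structured notes file rendered into the prompt. In the sketch below, the field name account.flag, its five states, and the render_notes helper are all invented for illustration.

```python
# Hypothetical field notes; the field and its states are invented for this example.
FIELD_NOTES = {
    "account.flag": {
        "meaning": "Account standing. Stored as an integer, not a true boolean.",
        "values": {
            0: "closed",
            1: "active",
            2: "suspended",
            3: "trial",
            4: "pending review",
        },
    },
}


def render_notes(fields):
    """Render the notes for the given fields as plain text for a prompt or a reviewer."""
    lines = []
    for name in fields:
        note = FIELD_NOTES.get(name)
        if note is None:
            # Surface missing documentation instead of silently skipping it.
            lines.append(f"{name}: UNDOCUMENTED")
            continue
        values = ", ".join(f"{k}={v}" for k, v in note["values"].items())
        lines.append(f"{name}: {note['meaning']} Values: {values}.")
    return "\n".join(lines)
```

Flagging undocumented fields explicitly doubles as a to-do list for the engineers who hold the missing context.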

A Practical Starting Point

If you want to understand your actual data readiness position, skip the maturity frameworks. Pick the single most important entity in your domain — the one thing your AI system will most often be asked to reason about — and ask:

  1. How many distinct representations of this entity exist across your data sources?
  2. What rules govern when one representation takes precedence over another?
  3. Where are those rules documented?
  4. When did you last verify that the documentation reflects reality?

The answers will tell you more about the work ahead than any audit checklist. If the answer to question three is “in Dave’s head” and the answer to question four is “we haven’t,” you have a clear starting point. The good news is that the work is tractable — it is mostly documentation and decisions, not rebuilding infrastructure.

The Competitive Advantage Nobody Talks About

The teams that ship AI systems that actually work in production are not, in most cases, the ones with the best models or the biggest infrastructure budgets. They are the ones who did the unglamorous data work first — who spent time before they started on the AI layer understanding what their data actually said, what it didn’t say, and what the rules were for interpreting the difference.

That work does not show up in a demo. It does not make a good conference talk. It is the kind of thing that feels like a delay at the time and looks like good engineering in retrospect.

The next post in this series looks at infrastructure and cost — where your AI actually runs, and what it costs when real users start using it.


If you’re at the stage where you know the data layer is the problem but you’re not sure where to start, get in touch — mapping this kind of complexity is exactly what we do.