Every enterprise AI strategy deck I have seen over the past few years contains the same promise: “We will build a RAG-based knowledge assistant that lets employees query our internal documents in natural language.” The board nods. The budget gets approved. Six months later, the assistant gives wrong answers, misses critical information, and nobody trusts it.
The post-mortem always blames the model. “We need a better LLM.” “The AI is hallucinating.” “Maybe we should switch to a different vendor.”
The model is rarely the problem. The data pipeline is.
After more than 20 years of building enterprise data warehouses, I have watched this pattern repeat with every new technology wave. The tooling changes. The failure mode does not. Organizations invest in the shiny endpoint — the BI tool, the dashboard, the AI chatbot — and underinvest in the unglamorous work that makes it reliable: data ingestion, transformation, quality assurance, and governance. RAG is no exception.
What RAG Actually Is
The concept is simple. A large language model does not know anything about your company. When you ask it about your internal policies, your contracts, or your compliance rules, it either admits ignorance or invents a plausible-sounding answer. RAG fixes this by searching your actual documents before the model generates its response. The relevant passages get injected into the prompt, and the model answers based on your content rather than its training data.
The architecture has three layers: a data pipeline that ingests and prepares your documents, a retrieval system that finds the most relevant passages for a given question, and a language model that synthesizes a coherent answer. Everyone focuses on the third layer. The first layer is where projects succeed or fail.
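To make the three layers concrete, here is a deliberately toy sketch of the retrieve-then-generate shape in plain Python. It ranks passages by word overlap instead of vector similarity and stops short of the actual model call; the function name and prompt template are my own, not any vendor's API.

```python
def build_rag_prompt(question, corpus, k=2):
    """Toy retrieval by word overlap, then prompt assembly.
    Real systems rank by vector similarity; this only shows the shape:
    find relevant passages first, then inject them into the prompt."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    context = "\n---\n".join(ranked[:k])
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The language model then answers from the injected context rather than from its training data, which is the entire trick.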
The Data Pipeline Decides Everything
On Databricks, a RAG data pipeline follows the same medallion architecture you would use for any analytical workload. Raw documents (PDFs, Word files) land in Unity Catalog Volumes. A Spark Declarative Pipeline or Auto Loader ingests them incrementally. The native ai_parse_document function extracts structured text, handling OCR, multi-column layouts, tables, and even chart descriptions. The parsed output lands in Delta tables, governed by Unity Catalog.
So far, this is just data engineering. And the quality of this engineering determines the quality of every answer the system will ever produce.
Consider what happens when the parsing is poor. A compliance document with a two-column layout gets merged into a single stream of text, interleaving unrelated paragraphs. A table spanning two pages loses its header row. A chart showing risk exposure gets ignored entirely because the parser did not extract a text description. The downstream model receives corrupted input and produces corrupted outputs. No amount of prompt engineering or model upgrades will fix data that was corrupted at ingestion.
This is not a new problem. It is the same problem we have always had. In every data warehouse I have built, the most expensive defects originated at the source interface — not at the reporting layer.
Chunking Is a Data Modeling Decision
After parsing, the text must be split into smaller segments for retrieval. The industry calls this chunking. I call it a granularity decision because that is exactly what it is.
Every data warehouse professional has faced the question: at what level of detail should we store the data? Too coarse and you lose precision. Too fine and you lose context. The answer always depends on the business requirements and the access patterns. Chunking is the same tradeoff.
Split a document into chunks that are too large, and retrieval returns vaguely related passages that dilute the answer. Split it into chunks that are too small, and individual fragments lack the context needed to be useful. Split it at arbitrary boundaries, say every 500 characters regardless of where sentences end, and you destroy the logical structure of the content.
The right approach is to split at meaningful boundaries: paragraphs, sections, or topic changes. Databricks supports this through libraries like LangChain and custom Python UDFs that respect the document structure produced by ai_parse_document. Chunk overlap, which carries a portion of the previous chunk forward into the next, prevents information from being lost at the split point. More advanced methods use embedding models to detect where the topic changes and place boundaries accordingly.
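A minimal sketch of paragraph-aware chunking with overlap, in plain Python. The function name and the size limits are illustrative defaults, not a Databricks or LangChain API; a production splitter would also handle headings, tables, and oversized paragraphs.

```python
def chunk_text(text, max_chars=500, overlap_chars=100):
    """Split text at paragraph boundaries, carrying overlap forward.

    Paragraphs are accumulated until the next one would exceed
    max_chars; the tail of the finished chunk is prepended to the
    next chunk so context survives the split point."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            # carry the tail of the previous chunk forward for context
            current = current[-overlap_chars:] + "\n\n" + para
        else:
            current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks
```

Note that the split points land only between paragraphs, never mid-sentence, which is exactly what a fixed 500-character splitter fails to guarantee.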
These are data modeling decisions. They require understanding the content, the use case, and the access patterns. They require the same judgment we apply when designing fact tables, choosing partition keys, or defining aggregation levels. An AI engineer who has never built a data pipeline will default to a naive fixed-size split and wonder why retrieval quality is poor.
Governance Is Not Optional
In a data warehouse, access control is a first-class concern. Not every user should see every table. Not every report should include every data point. Regulatory requirements dictate who can access what, and the consequences of getting it wrong are severe.
RAG systems have exactly the same requirement, and most implementations ignore it entirely. If your knowledge base contains board-level strategy documents alongside general employee handbooks, every query hits the entire corpus. An intern asking about the vacation policy might receive chunks from a confidential M&A briefing — not because the model is malicious, but because the retrieval had no way to enforce access boundaries.
On Databricks, Unity Catalog provides the governance layer. The same permission model that controls access to tables and models also applies to the Volumes that store raw documents and the Delta tables that store processed chunks. This is not a feature that can be configured later. It is a design requirement from day one.
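In Unity Catalog this enforcement happens at the catalog and table level; what follows is only a pure-Python illustration of the principle that access boundaries must be applied before similarity search ever runs. The Chunk class, the clearance ordering, and the function name are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    classification: str  # e.g. "public", "internal", "confidential"

# Hypothetical clearance ordering for illustration, not a Unity Catalog API.
CLEARANCE = {"public": 0, "internal": 1, "confidential": 2}

def authorized_chunks(chunks, user_clearance):
    """Drop chunks above the user's clearance BEFORE any similarity
    search, so a confidential passage can never reach the prompt."""
    level = CLEARANCE[user_clearance]
    return [c for c in chunks if CLEARANCE[c.classification] <= level]
```

The key design point is ordering: filtering after retrieval still leaks information through rankings and error messages; filtering before retrieval means the restricted corpus simply does not exist for that user.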
Metadata Makes or Breaks Retrieval
A retrieval system without metadata is like a data warehouse without indexes. It works, technically, but it performs poorly at scale and produces unreliable results.
Every chunk in your knowledge base should carry metadata: which document it came from, which section, when it was last updated, which department owns it, and what classification level it holds. Without this metadata, the retrieval system relies entirely on semantic similarity — a blunt instrument that cannot distinguish between a current compliance policy and an outdated draft, or between an approved procedure and a rejected proposal.
Databricks recently introduced what they call an Instructed Retriever, which uses metadata to filter and constrain retrieval before similarity search even begins. Instead of finding the ten most similar chunks and hoping they are relevant, it enforces rules like “only retrieve from compliance documents published after January 2025.” This is predicate pushdown for unstructured data — a concept that will feel immediately familiar to anyone who has optimized partition elimination.
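The pattern is easy to show in miniature. This sketch applies a metadata predicate first and only then ranks the survivors by cosine similarity; the dictionary layout, the predicate, and the function names are my own illustration, not the Instructed Retriever API.

```python
import math
from datetime import date

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(chunks, query_vec, predicate, k=3):
    """Apply the metadata predicate first (the 'pushdown'), then
    rank only the surviving candidates by similarity."""
    candidates = [c for c in chunks if predicate(c["meta"])]
    return sorted(
        candidates,
        key=lambda c: cosine(c["vec"], query_vec),
        reverse=True,
    )[:k]
```

Exactly as with partition elimination, the win is that irrelevant data is never scored at all: an outdated draft with high semantic similarity cannot outrank a current policy, because the predicate removed it before the comparison.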
Metadata extraction, enrichment, and maintenance are pure data engineering work. It is also the work that most RAG tutorials skip entirely.
Monitoring Is Where Discipline Shows
Most RAG implementations end at deployment. The chatbot goes live, someone writes a message announcing it, and the team moves on to the next project. Nobody monitors whether the retrieval quality degrades as new documents are added. Nobody checks whether updated policies have been re-indexed. Nobody measures whether the answers are actually correct.
In the data warehouse world, this would be unthinkable. We build reconciliation checks, data quality dashboards, and alerting for pipeline failures. We track data lineage so that when a report looks wrong, we can trace the problem back to its source.
Databricks provides Lakehouse Monitoring for RAG applications, which can scan outputs for hallucinations, toxic content, and quality degradation. But the monitoring architecture itself (what to measure, what thresholds to set, how to handle drift) requires the same operational discipline we have been practicing for decades in data warehousing. The tooling is new. The mindset is not.
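One example of such a check, written as a plain-Python sketch rather than Lakehouse Monitoring itself: track the similarity score of each retrieval and alert when too many fall below a floor, a cheap proxy for index staleness or corpus drift. The function name and the thresholds are assumptions to be tuned per workload.

```python
def check_retrieval_health(similarity_scores, floor=0.5, max_low_fraction=0.2):
    """Alert when too many retrievals fall below a similarity floor.

    A rising fraction of low-similarity retrievals often means the
    index has gone stale or new documents were never re-indexed."""
    if not similarity_scores:
        return {"status": "alert", "reason": "no retrievals recorded"}
    low = sum(1 for s in similarity_scores if s < floor)
    fraction = low / len(similarity_scores)
    status = "alert" if fraction > max_low_fraction else "ok"
    return {"status": status, "low_fraction": round(fraction, 2)}
```

This is the reconciliation-check mindset applied to unstructured data: a simple threshold, checked continuously, catches degradation long before a user complains.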
The Real Skill Gap
The AI community has no shortage of talent for model selection, prompt engineering, and fine-tuning. What it lacks — acutely — is talent for the foundational data work that determines whether an AI application delivers business value or becomes an expensive embarrassment.
Consider the skills a production RAG system requires: reliable data ingestion, incremental loading, and error handling. Document parsing that preserves structure and context. Chunking strategies that balance precision and recall. Metadata extraction and enrichment. Access control and data governance. Pipeline monitoring and data quality assurance. Version management for evolving documents.
This is a data engineering skill set. It is also, not coincidentally, the skill set of anyone who has spent years building and maintaining enterprise data warehouses.
The model is a commodity. You can swap it with a single configuration change. The data pipeline is the product. It is what determines whether your RAG system gives a compliance officer a trustworthy answer grounded in current policy — or a confidently wrong answer based on a corrupted chunk from an outdated draft.
If you have spent your career ensuring that the right data reaches the right people in the right form, you already possess the most critical skill for building AI systems that actually work. The technology has changed. The discipline has not.
Roland Wenzlofsky is the founder of DWHPro, a vendor-neutral data warehousing consultancy with over 20 years of enterprise experience across Teradata, Snowflake, and Databricks. He is a Teradata Certified Master and the author of “Teradata Query Performance Tuning.”