The reason most RAG pilots stall is not the model. It is the architecture. A demo built on a vector store and a prompt template can answer a clean question with a clean source. It cannot survive a real corpus, real users, and real questions that ask "what changed last quarter and who approved it."
This post is what a production RAG chatbot actually looks like: the components, the data flow, the failure modes, and the decisions that separate a working system from a stalled pilot.
The shape in one diagram
`` Sources (docs, wikis, tickets, DBs, APIs) │ ▼ ingestion (parse, chunk, enrich, embed) Indexes │ keyword (BM25) │ vector (embeddings) │ metadata (tags, ACLs, freshness) ▼ Retrieval (hybrid + filter) │ ▼ Reranker (cross-encoder) │ ▼ Generator (LLM with grounded prompt + citations) │ ▼ Response with sources, confidence, fallbacks │ ▼ Evaluation loop (queries, judgments, regressions) ``
Each layer is a real decision. Skip one and the failure mode shows up on day three.
Ingestion: where most RAG systems are already broken
Ingestion is the part that gets the least design attention and causes the most production pain. The questions to answer here.
What is a chunk? Splitting a contract by 512 tokens will break clauses in half. Splitting a knowledge base article by heading will respect structure. Splitting a meeting transcript by speaker turn will respect context. There is no universal chunking. Pick a strategy per source type.
What enrichment runs at ingest time? Titles, headings, authorship, dates, tags, summaries, related entities. The retriever can only filter on what you indexed, so enrich aggressively at ingest rather than hoping the model figures it out at query time.
What is the freshness model? Some sources change daily (tickets, CRM records). Some change rarely (policies, contracts). Treat them differently. A daily reindex of a stable policy corpus is wasted compute. A weekly reindex of an active ticket queue is stale-by-design.
What permissions does each chunk carry? Authorization at retrieval time means every chunk needs an ACL tag at index time. Bolting this on later means a costly reindex.
If ingestion is wrong, no retriever or reranker can save the system. The chunks the model needs do not exist in the form the retriever can find them.
Retrieval: hybrid, not just vectors
A pure vector search is the most common production mistake. Embeddings are great at semantic similarity. They are bad at exact terms (model numbers, proper nouns, error codes, dollar amounts) and at recency.
Production retrieval is hybrid.
Keyword retrieval (BM25 or similar) handles exact-match queries. "Error code E47" should return chunks containing E47, full stop.
Vector retrieval (embeddings) handles semantic queries. "Why is the deployment failing" should return chunks about deployment issues even when they do not contain those exact words.
Metadata filtering runs before both. Restrict to the right tenant, the right time window, the right ACL, the right document type. Filtering before retrieval is dramatically cheaper than filtering after.
The retriever combines the keyword and vector lists, typically using reciprocal rank fusion or a learned merge. The output is a candidate set, not a final answer.
Reranking: the layer most teams skip
The candidate set from retrieval is usually 50 to 200 chunks. The model cannot reason over 200 chunks in a prompt. You need to pick the best 5 to 15.
A reranker is a smaller model that scores each candidate for relevance to the query. Cross-encoder rerankers (which see the query and the candidate together) outperform the bi-encoder embeddings used for retrieval. The cost is latency and compute per candidate, which is why reranking comes after retrieval has narrowed the set.
Teams that skip reranking usually compensate by stuffing more chunks into the prompt. That hurts quality. The model gets distracted by near-misses and the answer drifts. A reranker plus a tight top-K is more reliable than a wide pull and a long prompt.
Generation: grounded prompts and enforced citations
The generation step has three jobs: answer the question, ground every claim in the retrieved chunks, and cite the sources.
The prompt structure that holds up in production.
- The system message defines the persona, the scope, the refusal policy, and the citation format.
- The retrieved chunks are interleaved with explicit source identifiers the model can quote.
- The user query is restated cleanly.
- The output schema requires inline citations and a sources list.
The two things that fail without a tight prompt: the model invents facts not in the chunks (hallucination) and the model answers with no citation (untraceable). Both are caught at evaluation time if the eval set includes hostile queries that try to elicit each failure.
For high-stakes domains (regulated, financial, legal, medical), pair the prompt with a verifier step. The verifier reads the generated answer and the cited chunks and confirms each claim is supported. If a claim has no supporting chunk, the answer is rejected and rewritten or escalated.
Failure modes that kill pilots
We have seen the same failure modes again and again.
Retriever returns irrelevant chunks because the question is phrased differently from the corpus. Fix: query rewriting (let the model rewrite the user question into a retrieval-shaped query before searching) and synonym expansion at ingest.
Reranker is missing, so the prompt is full of near-misses. Fix: add the reranker. Cheaper than adding more model tokens.
Permissions leak. A user sees a chunk they should not. Fix: enforce ACLs at retrieval time using metadata filters, never after generation. Generation cannot un-leak a fact the retriever surfaced.
No evaluation loop. Quality drifts and nobody notices until users complain. Fix: a curated eval set of 100 to 500 representative questions with judged answers, run on every model and prompt change.
Stale corpus. The chatbot answers from last quarter's policy. Fix: ingestion freshness model plus visible timestamps on cited sources.
No fallback. The chatbot answers with low confidence and a wrong answer instead of saying "I do not have this." Fix: confidence threshold plus a graceful "no answer" path that routes to a human or a search interface.
When RAG is not the right shape
RAG is a retrieval-and-grounding pattern. It is not the answer to every AI question.
If the task is to act, not to answer, you want an agent with tools, not a RAG chatbot. If the task is arithmetic or aggregation over structured data, you want SQL or analytics tools, not a vector store. If the corpus is small and static, you can put it directly in the prompt and skip retrieval entirely.
The implementation pillar with the full architecture write-up is in RAG implementation. If you want to compare RAG against agent architectures, the deeper piece is AI agent vs chatbot.
Where to go next
CloudNSite builds RAG chatbots as systems, not prompts. The engagement covers ingestion design, hybrid retrieval, reranker selection, prompt engineering with enforced citations, evaluation harness, ACL propagation, and the observability needed to catch drift. We build and operate the system. The corpus stays on your side; the architecture comes with the engagement.
If the next step is scoping, the RAG implementation page is the right starting point.