Shipping a production RAG pipeline with Azure AI Search and Foundry

A RAG prototype is one of the easiest AI demos to build — chunk some documents, embed them, throw them at Azure AI Search, wire up a chat completion call. It's also one of the easiest things to ship that looks production-ready and isn't. Here's what actually changed when I took a prototype through to a system handling real traffic.

Chunking strategy stops being an afterthought

In the demo, I split documents by character count and moved on. In production, chunking became the single highest-leverage thing to get right. Fixed-size chunks split tables mid-row, cut code blocks in half, and separated headings from the content they describe — all of which tank retrieval quality on exactly the queries that matter most.

What worked: structure-aware chunking that respects document boundaries (headings, list items, table rows, code fences), with a modest overlap (10–15% of chunk length) so context isn't lost at chunk edges. It's more engineering work upfront. It's also where most of our retrieval quality came from, far more than from swapping embedding models.
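To make the idea concrete, here is a minimal sketch of structure-aware chunking, assuming Markdown-like input. The function names, the `max_chars` budget, and the choice to overlap by one whole trailing block (rather than an exact percentage) are illustrative assumptions, not the pipeline described above.

```python
import re

FENCE = "`" * 3                      # code-fence marker in Markdown-like text
HEADING = re.compile(r"#{1,6} ")     # assumption: ATX-style headings

def split_blocks(text):
    """Split text into blocks; code fences stay whole, headings stay with their content."""
    blocks, buf, in_code = [], [], False

    def flush():
        if buf:
            blocks.append("\n".join(buf))
            buf.clear()

    def lone_heading():
        return len(buf) == 1 and HEADING.match(buf[0]) is not None

    for line in text.splitlines():
        if line.startswith(FENCE):
            if in_code:
                buf.append(line)
                flush()              # closing fence completes the code block
            else:
                if not lone_heading():
                    flush()          # end the paragraph before the code starts
                buf.append(line)
            in_code = not in_code
        elif in_code:
            buf.append(line)         # never split inside a code fence
        elif not line.strip():
            if not lone_heading():
                flush()              # a blank line ends a paragraph or table
        elif HEADING.match(line):
            flush()
            buf.append(line)         # keep the heading glued to what follows
        else:
            buf.append(line)
    flush()
    return blocks

def chunk(text, max_chars=800, overlap_blocks=1):
    """Pack whole blocks into chunks, repeating the last block(s) as overlap."""
    chunks, current = [], []
    for block in split_blocks(text):
        size = sum(len(b) for b in current) + len(block)
        if current and size > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap_blocks:] if overlap_blocks else []
        current.append(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The budget is soft: a block larger than `max_chars` becomes its own oversized chunk instead of being cut, which is the trade-off that keeps tables and code intact.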

Hybrid search beat pure vector search, consistently

Pure vector search is great at "find me something semantically similar." It's surprisingly bad at exact matches: product codes, error messages, acronyms, names. Azure AI Search's hybrid mode (keyword and vector queries in one request, with the two result lists fused and optionally passed through the semantic ranker) consistently outperformed vector-only retrieval in our evaluation, especially on the technical and reference-heavy content that made up most of our corpus. If you're running vector-only in production, this is the first thing I'd change.
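Why hybrid helps is easier to see with the fusion step written out. Azure AI Search documents Reciprocal Rank Fusion (RRF) as the way it merges keyword and vector results; the sketch below is a plain-Python illustration of that idea, not the service's implementation. The document ids are hypothetical, and `k=60` is a commonly used constant that I'm assuming here.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list, best first.

    Each list contributes 1 / (k + rank) per document, so a document that
    ranks well in both keyword and vector results rises to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for a query containing an exact error code.
keyword_hits = ["doc-err-4012", "doc-install", "doc-faq"]
vector_hits = ["doc-troubleshoot", "doc-err-4012", "doc-install"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

The exact-match document wins here because it places high in both lists, even though vector search alone ranked a merely similar document first.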

Re-ranking earns its cost

Initial retrieval returns the top-K chunks by similarity. A re-ranking pass — using a smaller, cheaper model to score relevance of the top 20–30 candidates and pick the best 5 — measurably improved answer quality, at a fraction of the cost of asking the main model to sift through more context. It's an extra hop in the pipeline. It's worth it.
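The shape of that extra hop can be sketched in a few lines. Everything here is illustrative: `Candidate`, `rerank`, and the toy `term_overlap` scorer are hypothetical names, and in a real pipeline the scorer would be a call to a cross-encoder or a small language model rather than word overlap.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    chunk_id: str
    text: str
    retrieval_score: float

def rerank(query: str,
           candidates: List[Candidate],
           score_fn: Callable[[str, str], float],
           pool_size: int = 30,
           keep: int = 5) -> List[Candidate]:
    """Re-score the top `pool_size` retrieved candidates and keep the best `keep`."""
    pool = sorted(candidates, key=lambda c: c.retrieval_score, reverse=True)[:pool_size]
    scored = [(score_fn(query, c.text), c) for c in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:keep]]

def term_overlap(query: str, text: str) -> float:
    """Stand-in scorer (assumption): fraction of query terms present in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0
```

Keeping the scorer as a plain function argument makes it cheap to swap models later and to test the plumbing without calling one.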

Citations are not optional in production

In the demo, answers were just text. In production, every claim in the response needs to trace back to a specific source chunk — both because users ask "where did you get that?" and because it's the fastest way to debug a wrong answer. Building citation tracking in from day one is dramatically easier than retrofitting it once the pipeline is load-bearing.
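One simple way to get that traceability, sketched under assumptions: number each retrieved chunk in the prompt, ask the model to cite with `[n]` markers, then map the markers back to chunk ids after generation. The `Chunk` fields and the marker format are illustrative choices, not a prescribed schema.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    source: str   # e.g. file name plus page or section
    text: str

def build_context(chunks):
    """Number each chunk so the model can cite it as [1], [2], ..."""
    return "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks, start=1))

def resolve_citations(answer, chunks):
    """Map [n] markers in the answer back to chunks.

    Returns (cited_chunks, invalid_markers). Invalid markers point at a
    chunk that was never in the prompt, which is worth logging on its own.
    """
    cited, invalid = [], []
    for n in sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)}):
        if 1 <= n <= len(chunks):
            cited.append(chunks[n - 1])
        else:
            invalid.append(n)
    return cited, invalid
```

Storing the resolved chunk ids alongside each answer is what makes "where did you get that?" a lookup instead of an investigation.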

Foundry's evaluation tooling closed the loop

The hardest production question for any RAG system is "did our last change make answers better or worse?" Microsoft Foundry's evaluation flows let us run a fixed question set against the pipeline on every change — scoring groundedness, relevance, and retrieval precision automatically — and catch regressions before they reached users. Without an evaluation harness, every change to chunking, prompts, or retrieval parameters is a guess. With one, it's a measurement.
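The gating logic itself is small, whatever tool produces the scores. This is a plain-Python sketch of the idea, not Foundry's API: the function names and the `tolerance` threshold are assumptions, and the metric names simply mirror the ones mentioned above.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d in relevant_ids) / len(top)

def find_regressions(baseline, current, tolerance=0.02):
    """Return metrics whose score dropped by more than `tolerance` versus baseline.

    Both arguments map metric name -> score in [0, 1]. A metric missing
    from `current` is treated as 0.0 so it cannot silently disappear.
    """
    return {
        name: (baseline[name], current.get(name, 0.0))
        for name in baseline
        if baseline[name] - current.get(name, 0.0) > tolerance
    }

# Made-up scores, purely to show the shape of the check.
baseline = {"groundedness": 0.90, "relevance": 0.85, "retrieval_precision": 0.70}
current = {"groundedness": 0.91, "relevance": 0.80, "retrieval_precision": 0.69}

regressions = find_regressions(baseline, current)
if regressions:
    print("Regressed metrics:", sorted(regressions))
```

Wiring a check like this into CI, so a non-empty result fails the build, is what turns the fixed question set into a gate rather than a report.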

What I'd tell someone starting today

Build the simplest possible version first, get it in front of real questions as fast as you can, and instrument it so you can see where it fails — retrieval missing the right chunk, or the model misreading chunks it did retrieve. Those are different problems with different fixes, and conflating them is the single biggest time-sink I see teams fall into.
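Separating the two failure modes can start as a very small piece of instrumentation. The sketch below assumes each logged question has hand-labeled "gold" chunk ids and a correctness judgment; the record layout and label names are hypothetical.

```python
from collections import Counter

def classify_failure(retrieved_ids, gold_chunk_ids, answer_correct):
    """Label one logged question as ok, a retrieval miss, or a generation error."""
    if answer_correct:
        return "ok"
    if not set(retrieved_ids) & set(gold_chunk_ids):
        return "retrieval_miss"      # the right chunk never reached the model
    return "generation_error"        # the right chunk was there and was misused

def summarize(log):
    """Count outcomes over a list of (retrieved_ids, gold_chunk_ids, answer_correct)."""
    return Counter(classify_failure(r, g, ok) for r, g, ok in log)

# Hypothetical log entries.
log = [
    (["c1", "c4"], ["c4"], True),
    (["c2", "c3"], ["c9"], False),
    (["c5", "c9"], ["c9"], False),
]
print(summarize(log))
```

A tally dominated by retrieval misses points at chunking and search settings; one dominated by generation errors points at prompts and context assembly.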

Closing thought

The gap between a RAG demo and a RAG product isn't the language model — it's everything around it: chunking, retrieval strategy, re-ranking, citations, and the evaluation harness that tells you whether your changes are actually improvements. Budget time for those, not just the integration call.