bbabafemi
All posts
AI

Building a private RAG with Azure AI Search and Azure OpenAI

An end-to-end blueprint for a retrieval-augmented chat over your own documents — locked down behind Private Endpoints and identity, deployed entirely on Azure.

December 2, 2025 4 min readby Babafemi Bulugbe

If you've built a RAG demo on a laptop, building one for actual production on Azure looks similar but has a different center of gravity. Identity, networking, and operational concerns dominate. The model itself is almost an afterthought.

Here's the architecture I deploy when a team asks me for a "private chat over our docs" build.

The components

[User] → [App Service / Container App with Entra ID auth]
              │
              ├──→ [Azure OpenAI] (chat completion + embeddings)
              │       Private Endpoint
              │       Customer-managed key
              │       Content filtering tuned to domain
              │
              ├──→ [Azure AI Search] (vector + hybrid retrieval)
              │       Private Endpoint
              │       Indexer with managed identity
              │
              └──→ [Storage Account] (source documents)
                     Private Endpoint
                     Hierarchical namespace, soft delete, immutability

Everything is private. Everything authenticates with managed identities. There are zero API keys.

Step 1: Land the documents in Storage

Source documents go into a hierarchical namespace storage account. Two folders: inbox/ (where new docs land) and processed/ (where they go after chunking).

The container has:

  • Soft delete at 30 days.
  • Immutability for compliance-grade docs.
  • Customer-managed key if your compliance team asks.
  • Private Endpoint so the storage account isn't reachable from the public internet.

Create a Standard tier Search service (the Free tier doesn't support managed identity). Enable:

  • Semantic ranking — improves relevance noticeably for natural language queries.
  • Vector search — required for embeddings.
  • Managed identity — both system-assigned and a user-assigned identity for the indexer.

Two indexes:

  • documents-chunk-index — one document per chunk. Fields: id, parentId, title, content, contentVector, metadata.
  • documents-summary-index — one document per source file, for showing search results back to the user with full context.

Step 3: Indexing pipeline

Don't write your own chunking pipeline if you don't have to. Azure AI Search has a built-in indexer that:

  1. Pulls documents from your storage container.
  2. Cracks PDFs and Office docs.
  3. Splits into chunks with configurable size/overlap.
  4. Calls Azure OpenAI for embeddings.
  5. Writes chunks to your search index.

Configured via skillset JSON — a one-time setup, then it runs on a schedule.

For documents that need custom processing (your specific table extraction, your specific cleansing rules), wrap the steps in an Azure Function and chain it as a custom skill.

Step 4: The retrieval logic

This is the part most demos get wrong. The naive approach — embed the query, find top-k similar chunks, stuff into prompt — is fine but not great.

What I do instead:

  1. Hybrid search. Combine vector similarity with BM25 keyword matching. Hybrid scores beat pure vector on almost every benchmark I've tested.
  2. Semantic ranking. Re-rank the top 50 hybrid results down to top 5 using Azure AI Search's semantic ranker. Costs marginally more, dramatically improves precision.
  3. Diversification. If multiple chunks come from the same parent document, deduplicate to ensure context breadth.
  4. Citation tracking. Track which chunk produced which fact in the answer, so the UI can render footnote links.

Step 5: The prompt

Two things that matter:

  • A system prompt that explicitly instructs grounding. "Only answer using the context below. If the context is insufficient, say so. Cite sources."
  • A separate evaluation prompt for faithfulness. Run it asynchronously on a sample of production responses to detect drift.

Step 6: Identity and networking

Every component talks to every other component via:

  • Private Endpoints — no traffic on the public internet.
  • Managed identities — no keys.
  • Custom RBAC roles — least privilege between components.

The web app's managed identity gets:

  • Cognitive Services OpenAI User on the Azure OpenAI resource.
  • Search Index Data Reader on the AI Search service.

The indexer's managed identity gets:

  • Storage Blob Data Reader on the source storage.
  • Cognitive Services OpenAI User on Azure OpenAI (for embeddings during indexing).
  • Search Service Contributor to manage its own indexes.

Step 7: Observability

App Insights for the web app, Diagnostic Settings on Azure OpenAI and AI Search to a Log Analytics workspace. Custom telemetry on:

  • Retrieval latency.
  • Token usage per turn.
  • Whether the response cited any retrieved chunks.

That last metric is your canary for "model is hallucinating without grounding."

What this isn't

  • It's not the cheapest way to do RAG. If cost is your primary concern, you'll want a smaller search tier and a cheaper embedding model.
  • It's not the fastest. Each turn does retrieval + re-rank + completion — typically 1.5–3 seconds end-to-end.
  • It's not a one-day build. Allow 2–4 weeks for a polished v1, longer for compliance hardening.

What it is, is boring. It's the production-grade architecture that actually holds up to a security review.

If you want to ship a RAG product to enterprise customers in 2026, this is the floor.