Building a private RAG with Azure AI Search and Azure OpenAI
An end-to-end blueprint for a retrieval-augmented chat over your own documents — locked down behind Private Endpoints and identity, deployed entirely on Azure.
If you've built a RAG demo on a laptop, building one for actual production on Azure looks similar but has a different center of gravity. Identity, networking, and operational concerns dominate. The model itself is almost an afterthought.
Here's the architecture I deploy when a team asks me for a "private chat over our docs" build.
The components
[User] → [App Service / Container App with Entra ID auth]
│
├──→ [Azure OpenAI] (chat completion + embeddings)
│ Private Endpoint
│ Customer-managed key
│ Content filtering tuned to domain
│
├──→ [Azure AI Search] (vector + hybrid retrieval)
│ Private Endpoint
│ Indexer with managed identity
│
└──→ [Storage Account] (source documents)
Private Endpoint
Hierarchical namespace, soft delete, immutability
Everything is private. Everything authenticates with managed identities. There are zero API keys.
Step 1: Land the documents in Storage
Source documents go into a hierarchical namespace storage account. Two folders: inbox/ (where new docs land) and processed/ (where they go after chunking).
The container has:
- Soft delete at 30 days.
- Immutability for compliance-grade docs.
- Customer-managed key if your compliance team asks.
- Private Endpoint so the storage account isn't reachable from the public internet.
Step 2: Configure Azure AI Search
Create a Standard tier Search service (the Free tier doesn't support managed identity). Enable:
- Semantic ranking — improves relevance noticeably for natural language queries.
- Vector search — required for embeddings.
- Managed identity — both system-assigned and a user-assigned identity for the indexer.
Two indexes:
documents-chunk-index— one document per chunk. Fields:id,parentId,title,content,contentVector,metadata.documents-summary-index— one document per source file, for showing search results back to the user with full context.
Step 3: Indexing pipeline
Don't write your own chunking pipeline if you don't have to. Azure AI Search has a built-in indexer that:
- Pulls documents from your storage container.
- Cracks PDFs and Office docs.
- Splits into chunks with configurable size/overlap.
- Calls Azure OpenAI for embeddings.
- Writes chunks to your search index.
Configured via skillset JSON — a one-time setup, then it runs on a schedule.
For documents that need custom processing (your specific table extraction, your specific cleansing rules), wrap the steps in an Azure Function and chain it as a custom skill.
Step 4: The retrieval logic
This is the part most demos get wrong. The naive approach — embed the query, find top-k similar chunks, stuff into prompt — is fine but not great.
What I do instead:
- Hybrid search. Combine vector similarity with BM25 keyword matching. Hybrid scores beat pure vector on almost every benchmark I've tested.
- Semantic ranking. Re-rank the top 50 hybrid results down to top 5 using Azure AI Search's semantic ranker. Costs marginally more, dramatically improves precision.
- Diversification. If multiple chunks come from the same parent document, deduplicate to ensure context breadth.
- Citation tracking. Track which chunk produced which fact in the answer, so the UI can render footnote links.
Step 5: The prompt
Two things that matter:
- A system prompt that explicitly instructs grounding. "Only answer using the context below. If the context is insufficient, say so. Cite sources."
- A separate evaluation prompt for faithfulness. Run it asynchronously on a sample of production responses to detect drift.
Step 6: Identity and networking
Every component talks to every other component via:
- Private Endpoints — no traffic on the public internet.
- Managed identities — no keys.
- Custom RBAC roles — least privilege between components.
The web app's managed identity gets:
Cognitive Services OpenAI Useron the Azure OpenAI resource.Search Index Data Readeron the AI Search service.
The indexer's managed identity gets:
Storage Blob Data Readeron the source storage.Cognitive Services OpenAI Useron Azure OpenAI (for embeddings during indexing).Search Service Contributorto manage its own indexes.
Step 7: Observability
App Insights for the web app, Diagnostic Settings on Azure OpenAI and AI Search to a Log Analytics workspace. Custom telemetry on:
- Retrieval latency.
- Token usage per turn.
- Whether the response cited any retrieved chunks.
That last metric is your canary for "model is hallucinating without grounding."
What this isn't
- It's not the cheapest way to do RAG. If cost is your primary concern, you'll want a smaller search tier and a cheaper embedding model.
- It's not the fastest. Each turn does retrieval + re-rank + completion — typically 1.5–3 seconds end-to-end.
- It's not a one-day build. Allow 2–4 weeks for a polished v1, longer for compliance hardening.
What it is, is boring. It's the production-grade architecture that actually holds up to a security review.
If you want to ship a RAG product to enterprise customers in 2026, this is the floor.