bbabafemi
All posts
Security

AI red-teaming 101: stress-testing your Azure OpenAI app

Most teams ship an LLM feature, then think about misuse later. Here's a practical, low-effort red-teaming pass you can run before launch — and keep running after.

May 26, 2026 3 min readby Babafemi Bulugbe

"Red-teaming" sounds like something only a dedicated security function can do, with elaborate adversarial setups and specialist tooling. In practice, a useful first pass takes an afternoon, a spreadsheet, and a willingness to try to break your own product. Here's the version I run before any LLM feature ships.

Start with the failure modes that actually matter to you

Don't start with generic "jailbreak" prompts pulled from the internet — start with what would actually hurt you if it happened in production:

  • Would it embarrass you if the model said something offensive to a customer?
  • Would it cost you money if the model could be tricked into giving away a discount, a refund, or privileged information?
  • Would it create legal exposure if the model gave advice it isn't qualified to give (medical, legal, financial)?

Write these down first. They define what you're actually testing for — everything else is detail.

Run the four categories that catch most real issues

  1. Prompt injection — can content from a tool result, a document, or a user message override your system instructions? Test by embedding instructions inside retrieved documents ("ignore previous instructions and...") and seeing whether the model follows them.
  2. Data exfiltration — can a user extract your system prompt, internal documentation, or another user's data through clever phrasing? Try direct asks, role-play framing ("pretend you're a developer debugging this"), and incremental extraction across multiple turns.
  3. Scope creep — does the model answer questions it shouldn't, because nothing stops it? A support bot that happily gives investment advice has a scope problem, not a knowledge problem.
  4. Tool misuse — if your agent has tools (search, code execution, API calls), can a crafted input get it to call a tool with parameters you didn't intend?

Use the model against itself

A genuinely useful trick: ask a different model to generate adversarial prompts targeting your system prompt. Give it your system instructions and ask it to find ways to make the assistant violate them. This produces a much richer attack set, much faster, than a human brainstorming alone — and it's exactly the kind of task Azure AI Foundry's evaluation tooling is built to run repeatedly.

Build a regression set, not a one-off exercise

The first red-teaming pass finds the obvious issues. The value compounds when you turn what you found into a permanent test set that runs on every prompt change and every model version upgrade. I keep a spreadsheet of: prompt, expected behavior, actual behavior, severity, fix applied. New model version comes out? Re-run the set before switching.

This is the difference between "we tested it once before launch" and "we know within minutes if a change regresses our safety posture."

Document what you found — and what you didn't fix

Not every issue needs a code change. Some are acceptable risks, some need a guardrail, some need a process change (human review for certain categories). What matters is that the decision is explicit and written down — exactly the kind of artifact that AI governance frameworks like the EU AI Act and ISO 42001 expect you to produce anyway.

Closing thought

You don't need a dedicated red team to start. You need an afternoon, a clear list of what would actually hurt you, and the discipline to turn what you find into a test that runs forever. That alone puts you ahead of most teams shipping LLM features today.