Self-hosted GitHub Actions runners on Azure: when, and how

GitHub-hosted runners are great. Until they aren't. For most teams, they should be the default but a few specific scenarios make self-hosted runners worth the operational cost.

Here's when I switch, and the patterns I use to avoid the obvious traps.

When to switch (and when not to)

Switch to self-hosted when:

You need access to a private network. Your tests need to reach a private database, a peered VNet, or an on-prem service. GitHub-hosted runners are public-internet only.
You need bigger machines or GPUs. You can scale GitHub-hosted runners now, but for very large builds or GPU jobs, your own machine is cheaper.
You have very high CI volume. Past several hundred CI hours per week, self-hosted starts paying back.
You need specific OS / hardware not on the GitHub menu. ARM workloads, Windows-on-ARM, niche Linux distros.

Stay on GitHub-hosted when:

You're a small team running standard builds.
Your security policy doesn't require private networks.
You don't have an SRE or platform engineer to run the runners.

The default should be GitHub-hosted. Self-hosted is an operational commitment.

The architecture I deploy on Azure

[GitHub Actions Workflow]
        │  (queues a job)
        ▼
[GitHub Actions Runner Controller]
        │  (provisions a runner pod just-in-time)
        ▼
[AKS or Azure Container App with managed identity]
        │
        ├─ Private endpoint to ACR (pull build images)
        ├─ Access to peered VNets (your private resources)
        └─ OIDC federated credential → Azure subscriptions

Three principles:

Ephemeral runners. Each job runs on a fresh runner that's destroyed afterwards. No state sneaking between builds.
Just-in-time provisioning. Runners exist only when there's work. Scale to zero when idle.
Identity, not keys. Runners authenticate via managed identity to Azure resources, never with stored credentials.

Implementation paths

Option A: ARC (Actions Runner Controller) on AKS.

The mature, recommended path for organizations that already run AKS. ARC manages runner lifecycles via Kubernetes operators. Pros: scales well, mature, well-documented. Cons: you need an AKS cluster.

Option B: Azure Container Apps with the GitHub Actions runner image.

Lighter-weight than AKS. Container Apps has built-in autoscaling and serverless billing. Pros: less infrastructure. Cons: less mature for this use case, slightly more complex networking.

Option C: Plain Azure VM Scale Set with the runner agent.

Old-school but works. A VMSS with the runner installed via custom script. Cons: scaling is slower and you're managing VMs.

For most teams I default to ARC on AKS if they already have AKS, Container Apps if they don't.

The security checklist

Self-hosted runners are sensitive. They run untrusted code (your CI is essentially trusted, but a malicious PR could try to abuse it). Mitigations:

Don't run on public repositories without pull_request_target restrictions. A forked PR shouldn't get to execute on your runner.
Use ephemeral runners only. Fresh OS image every time.
Limit what the runner's identity can do. Federated credentials scoped per workflow / branch / environment.
Network isolation. Runners in their own subnet, with explicit NSG rules. They should reach only what they need.
Image hardening. Use a minimal runner image (e.g., from actions/runner-images repo) and keep it patched. Don't use community runner images you can't audit.
Audit the secrets in scope. Self-hosted runners can see workflow secrets. Make sure your secrets are scoped per-environment, not all dumped in repo secrets.

Cost — the part that surprises people

Self-hosted runners can be cheaper or more expensive depending on usage:

High volume + always-on: GitHub-hosted billing dominates. Self-hosted wins.
Bursty + low utilization: GitHub-hosted wins. Idle self-hosted infrastructure still costs money.
Specialized hardware: Self-hosted wins almost always.

Run the numbers before migrating. A spreadsheet with average jobs/day, average job duration, runner specs, and Azure compute pricing will tell you in 20 minutes whether the move pays back.

What I always set up on day one

Diagnostic settings on the runner infra → Log Analytics workspace.
Auto-shutdown for any non-AKS infrastructure during off-hours.
A dashboard showing: jobs queued, jobs running, runners idle, runners failed.
An alert if a runner has been "running" for >2 hours (likely a stuck job consuming budget).
A runbook for "all runners are dead", the on-call playbook.

What to watch for after you migrate

Stale caches. Self-hosted runners persist disks across jobs by default. Useful for node_modules caching, but watch out for stale state. Use actions/cache explicitly rather than relying on filesystem persistence.

Secret drift. When you move workflows from GitHub-hosted to self-hosted, the secrets they need may differ (managed identity replaces some, but not all). Audit each workflow.

Concurrency limits. AKS clusters have node-pool quotas. Container Apps have replica limits. Hit these and jobs queue without obvious diagnostics. Set up alerting on queue depth.

The honest summary

Self-hosted runners are an operational choice. The technical setup is a couple of weeks; the running of them is forever. If your team has SRE or platform muscle, fine. If not, GitHub-hosted is almost always the better answer.

When in doubt, stay on GitHub-hosted. The day you genuinely need self-hosted, you'll know.