Blue-Green Deployment: Ship Without Downtime
Blue-green deployment runs two identical production environments — one serving live traffic, one idle — and switches between them on each release. When the new version is verified on the idle environment, a router or load balancer flips all traffic in one operation. If something breaks, you flip back. The entire switchover takes seconds, not minutes.
This pattern predates containers, Kubernetes, and the cloud. Martin Fowler documented it years ago as a way to reduce deployment risk by keeping a hot standby that doubles as your next release target — applicable across every layer of the cloud stack. The mechanism is deliberately boring: two environments, one switch, instant rollback. What makes it valuable isn't cleverness — it's the operational simplicity of having a known-good environment you can always revert to.
# How blue-green deployment actually works
Traffic hits one environment (call it blue) while green sits idle. You deploy the next release to green, run smoke tests against it, then update the router to send all traffic to green. Blue becomes your rollback target and your next deployment destination.
The routing layer is the only moving part. In practice, this means updating a load balancer target group, changing a Kubernetes Service selector, or modifying a DNS record. No in-place upgrades, no pods restarting mid-request, no partial rollouts.
```yaml
# Kubernetes blue-green: two Deployments, one Service
# Step 1: Both versions running, Service points to blue
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
        - name: app
          image: registry.example.com/myapp:2.3.1
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
        - name: app
          image: registry.example.com/myapp:2.4.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
```

The Service selector determines which environment gets traffic:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue   # Change to "green" to switch traffic
  ports:
    - port: 80
      targetPort: 8080
```

One kubectl apply with version: green in the selector, and all traffic moves. One kubectl apply with version: blue, and you've rolled back. The entire mechanism is a label change.
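That label change can be scripted so the same command handles both switch and rollback. This is a sketch, assuming the Service name and label values from the manifests above; the kubectl call is skipped where no cluster is reachable:

```shell
#!/usr/bin/env bash
# Pure helper: given the active color, return the other one.
# Flipping twice returns to the original color, so this same
# script is also the rollback command.
other_color() { [ "$1" = "blue" ] && echo green || echo blue; }

if command -v kubectl >/dev/null 2>&1; then
  ACTIVE=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}')
  TARGET=$(other_color "$ACTIVE")
  # One patch moves all traffic to the other Deployment's pods.
  kubectl patch svc myapp \
    -p "{\"spec\":{\"selector\":{\"version\":\"$TARGET\"}}}"
  echo "traffic: $ACTIVE -> $TARGET"
fi
```

Because the helper is its own inverse, rollback is just running the script a second time.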
# Blue-green vs canary vs rolling: when each one fits
These three strategies solve the same problem — getting new code into production safely — but they make different trade-offs around speed, cost, and blast radius.
| Dimension | Blue-green | Canary | Rolling |
|---|---|---|---|
| Traffic switch | All-at-once | Gradual (2% → 25% → 100%) | Pod-by-pod replacement |
| Rollback speed | Seconds (flip the selector) | Seconds (route back to stable) | Minutes (redeploy old version) |
| Infrastructure cost | 2x (two full environments) | ~1.1x (small canary pool) | 1x (in-place replacement) |
| Blast radius on failure | 100% of users see the problem | Only canary subset affected | Grows incrementally |
| Database compatibility | Both versions hit same DB | Both versions hit same DB | Both versions hit same DB |
| Best for | APIs, stateless services, high-confidence releases | User-facing features, A/B validation | Microservices, frequent small changes |
| Complexity | Low (two deployments + selector) | Medium (traffic splitting rules) | Low (built into Kubernetes default) |
Blue-green is the right call when you want zero ambiguity about what's running. There's never a moment where new requests are split between two versions. Canary deployment is better when you need to validate with a subset of real users before full rollout — feature changes that affect UX, pricing logic, or recommendation engines. Rolling updates are what Kubernetes does by default, and they're fine for most microservice deployments where individual pod restarts don't affect user sessions.
The dirty secret about canary deployments: they require proper observability. If you can't measure error rates per deployment version in real time, a canary rollout is just a slow blue-green with extra steps. You need metrics that distinguish canary traffic from stable traffic, and you need automated rollback triggers. Most teams that think they're doing canary are actually doing rolling updates with a percentage knob.
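To make "automated rollback triggers" concrete, here is a minimal sketch of the kind of gate canary automation needs: compare error rates per version and decide whether to roll back. The thresholds are invented, and in practice the counters would come from a metrics system that labels traffic by deployment version — this is the shape of the check, not a real tool:

```shell
# Decide whether a canary should be rolled back, given error/request
# counters for the canary and stable versions (hypothetical numbers).
canary_verdict() {
  # args: canary_errors canary_requests stable_errors stable_requests
  awk -v ce="$1" -v cr="$2" -v se="$3" -v sr="$4" 'BEGIN {
    canary_rate = cr > 0 ? ce / cr : 0
    stable_rate = sr > 0 ? se / sr : 0
    # Roll back if the canary errors at more than 2x the stable rate
    # AND above an absolute 1% floor (both thresholds are made up).
    if (canary_rate > 0.01 && canary_rate > 2 * stable_rate)
      print "rollback"
    else
      print "promote"
  }'
}

canary_verdict 42 1000 5 10000   # 4.2% vs 0.05% → rollback
canary_verdict 3 1000 25 10000   # 0.3% vs 0.25% → promote
```

If you can't feed a check like this with per-version numbers, you don't have the observability canary requires.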
# Blue-green deployment on Kubernetes with Argo Rollouts
Native Kubernetes doesn't have a built-in blue-green primitive. The Service-selector approach shown above works, but it's manual. Argo Rollouts adds a Rollout resource that automates the full lifecycle: deploy the new version, run pre-promotion analysis, switch traffic, run post-promotion analysis, and scale down the old version.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 3
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: registry.example.com/myapp:2.4.0
          ports:
            - containerPort: 8080
  strategy:
    blueGreen:
      activeService: myapp-active
      previewService: myapp-preview
      autoPromotionEnabled: false
      scaleDownDelaySeconds: 60
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests
        args:
          - name: service-name
            value: myapp-preview
      postPromotionAnalysis:
        templates:
          - templateName: error-rate-check
        args:
          - name: service-name
            value: myapp-active
```

With autoPromotionEnabled: false, the rollout pauses after deploying the new version and passing pre-promotion analysis. Your team inspects the preview environment, runs manual QA if needed, then promotes with kubectl argo rollouts promote myapp. If pre-promotion analysis fails, the rollout aborts automatically — traffic never moves.
The scaleDownDelaySeconds field controls how long old pods stay alive after promotion. Set this to at least 30 seconds to allow iptables propagation across cluster nodes. On large clusters with 50+ nodes, 60 seconds is safer.
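Day to day, the lifecycle is driven through the Argo Rollouts kubectl plugin. A sketch of the promote-or-abort loop, wrapped in a helper so it degrades to a dry run where the plugin isn't installed (the wrapper is only for illustration):

```shell
# Run the real plugin when present; otherwise print what would run.
rollouts() {
  if kubectl argo rollouts version >/dev/null 2>&1; then
    kubectl argo rollouts "$@"
  else
    echo "would run: kubectl argo rollouts $*"
  fi
}

rollouts get rollout myapp   # inspect active/preview ReplicaSets and analysis runs
rollouts promote myapp       # move the active Service to the new ReplicaSet
rollouts abort myapp         # or: keep traffic on stable, scale the preview down
```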
On platforms like AZIN that deploy to GKE Autopilot, Argo Rollouts runs natively since GKE is full Kubernetes under the hood. The same manifests work whether you manage the cluster yourself or use a BYOC platform that abstracts the infrastructure.
# The database problem nobody warns you about
Blue-green deployment handles stateless services gracefully. Databases are where it gets hard.
Both environments share the same database. When you deploy a schema change to green, blue must still be able to read and write to that database — because blue is your rollback target. If green adds a NOT NULL column that blue doesn't know about, every write from blue fails the moment you need to roll back.
The fix is the expand-contract pattern. Split every breaking schema change into three phases:
Expand: Add the new column as nullable, or add the new table. Both application versions can operate against this schema. Deploy this migration independently, before the application change.
```sql
-- Phase 1: Expand (deploy before the app change)
ALTER TABLE orders ADD COLUMN tracking_number VARCHAR(64);
-- Column is nullable, so v2.3 (blue) ignores it, v2.4 (green) writes to it
```

Migrate: Deploy the new application version (green) that reads and writes the new column. Both versions coexist safely because the column is nullable. Run a backfill if needed.
Contract: After the old version is fully retired and you're confident in the new release, drop the old column or add the NOT NULL constraint. This is a separate migration, deployed in a later cycle.
```sql
-- Phase 3: Contract (deploy weeks later, after blue is retired)
UPDATE orders SET tracking_number = 'UNKNOWN' WHERE tracking_number IS NULL;
ALTER TABLE orders ALTER COLUMN tracking_number SET NOT NULL;
```

This adds a deployment cycle to every schema change. It's the price of instant rollback. Teams that skip it learn the hard way during their first production rollback, when INSERT statements start throwing constraint violations.
# Five failure modes that catch experienced teams
1. Infrastructure drift between environments. Blue and green run different OS patch levels, different TLS library versions, or different sidecar configurations. Tests pass on green, but green's OpenSSL version handles a cipher suite differently than blue did. Use infrastructure-as-code for everything, and rebuild both environments from the same source periodically.
2. Forgetting about long-lived connections. WebSocket connections, gRPC streams, and database connection pools don't respect load balancer switches. Clients connected to blue stay connected to blue even after you flip to green. You need connection draining with a timeout — typically 30-60 seconds — and clients that handle reconnection gracefully.
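On Kubernetes, the draining half of this is pod-level configuration. A sketch of the relevant container fields (the sleep-in-preStop approach assumes your image ships a sleep binary):

```yaml
# In the Deployment's pod template spec: give in-flight connections
# time to finish before the process receives SIGTERM.
spec:
  terminationGracePeriodSeconds: 60   # total budget before SIGKILL
  containers:
    - name: app
      image: registry.example.com/myapp:2.4.0
      lifecycle:
        preStop:
          exec:
            # Keep serving while endpoint updates propagate, then shut down.
            command: ["sleep", "30"]
```

The client side still needs reconnect logic; no amount of server-side draining saves a WebSocket client that never retries.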
3. Background jobs completing on the old environment. A Sidekiq worker on blue picks up a job right before the switch. It finishes processing after traffic moves to green, writing to the database in a state that green's code doesn't expect. Solution: drain job queues before switching, or make jobs idempotent so they can safely run on either version.
4. Shared state in caches. Blue writes a serialized object to Redis using format v1. Green reads it and tries to deserialize with format v2. The cache is shared, so both versions hit the same keys. Namespace your cache keys with the application version, or use cache formats that are backward-compatible.
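One way to do the namespacing, sketched as a helper (the key layout is invented for illustration):

```shell
# Version-namespaced cache keys: blue (2.3.1) and green (2.4.0) read and
# write disjoint key spaces, so neither sees the other's serialization format.
APP_VERSION=2.4.0
cache_key() { echo "myapp:${APP_VERSION}:$1"; }

cache_key user:42   # → myapp:2.4.0:user:42
```

The trade-off is a cold cache after every switch; backward-compatible serialization formats avoid that at the cost of stricter format discipline.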
5. Health check false positives. Green's health check returns 200, so you promote it. But the health check only verifies the app process is running — it doesn't check database connectivity, external API auth, or whether the message queue consumer is actually consuming. Write health checks that verify the dependency chain, not just the process.
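In Kubernetes terms, that usually means splitting liveness from readiness. A sketch of the container-level probe config, where the /readyz endpoint that walks the dependency chain is assumed to exist in your application:

```yaml
# Liveness: is the process alive? Readiness: can it actually serve?
livenessProbe:
  httpGet:
    path: /healthz   # process-level check only
    port: 8080
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz    # hypothetical endpoint: returns 503 unless DB,
    port: 8080       # queue consumer, and upstream auth all pass
  periodSeconds: 5
  failureThreshold: 3
```

Promotion decisions should key off the deep readiness check, not the shallow liveness one.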
# Is blue-green deployment worth the infrastructure cost?
Running two production environments doubles your compute spend during the deployment window. Whether that cost is justified depends on what downtime costs you.
For a B2B SaaS processing payments, 60 seconds of downtime during a rolling update might trigger SLA violations worth more than a month of duplicate infrastructure. For an internal dashboard that three people check once a day, blue-green is overkill — a rolling update with a 5-second readiness probe is fine.
The cost math changes significantly on Kubernetes. With pod-level autoscaling, the idle environment can scale to minimum replicas (or even zero with KEDA) between deployments. You're not paying for a full duplicate cluster 24/7 — you're paying for a few idle pods that spin up to full capacity only during the deployment window. On managed Kubernetes platforms, this makes blue-green viable even for smaller teams.
Preview environments reduce the cost argument further. If you already run a full-stack preview per pull request, you're already paying for temporary duplicate infrastructure. Blue-green is conceptually the same pattern applied to production instead of staging.
# Automating the full cycle
A production-grade blue-green pipeline needs more than a label swap. Here's what the automation looks like end-to-end:
```yaml
# .github/workflows/blue-green-deploy.yml
name: Blue-Green Deploy
on:
  push:
    branches: [main]
env:
  REGISTRY: registry.example.com
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image
        run: |
          docker build -t $REGISTRY/myapp:${{ github.sha }} .
          docker push $REGISTRY/myapp:${{ github.sha }}
      - name: Deploy to inactive environment
        run: |
          ACTIVE=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}')
          INACTIVE=$([[ "$ACTIVE" == "blue" ]] && echo "green" || echo "blue")
          kubectl set image deployment/app-$INACTIVE \
            app=$REGISTRY/myapp:${{ github.sha }}
          kubectl rollout status deployment/app-$INACTIVE --timeout=120s
      - name: Run smoke tests against inactive
        run: |
          INACTIVE_URL=$(kubectl get svc myapp-preview -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
          curl -sf "http://$INACTIVE_URL/healthz" || exit 1
          npm run test:smoke -- --base-url="http://$INACTIVE_URL"
      - name: Switch traffic
        run: |
          ACTIVE=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}')
          INACTIVE=$([[ "$ACTIVE" == "blue" ]] && echo "green" || echo "blue")
          kubectl patch svc myapp -p "{\"spec\":{\"selector\":{\"version\":\"$INACTIVE\"}}}"
      - name: Verify production
        run: |
          # Steps don't share shell variables, so recompute the rollback
          # target: after the switch, the old version is whichever color
          # is not currently live.
          LIVE=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}')
          PREVIOUS=$([[ "$LIVE" == "blue" ]] && echo "green" || echo "blue")
          sleep 10
          curl -sf "https://myapp.example.com/healthz" || {
            echo "Production health check failed, rolling back"
            kubectl patch svc myapp -p "{\"spec\":{\"selector\":{\"version\":\"$PREVIOUS\"}}}"
            exit 1
          }
```

This handles the happy path. For production use, add Slack notifications on switch and rollback, a manual approval gate before the traffic switch (GitHub Environments work well for this), metric comparison between pre- and post-switch error rates, and a scheduled job that scales down the inactive environment after a cooldown period.
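The approval gate can be wired with GitHub Environments. A sketch of the job-level configuration — the environment name is an assumption, and the required-reviewers rule is set under the repository's environment settings, not in the workflow file:

```yaml
# Split the switch into its own job, gated on an environment that has
# required reviewers configured. The job waits for approval before running.
switch:
  needs: deploy
  runs-on: ubuntu-latest
  environment: production   # approval rules attach to this environment
  steps:
    - name: Switch traffic
      run: |
        # Target color hardcoded for brevity; compute it as in the deploy job.
        kubectl patch svc myapp \
          -p '{"spec":{"selector":{"version":"green"}}}'
```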
If managing deployment pipelines and Kubernetes manifests isn't how you want to spend your time, BYOC platforms handle this infrastructure layer. AZIN deploys to GKE Autopilot in your own Google Cloud account, where Kubernetes-native deployment strategies — including blue-green — work out of the box. The Console manages the deployment lifecycle while you keep infrastructure ownership.
# When to skip blue-green entirely
Blue-green deployment is not a universal upgrade over other strategies. Skip it when:
Your services are stateful and tightly coupled to local disk. Databases, message brokers like Kafka, and distributed caches with persistence don't swap cleanly between environments. Use rolling updates with proper drain mechanisms instead.
You deploy more than 10 times a day. At that velocity, maintaining two synchronized environments creates more operational overhead than it saves. Feature flags with rolling deployments and progressive delivery give you fine-grained control without the environment duplication.
Your application is a monolith with 30-minute build times. Blue-green's value comes from fast switches. If building and deploying the inactive environment takes half an hour, your "instant rollback" has a 30-minute recovery time — which is worse than most rolling update strategies.
Your team doesn't have automated testing. Deploying to green and promoting without automated smoke tests is just deploying to production with extra steps. The pattern's safety depends entirely on validating the inactive environment before switching. Manual QA on every release defeats the speed advantage.
# Getting started with zero-downtime deployments
If you're on Kubernetes, start with the native Service-selector approach. Two Deployments, one Service, a label swap in your CI pipeline. Run it for a month. You'll quickly discover which failure modes apply to your architecture — and at what point Argo Rollouts, Flagger, or a managed platform becomes worth the added complexity.
If you're evaluating deployment platforms, look for native Kubernetes support. Platforms built on Kubernetes — like AZIN on GKE Autopilot or Porter on EKS — give you blue-green as a configuration option rather than a DIY project. Platforms like Railway and Render handle zero-downtime deploys through rolling updates, but don't expose blue-green as a distinct strategy.
The deployment strategy matters less than the discipline around it. Automated tests, backward-compatible database migrations, proper health checks, and a clear rollback procedure make any strategy work. Without those, even blue-green's instant rollback won't save you from a schema migration that corrupts data.