Challenge 33: Design a highly available multi-region application
90-120 min | Estimated cost: $20-40 | Exam Weight: 15-20%
Introduction
StreamFlix is a video streaming platform serving 50 million monthly active users across North America, Europe, and Asia-Pacific. The platform streams 4K video content, manages user profiles and watch history, processes real-time recommendations, and handles content licensing metadata. StreamFlix has positioned itself as the "always-on" alternative to competitors, promising users they will never experience a buffering screen or service unavailability.
The executive team has mandated a 99.99% composite SLA with less than 50ms video start time globally and the ability to survive a complete Azure region failure with less than 2 minutes of user-visible impact. The platform must be active in 3 regions simultaneously (East US 2, North Europe, Japan East), not in an active-passive configuration. Every region must serve production traffic at all times, and if any single region fails, the remaining two must absorb its traffic without degradation.
This is the Domain 3 capstone challenge. You will combine all high availability, backup, and disaster recovery concepts from Challenges 25-32 into a complete, production-grade multi-region architecture. You must calculate the composite SLA mathematically, prove it meets the 99.99% target, and demonstrate that every component has appropriate redundancy.
Exam skills covered
- Recommend a high availability solution for compute
- Recommend a high availability solution for relational data
- Recommend a high availability solution for semi-structured and unstructured data
- Recommend a recovery solution for Azure and hybrid workloads that meets recovery objectives
Design tasks
Part 1: Global traffic routing and edge layer
- Design the global entry point using Azure Front Door:
  - Configure 3 origin groups (East US 2, North Europe, Japan East)
  - Routing method: latency-based (users routed to nearest healthy region)
  - Health probes: HTTP on the /health endpoint, 10-second interval, 3 failures = unhealthy
  - Calculate failover detection time: probe interval x failure threshold = ?
- Design the CDN and caching strategy:
  - Video content: serve from Azure CDN (or Front Door caching rules) with 24-hour TTL
  - API responses: cache personalized data? (No - dynamic content bypasses cache)
  - Static assets (UI, thumbnails): 7-day cache with cache-busting versioned URLs
  - Calculate: what percentage of requests hit CDN cache vs. origin?
- Document the failover behavior when one region fails:
  - Time to detect: health probe interval x threshold
  - Time to reroute: Front Door propagation (near-instant, anycast)
  - User impact: requests in-flight to failed region fail, next request goes to healthy region
  - Total user-visible disruption: approximately 30-60 seconds
Part 2: Compute layer (per-region)
- Design the compute architecture within each region:
  - Web/API tier: Azure Kubernetes Service (AKS) or App Service (zone-redundant)
  - Recommendation engine: Container Apps with autoscale
  - Video transcoding: VMSS with spot instances (batch, not HA-critical)
- For each region, configure zone redundancy:
  - AKS with 3 availability zones, minimum 3 nodes (1 per zone)
  - Node autoscaler: min 3, max 12 (absorb traffic from a failed region)
  - Pod disruption budgets: minAvailable 66% (survive zone failure)
- Calculate the per-region compute capacity required to absorb a failed region's traffic:
  - Normal operation: each region handles 33% of global traffic
  - During regional failure: each surviving region handles 50%
  - Autoscaler headroom: each region must be able to scale to 150% of normal capacity within 2 minutes
  - Design the autoscale triggers and pre-warming strategy
Part 3: Data layer (multi-region)
- Design the data architecture for each data type:
| Data Type | Service | Regions | Consistency | Failover |
|---|---|---|---|---|
| User profiles & watch history | Cosmos DB (NoSQL) | 3, multi-write | Session | Automatic (99.999%) |
| Content catalog & licensing | Azure SQL Database | 3 (1 primary + 2 read) | Strong | Failover group |
| Video files (4K content) | Blob Storage + CDN | 3 (RA-GRS) | Eventual | CDN cache + secondary read |
| Session tokens | Azure Cache for Redis | 3 (Enterprise, active geo) | Eventual | Cross-region replication |
| Recommendations (ML model cache) | Redis or Cosmos DB | 3 | Eventual | Per-region rebuild |
- Configure Cosmos DB for the user profile and watch history workload:
  - Multi-region writes (all 3 regions write locally)
  - Session consistency (user sees their own writes immediately)
  - Partition key strategy: /userId (ensures user data is co-located)
  - Autoscale: 10,000 - 100,000 RU/s per region (traffic-dependent)
- Design the SQL Database topology for the content catalog:
  - Primary: East US 2 (Business Critical, zone-redundant, 16 vCores)
  - Failover group secondary: North Europe (automatic failover, 1-hour grace)
  - Active geo-replica: Japan East (read-only, manual failover)
  - Justify why content catalog uses SQL (relational licensing constraints, complex queries) vs. Cosmos DB
- Configure the video content storage for global delivery:
  - Primary storage: East US 2 (RA-GRS, replicated to paired region)
  - Secondary storage accounts in North Europe and Japan East for region-local content
  - Azure CDN with multiple origin groups for failover
  - Cache warming: pre-populate CDN cache for new releases before launch
Part 4: Composite SLA calculation
- Calculate the composite SLA for the complete architecture:
Per-region SLA (serial dependencies):
- Azure Front Door: 99.99%
- AKS (zone-redundant): 99.95%
- Cosmos DB (multi-region write): 99.999%
- Azure SQL (Business Critical, zone-redundant): 99.995%
- Azure Cache for Redis (Enterprise): 99.99%
- Storage (RA-GRS): 99.99%
Per-region composite = Front Door x AKS x Cosmos DB x SQL x Redis x Storage
Multi-region SLA (parallel, active-active in 3 regions):
- Multi-region availability = 1 - (1 - per-region)^3
- Perform the calculation:
  - Per-region: 0.9999 x 0.9995 x 0.99999 x 0.99995 x 0.9999 x 0.9999 = ?
  - Does this meet 99.99%? If not, what is the bottleneck?
  - Multi-region: 1 - (1 - per-region)^3 = ?
  - Does the multi-region architecture meet or exceed 99.99%?
- If the single-region composite falls below 99.99%, demonstrate how the active-active multi-region deployment recovers the target:
  - Even if per-region = 99.9%, multi-region = 1 - (0.001)^3 = 99.9999999%
  - The multi-region active-active pattern compensates for lower per-region SLAs
  - Document assumptions: Front Door must correctly detect and route around regional failures
Part 5: Failure testing and operations
- Design a chaos engineering approach to validate the architecture:
  - Zone failure test: Simulate an AZ failure, verify traffic redistributes to the surviving zones within the region
  - Region failure test: Disable one region's origin in Front Door, verify failover < 2 minutes
  - Data failure test: Simulate Cosmos DB region unavailability, verify writes continue in other regions
  - Cascading failure test: Simulate Redis failure causing increased DB load
- Create operational runbooks for:
  - Regional failover (automated via Front Door health probes)
  - Data consistency verification after a region recovery
  - Capacity validation (can 2 regions handle 100% of traffic?)
  - Post-incident review template
- Design the monitoring and observability strategy:
  - Azure Monitor with cross-region dashboard
  - Per-region health score (composite of all services in that region)
  - Alerting: alert when any region drops below healthy threshold
  - SLA tracking: monthly uptime calculation with automated reports
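The monthly uptime calculation in the last bullet can be sketched in a few lines. This is a minimal illustration, not a reporting pipeline: the function names and the 30-day month are assumptions, and the downtime figures would in practice come from your monitoring data.

```python
def downtime_budget_minutes(sla: float, days: int = 30) -> float:
    """Minutes of downtime allowed in the period for a given SLA fraction."""
    return (1 - sla) * days * 24 * 60


def monthly_uptime(total_minutes: float, downtime_minutes: float) -> float:
    """Measured availability as a fraction of the reporting period."""
    return 1 - downtime_minutes / total_minutes


TARGET_SLA = 0.9999           # the 99.99% target from the challenge
MINUTES_IN_MONTH = 30 * 24 * 60

budget = downtime_budget_minutes(TARGET_SLA)
print(f"Downtime budget at 99.99%: {budget:.2f} minutes per 30 days")

# Hypothetical month with a single 2-minute regional failover event.
measured = monthly_uptime(MINUTES_IN_MONTH, downtime_minutes=2.0)
print(f"Measured uptime: {measured:.4%}, target met: {measured >= TARGET_SLA}")
```

At 99.99% the budget is only a few minutes per month, so a single failover that uses the full 2-minute allowance consumes a large share of it. That is a useful number to put on the cross-region dashboard next to the per-region health scores.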
Success criteria
- ⬜ Azure Front Door configured with 3 origin groups and latency-based routing with health probes
- ⬜ Zone-redundant compute deployed in each region with autoscale to absorb regional failure
- ⬜ Cosmos DB multi-region writes configured with appropriate consistency and conflict resolution
- ⬜ SQL Database failover group and geo-replicas configured for content catalog
- ⬜ Composite SLA calculated mathematically and proven to meet 99.99% target
- ⬜ Chaos testing plan documented with specific failure scenarios and expected behavior
Hints
Hint 1: Composite SLA Math
Step-by-step calculation:
Per-region (serial): 0.9999 x 0.9995 x 0.99999 x 0.99995 x 0.9999 x 0.9999 = 0.99914 (approximately 99.914%)
This is BELOW 99.99% for a single region. Single-region deployment cannot meet the requirement.
Multi-region (parallel, 3 active regions):
- Per-region failure probability = 1 - 0.99914 = 0.00086
- All-regions-fail probability = 0.00086^3 = 0.000000000636
- Multi-region availability = 1 - 0.000000000636 = 99.9999999% (roughly nine nines)
The key insight: even though no single region meets 99.99%, three active regions together far exceed it. This is the fundamental value proposition of active-active multi-region architecture.
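However, this assumes that regions fail independently and that Front Door routes around failures perfectly. Front Door is a single global service in front of all three regions, so its own 99.99% SLA becomes the limiting factor: Effective SLA = Front Door SLA x Multi-region backend SLA = 0.9999 x ~1.0 = 99.99%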
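The arithmetic in this hint can be reproduced with a short script. It is a sketch under the same assumptions as the hint: the SLA figures are the ones listed in Part 4, regional failures are treated as independent, and Front Door is counted both inside the per-region chain and as the global front end, exactly as the hint does. The helper names are illustrative.

```python
from math import prod

# SLA figures as listed in Part 4 (fractions, not percentages).
REGION_SLAS = {
    "Azure Front Door": 0.9999,
    "AKS (zone-redundant)": 0.9995,
    "Cosmos DB (multi-region write)": 0.99999,
    "Azure SQL (Business Critical)": 0.99995,
    "Azure Cache for Redis (Enterprise)": 0.9999,
    "Storage (RA-GRS)": 0.9999,
}


def serial_sla(slas) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(slas)


def parallel_sla(per_region: float, regions: int) -> float:
    """Availability when any one of `regions` independent copies suffices."""
    return 1 - (1 - per_region) ** regions


per_region = serial_sla(REGION_SLAS.values())
multi_region = parallel_sla(per_region, regions=3)
effective = REGION_SLAS["Azure Front Door"] * multi_region

print(f"Per-region composite: {per_region:.5%}")
print(f"Three-region backend: {multi_region:.9%}")
print(f"Behind Front Door:    {effective:.4%}")
```

Changing a single entry in `REGION_SLAS` shows which component moves the per-region figure most; with these inputs, AKS at 99.95% contributes more than half of the per-region failure probability.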
Hint 2: Active-Active vs Active-Passive Multi-Region
Active-Active (StreamFlix requirement):
- All regions serve production traffic simultaneously
- Failover needs only detection and rerouting (traffic is already flowing to the other regions)
- Capacity must be pre-provisioned in all regions (higher cost)
- Data must be writable in all regions (multi-region writes)
- SLA formula: parallel (dramatically higher availability)
- More expensive but meets the < 2 minute recovery requirement
Active-Passive:
- One region serves traffic, others are standby
- Failover requires starting/scaling passive region (minutes to hours)
- Standby region costs less (minimal capacity until activated)
- Data only writable in primary region (simpler consistency)
- Cannot meet < 2 minute recovery for full workloads
- Less expensive for lower availability requirements
StreamFlix MUST use active-active to meet the < 2 minute recovery requirement because passive regions cannot scale to handle production traffic in under 2 minutes.
Hint 3: Front Door Failover Timing
Azure Front Door failover detection and routing:
- Health probe interval: configurable (5-255 seconds, default 30)
- Unhealthy threshold: configurable (typically 3 failures)
- Detection time = interval x threshold = 10s x 3 = 30 seconds (with recommended settings)
- Routing update: near-instant (anycast architecture, no DNS propagation)
Total failover time for StreamFlix:
- Detection: 30 seconds (health probe detects origin failure)
- Routing: < 1 second (Front Door removes unhealthy origin from rotation)
- In-flight requests: may fail (10-30 second timeout on client)
- User retry: next request succeeds via healthy origin
- Total user-visible impact: approximately 30-60 seconds (meets < 2 minute requirement)
Optimization: Set probe interval to 5 seconds with threshold of 3 = 15 second detection.
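The timing budget above can be checked with a small calculation. This sketch takes a conservative worst case, assuming a request issued just before detection completes and then waiting out the full client timeout. The 1-second routing update and the 30-second client timeout are the upper ends of the ranges in this hint, and the function names are illustrative.

```python
def detection_seconds(probe_interval_s: int, unhealthy_threshold: int) -> int:
    """Time for consecutive failed probes to mark an origin unhealthy."""
    return probe_interval_s * unhealthy_threshold


def worst_case_disruption_seconds(
    probe_interval_s: int,
    unhealthy_threshold: int,
    routing_update_s: int = 1,    # "< 1 second" in the hint, rounded up
    client_timeout_s: int = 30,   # upper end of the 10-30 second client timeout
) -> int:
    """Conservative upper bound: detection + routing update + client timeout."""
    detect = detection_seconds(probe_interval_s, unhealthy_threshold)
    return detect + routing_update_s + client_timeout_s


RECOVERY_BUDGET_S = 120  # the "< 2 minutes" requirement

for interval in (10, 5):
    detect = detection_seconds(interval, 3)
    total = worst_case_disruption_seconds(interval, 3)
    print(f"{interval}s probes x 3 failures: detect in {detect}s, "
          f"worst case about {total}s, within budget: {total <= RECOVERY_BUDGET_S}")
```

Both probe settings stay inside the 2-minute budget under these assumptions; the shorter interval mainly buys margin, at the cost of more probe traffic against each origin.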
Hint 4: AKS Zone-Redundancy and Capacity Planning
AKS zone-redundant configuration for StreamFlix:
az aks create \
--resource-group rg-streamflix-eastus2 \
--name aks-streamflix-eastus2 \
--node-count 6 \
--zones 1 2 3 \
--enable-cluster-autoscaler \
--min-count 6 \
--max-count 18 \
--node-vm-size Standard_D8s_v5
Capacity planning:
- Normal load per region: 6 nodes handle 33% of traffic (50M users / 3 regions)
- Zone failure: 4 nodes handle 33% (AKS redistributes pods to surviving nodes)
- Region failure: remaining 2 regions scale to 9-12 nodes each to handle 50% of traffic
- Autoscaler trigger: CPU > 60% or memory > 70% -> add nodes
- Scale-up time: ~2-3 minutes for new nodes to be ready (meets < 2 min only if pre-warmed)
Pre-warming strategy: keep min-count at 9 instead of 6 (pays for 50% more baseline capacity but guarantees immediate absorption of regional failure without waiting for autoscaler).
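The node counts in this hint follow from simple proportional scaling, sketched below. The sketch assumes traffic is split evenly across regions, load scales linearly with node count, and nodes are spread evenly across zones; the function names are illustrative.

```python
def nodes_to_absorb_failure(baseline_nodes: int, regions: int, failed: int = 1) -> int:
    """Nodes each surviving region needs once the failed regions' traffic is shared out."""
    survivors = regions - failed
    if survivors < 1:
        raise ValueError("at least one region must survive")
    # Ceiling division, kept in integers to avoid float rounding.
    return -(-baseline_nodes * regions // survivors)


def nodes_after_zone_loss(nodes: int, zones: int = 3) -> int:
    """Nodes left in a region when its most populated zone fails."""
    largest_zone = -(-nodes // zones)
    return nodes - largest_zone


baseline = 6  # --min-count from the az aks create example
needed = nodes_to_absorb_failure(baseline, regions=3)
extra_cost = needed / baseline - 1

print(f"Nodes per surviving region after a regional failure: {needed}")
print(f"Extra baseline cost if pre-warmed to that level: {extra_cost:.0%}")
print(f"Nodes left after a zone failure at min-count {baseline}: {nodes_after_zone_loss(baseline)}")
print(f"Nodes left after a zone failure at min-count {needed}: {nodes_after_zone_loss(needed)}")
```

With the pre-warmed minimum of 9, losing one zone still leaves 6 nodes, which matches the normal-load baseline, so a zone failure alone would not depend on the autoscaler either.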
Hint 5: Video Start Time < 50ms Globally
Achieving < 50ms video start time requires CDN caching to handle the vast majority of video requests:
- Azure CDN PoPs are within 10-30ms of most users globally
- First-byte latency from CDN cache: ~10-50ms (meets requirement)
- First-byte latency from origin (cache miss): 100-500ms (does NOT meet requirement)
- Strategy: ensure > 99% cache hit ratio for video segments
Cache architecture:
- Video content is segmented (HLS/DASH, 2-10 second chunks)
- First segment of popular content pre-cached globally
- Cache TTL: 24 hours minimum (content doesn't change)
- Cache warming: push new content to all CDN PoPs before release
- Origin shield: intermediate cache layer reduces origin load
For the < 50ms requirement to be met globally, the CDN is not optional - it's architecturally critical. Without CDN caching, cross-region latency alone would exceed 50ms for remote users.
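To see how sensitive start time is to the cache hit ratio, the following sketch computes a hit-ratio-weighted mean first-byte latency. The 30 ms and 300 ms inputs are assumed midpoints of the ranges quoted above, not measurements, and the function names are illustrative.

```python
def mean_first_byte_ms(hit_ratio: float, hit_ms: float, miss_ms: float) -> float:
    """Average first-byte latency weighted by the cache hit ratio."""
    return hit_ratio * hit_ms + (1 - hit_ratio) * miss_ms


def min_hit_ratio(target_ms: float, hit_ms: float, miss_ms: float) -> float:
    """Smallest hit ratio whose mean latency is at or below the target."""
    if not hit_ms < target_ms < miss_ms:
        raise ValueError("target must lie between hit and miss latency")
    return (miss_ms - target_ms) / (miss_ms - hit_ms)


HIT_MS = 30.0    # assumed midpoint of the 10-50 ms cached range
MISS_MS = 300.0  # assumed midpoint of the 100-500 ms origin range

for ratio in (0.90, 0.95, 0.99):
    mean = mean_first_byte_ms(ratio, HIT_MS, MISS_MS)
    print(f"hit ratio {ratio:.0%}: mean first byte {mean:.1f} ms")

needed = min_hit_ratio(50.0, HIT_MS, MISS_MS)
print(f"Hit ratio needed for a 50 ms mean: {needed:.1%}")
```

This covers the mean only. Every cache miss still takes the full origin latency, so a requirement stated for all users rather than on average depends on the miss rate itself, which is why the strategy above targets a hit ratio above 99%.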
Learning resources
- Azure Front Door routing architecture
- Multi-region web application - Azure Architecture Center
- Distribute data globally with Azure Cosmos DB
- Azure Well-Architected Framework - Reliability
- Composite SLA calculation
- AKS availability zones
Knowledge check
1. StreamFlix's per-region composite SLA is 99.914%. How does deploying active-active across 3 regions achieve 99.99%+ overall, and what component becomes the effective SLA ceiling?
With 3 active regions, the probability of ALL regions failing simultaneously is (1 - 0.99914)^3 = 0.00086^3, about 0.000000000636, giving effective availability of ~99.9999999%. However, Azure Front Door's own 99.99% SLA becomes the ceiling because it is a single global service through which all traffic flows - it cannot be made redundant within Azure. Front Door sits in series with the multi-region backend, so the effective composite SLA is the product: Front Door SLA x multi-region backend SLA = 0.9999 x ~1.0 = 99.99%. Front Door is the limiting factor, not the backend infrastructure. To exceed 99.99%, you would need a multi-CDN strategy (Front Door + Cloudflare/Akamai), which adds significant operational complexity.
2. Each StreamFlix region serves about 33% of global traffic during normal operations. When a region fails, the other two must handle 50% each. Why might autoscaling alone be insufficient to meet the 2-minute recovery target?
The AKS autoscaler takes 2-3 minutes to provision new nodes, which exceeds the 2-minute recovery budget. The autoscaler must: detect increased load (30-60 seconds), request new VMs from Azure (30-60 seconds), wait for VMs to join the cluster (30-60 seconds), and schedule pods onto new nodes (10-30 seconds). Total: roughly 2-4 minutes. Solution: over-provision baseline capacity so each region already runs the nodes needed to serve 50% of global traffic (min-count = 9 instead of 6). This "warm capacity" immediately absorbs the additional load from a failed region without waiting for the autoscaler. The trade-off is 50% higher baseline compute cost for guaranteed sub-2-minute failover.
3. StreamFlix uses Cosmos DB multi-region writes for user profiles. If a user updates their profile in East US 2 and immediately reads from Japan East, what do they see under Session consistency?
Under Session consistency with multi-region writes, the user sees their own update ONLY if they continue reading from the same region (East US 2). Session consistency guarantees are scoped to a single session token and a single region. If the user's next read is routed to Japan East (e.g., because they traveled or Front Door rerouted), they might see stale data until replication catches up (typically milliseconds to a few seconds). To guarantee read-your-own-writes globally, the application must pass the session token and route the read to the write region, or use Bounded Staleness with a tight window. In practice, this edge case rarely matters for profile reads.
4. The content catalog uses Azure SQL with a failover group (East US 2 -> North Europe) and a geo-replica (Japan East). If East US 2 fails, what happens in each region?
North Europe is automatically promoted to primary (via failover group, ~30 seconds), and Japan East's geo-replica breaks because its source (East US 2) is gone. After failover: North Europe handles all writes as the new primary. The failover group endpoint DNS updates automatically. Japan East's geo-replica must be re-created with North Europe as the new source. During the gap (minutes to hours), Japan East has stale read-only data from before the failure. Application design must handle this: Japan East can serve reads from its last good state while the geo-replica is re-established, or route writes through the failover group endpoint (higher latency from Japan East to North Europe). This is a known limitation of combining failover groups with additional geo-replicas.
Validation lab
Deploy a minimal proof-of-concept to validate your design:
- Create a resource group for this lab:
az group create --name rg-az305-challenge33 --location eastus
- Deploy a Traffic Manager profile with performance routing:
az network traffic-manager profile create \
--resource-group rg-az305-challenge33 \
--name tm-multiregion-lab \
--routing-method Performance \
--unique-dns-name tm-az305-challenge33-$RANDOM \
--protocol HTTP \
--port 80 \
--path "/"
- Add two external endpoints simulating multi-region origins:
az network traffic-manager endpoint create \
--resource-group rg-az305-challenge33 \
--profile-name tm-multiregion-lab \
--name endpoint-eastus \
--type externalEndpoints \
--target "www.microsoft.com" \
--endpoint-location eastus
az network traffic-manager endpoint create \
--resource-group rg-az305-challenge33 \
--profile-name tm-multiregion-lab \
--name endpoint-westeurope \
--type externalEndpoints \
--target "www.microsoft.com" \
--endpoint-location westeurope
- Verify the profile is active and endpoints are monitored:
az network traffic-manager profile show \
--resource-group rg-az305-challenge33 \
--name tm-multiregion-lab \
--query "{Status:profileStatus, Routing:trafficRoutingMethod, FQDN:dnsConfig.fqdn}" -o table
- Confirm both endpoints are online and responding to health checks:
az network traffic-manager endpoint list \
--resource-group rg-az305-challenge33 \
--profile-name tm-multiregion-lab \
--query "[].{Name:name, Status:endpointMonitorStatus, Location:endpointLocation}" -o table
This mini-deployment validates your design decisions with real Azure resources. It is optional but recommended.
Cleanup
az group delete --name rg-az305-challenge33 --yes --no-wait