Challenge 29: Design a Disaster Recovery Plan

Estimated Time and Cost

90-120 min | Estimated cost: $15-30 | Exam Weight: 15-20%

Introduction

ShopStream is a mid-market e-commerce platform serving 2 million active customers with $50M in annual revenue. Their platform runs as a classic 3-tier architecture on Azure: a web tier (frontend and CDN), an API tier (order processing, inventory management, payment gateway integration), and a database tier (Azure SQL for transactions, Redis for session state, Azure Storage for product images). The primary deployment is in East US 2.

After a 4-hour outage last Black Friday caused by a storage subsystem failure in their primary region, ShopStream lost approximately $800K in revenue and significant customer trust. The board has mandated a comprehensive disaster recovery plan with the following hard requirements: web tier must recover within 5 minutes (RTO), API tier within 10 minutes (RTO), and the database tier must have no more than 5 seconds of data loss (RPO). The DR budget is $3,000/month for secondary infrastructure in West US 2.

This challenge combines all backup and DR skills from the previous challenges into a complete, end-to-end disaster recovery plan using Azure Site Recovery, geo-replicated databases, and automated failover routing. You will design the replication strategy, create recovery plans with sequenced failover, and validate the design meets all requirements within budget.

Exam skills covered

Recommend a recovery solution for Azure and hybrid workloads that meets recovery objectives
Recommend a backup and recovery solution for compute
Recommend a backup and recovery solution for databases

Design tasks

Part 1: Azure Site Recovery for compute tiers

Design the Azure Site Recovery (ASR) configuration for the web and API tiers:
- Web tier: 4 VMs behind a load balancer (stateless, session stored in Redis)
- API tier: 6 VMs processing orders (stateless except for in-flight transactions)
- Determine replication frequency, recovery point retention, and crash-consistent vs. app-consistent snapshots
Configure ASR replication for a representative VM:

# Create ASR resources in the DR region
az group create --name rg-shopstream-dr --location westus2

# Note: ASR configuration typically uses the portal or PowerShell
# Conceptual configuration:
# Source: East US 2, target: West US 2
# Replication policy:
# - Recovery point retention: 24 hours
# - App-consistent snapshot frequency: 4 hours
# - Crash-consistent replication: continuous (RPO ~30 seconds)

Document the network configuration for the DR site:
- Virtual network in West US 2 (mirror of production)
- NSG rules replicated or pre-configured
- Public IP addresses for load balancers in DR region
- DNS strategy for cutover (Azure DNS with low TTL or Traffic Manager)

Part 2: Database tier DR strategy

Design the database replication strategy for each data store:

DR Strategy by Data Store

Fill in each cell for every data store before consulting the hints.

Data Store	DR Approach	RPO	RTO	Failover Mode
Azure SQL (transactions)	?	?	?	?
Redis Cache (sessions)	?	?	?	?
Azure Storage (images)	?	?	?	?

Configure an Azure SQL failover group for the transaction database:
- Automatic failover with appropriate grace period
- Read-only endpoint for the secondary (can serve read traffic during normal operations)
- Connection string strategy that survives failover without application changes
Design the Redis Cache DR strategy:
- Evaluate geo-replication for Azure Cache for Redis (Premium/Enterprise tier)
- Consider whether session loss is acceptable (users re-authenticate) vs. session replication cost
- Document the failover process for Redis
Configure geo-redundant storage (RA-GRS) for product images and design the failover:
- How does RA-GRS failover work (Microsoft-initiated vs. customer-initiated)?
- What is the RPO for storage geo-replication?
- How do you redirect reads to the secondary endpoint during an outage?
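
For the last question above, one option is to compute the RA-GRS secondary hostname, which is the account name with a `-secondary` suffix, and switch reads to it when the primary is unhealthy. The following is a minimal sketch, not a definitive implementation: the account name `stshopstreamimages` is hypothetical, and the health flag is assumed to come from your own monitoring. The secondary endpoint is read-only and may lag the primary.

```shell
# Derive the RA-GRS read-only secondary blob endpoint for a storage account.
# Azure appends "-secondary" to the account name in the hostname.
secondary_blob_endpoint() {
  # $1 = storage account name
  printf 'https://%s-secondary.blob.core.windows.net\n' "$1"
}

# Choose the endpoint to read from.
# $1 = storage account name, $2 = "healthy" or any other value
read_endpoint() {
  if [ "$2" = "healthy" ]; then
    printf 'https://%s.blob.core.windows.net\n' "$1"
  else
    secondary_blob_endpoint "$1"
  fi
}

# Hypothetical account name, primary reported as degraded
read_endpoint stshopstreamimages degraded
```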

Part 3: Recovery plan orchestration

Create a sequenced recovery plan that defines failover order:
- Group 1: Database tier (SQL failover group activates first)
- Group 2: API tier (VMs start after database is available)
- Group 3: Web tier (frontend VMs start after API is available)
- Group 4: Traffic routing (DNS/Front Door switches to DR region)
Add automation to the recovery plan:
- Pre-failover script: Verify database replication is current (< 5 seconds behind)
- Post-failover script: Update application configuration to point to DR database endpoints
- Post-failover script: Run health checks on each tier before proceeding to next group
- Post-failover script: Send notification to operations team with failover status
Design the traffic routing failover using Azure Front Door or Traffic Manager:
- Which service provides faster failover detection (health probes)?
- Configure health probe endpoints for each tier
- Define failover threshold (how many failed probes before switching?)
- Calculate total failover time: detection + DNS propagation + VM startup
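
The pre-failover replication check listed above can be sketched as a simple gate. This is an illustration under assumptions: the lag value is hard-coded here, whereas in practice it could be read from the `replication_lag_sec` column of `sys.dm_geo_replication_link_status` on the primary. The function name `lag_within_rpo` is hypothetical, and the comparison handles whole seconds only.

```shell
# Return success when the observed lag is no more than the RPO limit.
# $1 = observed replication lag in seconds, $2 = RPO limit in seconds
lag_within_rpo() {
  [ "$1" -le "$2" ]
}

# Hypothetical reading for illustration
observed_lag=3

if lag_within_rpo "$observed_lag" 5; then
  echo "lag ${observed_lag}s: proceed with failover"
else
  echo "lag ${observed_lag}s: exceeds RPO, escalate before forcing failover"
fi
```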

Part 4: Failback and DR testing

Design the failback procedure after the primary region recovers:
- Re-protect (reverse replication from DR back to primary)
- Data synchronization strategy during failback
- Planned failover back to primary (zero data loss)
- Handling the "split brain" scenario if both regions accept writes
Create a DR testing schedule and procedure:
- Test failover (non-disruptive, uses isolated network in DR)
- Planned failover (validates actual cutover with minimal downtime)
- Frequency: How often should each type of test be conducted?
- Success criteria: What metrics validate a successful DR test?
Calculate the monthly cost of the DR infrastructure and verify it fits within $3,000/month:
- ASR replication cost per VM
- DR region VMs (only running during failover, but disks pre-provisioned)
- SQL Database secondary (readable, so useful for read offload)
- Storage GRS replication cost
- Total monthly cost vs. $3K budget

Success criteria

⬜ ASR configured for web and API tier VMs with appropriate replication policy
⬜ SQL failover group configured with automatic failover and validated RPO meets 5-second requirement
⬜ Recovery plan created with correct sequencing (DB -> API -> Web -> Traffic)
⬜ Traffic routing failover designed with health probes and failover threshold defined
⬜ DR cost estimate documented and validated against $3K/month budget
⬜ Failback procedure and DR testing schedule documented

Hints

Hint 1: ASR Replication and RPO

Azure Site Recovery provides:

Continuous replication: Disk writes are continuously replicated to the target region
RPO: Typically 30 seconds to 2 minutes for VMs (depends on data change rate)
App-consistent snapshots: Every 1-12 hours (configurable) - captures application state
Crash-consistent snapshots: Every 5 minutes - captures disk state

For ShopStream's stateless web/API tiers, crash-consistent snapshots are sufficient because:

Sessions are in Redis (not on the VM)
In-flight transactions will be retried by the client
No local database state to protect

ASR cost: approximately $25/month per protected VM + storage for replica disks.

Hint 2: SQL Failover Group Configuration

Failover group key settings:

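Grace period: Minimum 1 hour. This is how long automatic failover waits after detecting a primary failure before switching when data loss is possible. Shorter = faster failover but higher risk of false positives. Because the minimum is 1 hour, a minutes-level RTO requires initiating the failover yourself (for example, from the recovery plan) rather than waiting for automatic failover.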
Read-write endpoint: fg-shopstream.database.windows.net (always points to primary)
Read-only endpoint: fg-shopstream.secondary.database.windows.net (always points to secondary)
RPO: Typically under 5 seconds for asynchronous geo-replication (varies based on transaction volume and network conditions)

az sql failover-group create \
  --resource-group rg-shopstream \
  --server sql-shopstream-primary \
  --partner-server sql-shopstream-dr \
  --name fg-shopstream \
  --failover-policy Automatic \
  --grace-period 1

Application connection string should use the failover group endpoint, NOT the individual server name.
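
To make the endpoint guidance concrete, here is a small sketch that builds connection strings from the failover group name rather than a server name. The database name `shopstream-orders` and the helper names are hypothetical, and authentication settings are deliberately omitted.

```shell
# Listener hostnames for a failover group.
# $1 = failover group name
fg_readwrite_host() { printf '%s.database.windows.net\n' "$1"; }
fg_readonly_host()  { printf '%s.secondary.database.windows.net\n' "$1"; }

# Assemble an ADO.NET-style connection string.
# $1 = failover group name, $2 = database name, $3 = "rw" or "ro"
fg_connection_string() {
  if [ "$3" = "ro" ]; then
    # Read-only listener: also declare read intent
    printf 'Server=tcp:%s,1433;Database=%s;ApplicationIntent=ReadOnly;Encrypt=True;\n' \
      "$(fg_readonly_host "$1")" "$2"
  else
    printf 'Server=tcp:%s,1433;Database=%s;Encrypt=True;\n' \
      "$(fg_readwrite_host "$1")" "$2"
  fi
}

# Hypothetical database name
fg_connection_string fg-shopstream shopstream-orders rw
```

Because the string names only the listener, it stays valid after a failover switches which server is primary.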

Hint 3: Recovery Plan Sequencing in ASR

ASR Recovery Plans allow you to define groups that fail over in sequence:

Create a Recovery Plan in the Recovery Services vault
Add VMs to numbered groups (Group 1 starts first)
Add pre/post actions (scripts, Azure Automation runbooks, manual steps)

Typical sequencing for 3-tier app:

Group 1: Database failover (run SQL failover-group failover as a pre-action)
Group 2: API tier VMs (configure to wait for database health check)
Group 3: Web tier VMs (configure to wait for API health check)
Post-action: Update Traffic Manager/Front Door to point to DR region

Each group completes before the next starts. Within a group, all VMs start in parallel.
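
The ordering logic can be illustrated outside ASR with a short script that runs each step and gates on a health check before moving on. This is a sketch under assumptions, not ASR's own mechanism: `tier_healthy` is a placeholder, the VM and routing steps are stand-in `echo` commands, the DR server is assumed to live in `rg-shopstream`, and commands are only printed unless `DRY_RUN=0` is set.

```shell
# Dry-run by default: print commands instead of executing them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "DRY-RUN: $*"
  else
    "$@"
  fi
}

# Placeholder health check; replace with a real probe of each tier.
tier_healthy() {
  # $1 = tier name
  return 0
}

# Run one step, then gate on its health check before the next step.
# $1 = tier name, remaining args = command to run
failover_group_step() {
  step_tier=$1
  shift
  run "$@" || return 1
  if tier_healthy "$step_tier"; then
    echo "group ok: $step_tier"
  else
    echo "group failed: $step_tier" >&2
    return 1
  fi
}

recover() {
  # Forced failover: the DR server becomes primary even if data may be lost
  failover_group_step database \
    az sql failover-group set-primary \
      --name fg-shopstream \
      --resource-group rg-shopstream \
      --server sql-shopstream-dr \
      --allow-data-loss &&
  failover_group_step api echo "start API tier VMs" &&
  failover_group_step web echo "start web tier VMs" &&
  failover_group_step traffic echo "switch routing to DR region"
}

recover
```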

Hint 4: DR Cost Estimation

Monthly cost breakdown for ShopStream DR ($3,000 budget):

ASR replication (10 VMs): 10 x $25 = $250/month
Replica managed disks (10 VMs, avg 256 GB each): 10 x 256 GB x $0.05 = $128/month
SQL Database secondary (General Purpose, 8 vCores): ~$800/month (but provides read offload value)
Redis geo-replication (Premium P1): ~$450/month (consider if sessions are expendable)
Storage GRS delta: ~$200/month (GRS costs ~2x LRS for 5 TB of images)
ASR network egress during replication: ~$50/month

Estimated total: ~$1,428/month without Redis geo-replication, ~$1,878 with it. Both fit within the $3K budget.

If Redis geo-replication is too expensive, accept session loss during failover (users log in again) and provision a Premium cache in the DR region without geo-replication, starting with an empty cache on failover.
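
Summing the line items from the breakdown with shell integer arithmetic gives a quick check against the budget. The variable names are illustrative, and the figures are the approximate estimates listed above, not quoted prices.

```shell
# Line items in whole dollars per month, taken from the breakdown above.
asr_replication=$((10 * 25))            # 10 VMs x $25
replica_disks=$((10 * 256 * 5 / 100))   # 10 x 256 GB x $0.05
sql_secondary=800
redis_geo=450
storage_grs=200
egress=50
budget=3000

without_redis=$((asr_replication + replica_disks + sql_secondary + storage_grs + egress))
with_redis=$((without_redis + redis_geo))

echo "Without Redis geo-replication: \$${without_redis}/month"
echo "With Redis geo-replication:    \$${with_redis}/month"
echo "Headroom with Redis:           \$$((budget - with_redis))/month"
```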

Hint 5: Front Door vs Traffic Manager for Failover

Azure Front Door:

Layer 7 (HTTP/HTTPS) load balancer with health probes
Probe interval: as low as 5 seconds
Failover detection: ~10-30 seconds (configurable probe frequency x threshold)
No DNS TTL dependency (anycast routing, so a routing change does not wait on client DNS caches)
Additional features: WAF, caching, SSL offload

Azure Traffic Manager:

DNS-based routing (returns IP of healthy backend)
Probe interval: 10 seconds (fast) or 30 seconds (normal)
Failover detection: 30-90 seconds + DNS TTL propagation (30-300 seconds)
Total failover time can be 1-5 minutes depending on DNS TTL settings
Simpler, cheaper, supports non-HTTP protocols

For ShopStream's 5-minute web tier RTO: Front Door is preferred (faster detection, no DNS propagation delay). Traffic Manager works if you set TTL to 30 seconds but adds risk of cached DNS entries.
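
The comparison can be turned into a rough calculation of detection time plus DNS wait plus VM startup. This is a simplified model with illustrative inputs: the threshold of 3 failed probes and the 180-second VM startup are assumptions, and real clients may cache DNS longer than the TTL.

```shell
# Detection time = probe interval x consecutive failed probes required.
# $1 = probe interval (s), $2 = failed probes required
detection_seconds() {
  echo $(( $1 * $2 ))
}

# Total = detection + DNS TTL wait + VM startup.
# $1 = probe interval, $2 = failed probes, $3 = DNS TTL, $4 = VM startup
total_failover_seconds() {
  echo $(( $(detection_seconds "$1" "$2") + $3 + $4 ))
}

web_rto=300

# Front Door: 5 s probes, no DNS TTL wait
front_door=$(total_failover_seconds 5 3 0 180)
# Traffic Manager: 10 s probes, 30 s TTL
traffic_manager=$(total_failover_seconds 10 3 30 180)

echo "Front Door: ${front_door}s, Traffic Manager: ${traffic_manager}s, RTO: ${web_rto}s"
```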

Learning resources

Knowledge check

1. A 3-tier application has RTO requirements of 5 min (web), 10 min (API), and 30 sec (database). Why must the database tier fail over FIRST in the recovery plan?

The API tier depends on the database - if API VMs start before the database is available, they will crash or return errors. Similarly, the web tier depends on the API tier. Recovery plans must respect dependency order: the lowest tier in the stack (database) must be available before upper tiers start. Even though the database has the strictest RTO (30 sec), starting it first ensures upper tiers can initialize their database connections successfully. ASR recovery plans enforce this through numbered groups that execute sequentially.

2. ShopStream uses Azure Front Door for traffic routing. During a failover, what happens to in-flight user requests that were being served by the primary region?

In-flight requests to the failed primary region will time out and fail. Azure Front Door health probes detect the backend failure within 10-30 seconds and stop routing NEW requests to the unhealthy origin. However, requests already in transit (received by the failed backend) will fail with timeout or connection errors. The client must retry, and the retry will be routed to the healthy DR origin. For e-commerce, this means some users may see an error page for 10-30 seconds during failover. Because the API tier is stateless, any DR instance can serve the retry, but the operations themselves must be idempotent so that a retried request does not create duplicate side effects such as a second order or charge.
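
A client-side retry loop is one way to ride through the failover window. The sketch below is generic: the `retry` helper is hypothetical, and the commented usage assumes an API that accepts an `Idempotency-Key` header at a made-up URL, which the source does not specify.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  retry_max=$1
  retry_delay=$2
  shift 2
  retry_attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$retry_attempt" -ge "$retry_max" ]; then
      return 1
    fi
    retry_attempt=$((retry_attempt + 1))
    sleep "$retry_delay"
  done
}

# Example (not executed here): reuse ONE key across all attempts so the
# server can deduplicate. URL and header name are hypothetical.
# key=$(uuidgen)
# retry 5 2 curl --fail --silent -X POST \
#   -H "Idempotency-Key: $key" https://shop.example.com/api/orders
```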

3. The DR budget is $3,000/month. What is the most significant cost component, and how can it provide value even during normal operations?

The SQL Database secondary (failover group) is the largest cost at ~$800/month but provides read-offload capability during normal operations. The secondary database in a failover group has a readable endpoint that can serve read-heavy queries (reporting, analytics, search) without impacting the primary. This effectively doubles read capacity at no additional cost beyond the DR investment. Similarly, Redis geo-replication secondary can serve reads for geographically distributed users. Converting DR spend into performance improvement justifies the cost to stakeholders.

4. After a successful failover to West US 2, the primary region (East US 2) recovers. What is the correct sequence to fail back without data loss?

Re-protect, synchronize, then planned failover. The sequence is: (1) Re-protect: enable reverse replication from West US 2 (current primary) back to East US 2. (2) Wait for initial synchronization to complete (full copy of changes made during the outage). (3) Verify replication is healthy and lag is minimal. (4) Execute a planned failover from West US 2 back to East US 2 (this briefly stops writes to ensure zero data loss during cutover). (5) Re-protect again to restore the original DR direction. Skipping re-protection risks data loss if East US 2 has stale data.
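
For the SQL portion of step (4), a planned failover is the same `set-primary` call without `--allow-data-loss`, issued against the server that should become primary again. The sketch below only prints the commands unless `DRY_RUN=0` is set. It assumes both servers are in `rg-shopstream`, and the VM re-protect and failback steps are left to ASR.

```shell
# Dry-run by default: print commands instead of executing them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "DRY-RUN: $*"
  else
    "$@"
  fi
}

failback_sql() {
  # Planned failover back to East US 2: omitting the data-loss flag makes
  # the service synchronize before switching roles.
  run az sql failover-group set-primary \
    --name fg-shopstream \
    --resource-group rg-shopstream \
    --server sql-shopstream-primary

  # Confirm the original server reports the Primary role again.
  run az sql failover-group show \
    --name fg-shopstream \
    --resource-group rg-shopstream \
    --server sql-shopstream-primary \
    --query replicationRole -o tsv
}

failback_sql
```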

Validation lab

Deploy a minimal proof-of-concept to validate your design:

Create a resource group for this lab:

az group create --name rg-az305-challenge29 --location eastus

Deploy a VM to use as the ASR replication source:

az vm create \
  --resource-group rg-az305-challenge29 \
  --name vm-asr-source \
  --image Ubuntu2204 \
  --size Standard_B1s \
  --zone 1 \
  --admin-username azureuser \
  --generate-ssh-keys

Create a Recovery Services vault and list its built-in backup policies:

az backup vault create \
  --resource-group rg-az305-challenge29 \
  --name vault-az305-challenge29 \
  --location eastus

az backup policy list \
  --resource-group rg-az305-challenge29 \
  --vault-name vault-az305-challenge29 \
  --query "[].name" -o table

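Enable backup on the VM to validate vault integration (if the VM was created with Trusted Launch, use the vault's EnhancedPolicy instead of DefaultPolicy):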

az backup protection enable-for-vm \
  --resource-group rg-az305-challenge29 \
  --vault-name vault-az305-challenge29 \
  --vm vm-asr-source \
  --policy-name DefaultPolicy

Verify the VM is registered for protection:

az backup item list \
  --resource-group rg-az305-challenge29 \
  --vault-name vault-az305-challenge29 \
  --query "[].{Name:name, Status:properties.protectionStatus}" -o table

Tip: This mini-deployment validates your design decisions with real Azure resources. It is optional but recommended.

Cleanup

az group delete --name rg-az305-challenge29 --yes --no-wait

