Challenge 25: Design recovery objectives and strategy
60-90 min | Estimated cost: $0-5 | Exam Weight: 15-20%
Introduction
Mercy Regional Health System operates a network of hospitals and clinics serving 500,000 patients across three states. Their IT infrastructure supports everything from life-critical patient monitoring systems to routine administrative functions. After a recent ransomware incident at a neighboring health system caused a 72-hour outage of patient records, the board has mandated a comprehensive disaster recovery strategy.
The CIO has categorized all workloads into three tiers based on business impact analysis: Tier 1 (critical) includes the Electronic Health Records (EHR) system and patient monitoring - these must recover within 1 minute with zero data loss. Tier 2 (important) includes the appointment scheduling system, pharmacy management, and lab results portal - these can tolerate up to 1 hour of downtime and 15 minutes of data loss. Tier 3 (standard) includes HR/payroll, training portals, and internal communications - these can tolerate up to 24 hours of downtime and 4 hours of data loss.
The challenge is significant: Mercy has a DR budget of only $5,000/month to protect all three tiers. You must design a recovery strategy that appropriately allocates budget across tiers, selecting the right recovery pattern (hot/warm/cold standby) for each workload class while proving that the composite SLA meets availability requirements.
Exam skills covered
- Recommend a recovery solution for Azure and hybrid workloads that meets recovery objectives
Design tasks
Part 1: Business impact analysis and Recovery objectives
- For each workload tier, formally define the following recovery parameters:
  - Recovery Time Objective (RTO)
  - Recovery Point Objective (RPO)
  - Recovery Level Objective (RLO) - what level of functionality is acceptable during recovery
  - Maximum Tolerable Downtime (MTD) - the absolute maximum before business viability is threatened
- Calculate the required uptime percentage for each tier:
  - Tier 1: RTO of 1 minute implies what SLA percentage?
  - Tier 2: RTO of 1 hour implies what SLA percentage?
  - Tier 3: RTO of 24 hours implies what SLA percentage?
- Document the business impact of exceeding RTO for each tier (financial loss per hour, patient safety risk, regulatory penalties).
Part 2: Recovery strategy selection
- Map each workload tier to the appropriate recovery pattern:
  - Hot standby: Active-active or active-passive with real-time replication
  - Warm standby: Scaled-down replica that can be scaled up during failover
  - Cold standby: Infrastructure defined as code, deployed on-demand during disaster
  - Backup only: Regular backups with restore-from-scratch recovery
- Complete this decision matrix for each tier:
| | Tier 1 (Critical) | Tier 2 (Important) | Tier 3 (Standard) |
|---|---|---|---|
| Recovery pattern | ? | ? | ? |
| Monthly DR cost | ? | ? | ? |
| Data replication method | ? | ? | ? |
| Failover automation | ? | ? | ? |
| Testing frequency | ? | ? | ? |
- Justify why hot standby is required for Tier 1 but would be wasteful for Tier 3.
Part 3: SLA composition and budget allocation
- Calculate the composite SLA for a Tier 1 workload that depends on:
  - Azure Virtual Machines (99.99% with Availability Zones)
  - Azure SQL Database Business Critical (99.995%)
  - Azure Load Balancer (99.99%)
  - Azure ExpressRoute (99.95%)
  Use the formula: Composite SLA = SLA1 x SLA2 x SLA3 x SLA4
- Determine if the composite SLA meets the Tier 1 requirement. If not, design compensating measures (multi-region, redundant paths) to achieve the target.
- Allocate the $5,000/month DR budget across tiers. Consider that hot standby costs roughly 80-100% of production costs, warm standby costs 30-50%, and cold standby costs 5-10%.
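The cost percentages in the allocation task can be turned into a quick feasibility check. The sketch below is only an illustration: the per-tier production costs and the tier-to-pattern mapping are hypothetical placeholders, not figures from the scenario, and should be replaced with your own numbers.

```python
BUDGET = 5000  # USD per month, from the scenario

# DR cost as a fraction of production cost, from the task text
DR_COST_FRACTION = {"hot": (0.80, 1.00), "warm": (0.30, 0.50), "cold": (0.05, 0.10)}

# HYPOTHETICAL monthly production costs per tier (USD); replace with real figures
PRODUCTION_COST = {"Tier 1": 3500, "Tier 2": 3000, "Tier 3": 2000}

# One possible tier-to-pattern mapping, assumed here for illustration
PATTERN = {"Tier 1": "hot", "Tier 2": "warm", "Tier 3": "cold"}


def dr_cost_range(production_cost, pattern):
    """Return the (low, high) monthly DR cost estimate for a pattern."""
    low, high = DR_COST_FRACTION[pattern]
    return production_cost * low, production_cost * high


total_low = total_high = 0.0
for tier, cost in PRODUCTION_COST.items():
    low, high = dr_cost_range(cost, PATTERN[tier])
    total_low += low
    total_high += high
    print(f"{tier} ({PATTERN[tier]}): ${low:,.0f}-${high:,.0f}/month")

print(f"Total: ${total_low:,.0f}-${total_high:,.0f} against a ${BUDGET:,} budget")
print("Fits at low estimate:", total_low <= BUDGET)
print("Fits at high estimate:", total_high <= BUDGET)
```

With these placeholder costs the plan fits the budget at the low end of each range but not at the high end, which is the kind of gap the budget allocation needs to resolve.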
Part 4: Recovery strategy documentation
- Create a recovery strategy document that maps Azure services to each tier:
  - Tier 1: Which Azure services provide sub-minute RTO?
  - Tier 2: Which services provide 1-hour RTO at moderate cost?
  - Tier 3: Which services enable 24-hour recovery at minimal cost?
- Define the DR testing schedule and validation criteria for each tier.
Success criteria
- ⬜ RTO, RPO, RLO, and MTD defined for all three workload tiers with business justification
- ⬜ Appropriate recovery pattern (hot/warm/cold) selected for each tier with cost analysis
- ⬜ Composite SLA calculated correctly using multiplication formula
- ⬜ Budget allocation across tiers documented with cost-per-tier breakdown totaling $5K/month
- ⬜ Recovery strategy maps specific Azure services to each tier's requirements
- ⬜ DR testing schedule defined with appropriate frequency per tier
Hints
Hint 1: SLA Composition Formula
When services are chained in series (each depends on the previous), multiply their SLAs:
Composite SLA = 0.9999 x 0.99995 x 0.9999 x 0.9995 = 0.99925 (approximately 99.925%)
This means roughly 6.6 hours of downtime per year (about 0.075% of 8,760 hours). To improve this, add redundancy (parallel paths), where:
Availability with redundancy = 1 - (1 - SLA_A) x (1 - SLA_B)
For example, dual ExpressRoute circuits: 1 - (1 - 0.9995)^2 = 0.99999975
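The two formulas in this hint can be checked with a few lines of Python. This is a minimal sketch: the function names `series` and `parallel` are illustrative, the SLA figures are the ones listed in Part 3, and the parallel formula assumes the redundant circuits fail independently.

```python
from math import prod

HOURS_PER_YEAR = 8760  # 365-day year


def series(*slas):
    """Composite availability when every component must be up."""
    return prod(slas)


def parallel(*slas):
    """Availability when any one of the redundant components is enough."""
    return 1 - prod(1 - s for s in slas)


def downtime_hours_per_year(availability):
    """Expected yearly downtime implied by an availability fraction."""
    return (1 - availability) * HOURS_PER_YEAR


# VMs, SQL Business Critical, Load Balancer, single ExpressRoute circuit
single_path = series(0.9999, 0.99995, 0.9999, 0.9995)

# Same chain, with two ExpressRoute circuits treated as independent parallel paths
dual_circuit = parallel(0.9995, 0.9995)
redundant_path = series(0.9999, 0.99995, 0.9999, dual_circuit)

for label, value in [("single circuit", single_path), ("dual circuit", redundant_path)]:
    print(f"{label}: {value:.5%}, {downtime_hours_per_year(value):.2f} h/year")
```

Note that doubling only the ExpressRoute circuit lifts the chain to about 99.975%, which is still below 99.99%; the remaining series components then dominate, so redundancy is needed at more than one layer.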
Hint 2: Recovery Pattern Cost Estimates
Approximate monthly costs for a typical 3-tier application (web + app + DB):
- Hot standby (active-active): $3,000-4,000/month (full replica running)
- Warm standby (scaled-down replica): $800-1,500/month (minimal SKUs, can scale up)
- Cold standby (IaC + backups): $100-300/month (only storage for backups/templates)
- Backup only: $50-150/month (just backup vault storage)
Budget allocation suggestion: Tier 1 gets 60-70%, Tier 2 gets 20-30%, Tier 3 gets 5-10%.
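As a rough sanity check, the suggested split can be compared with the cost ranges above. The sketch below uses one hypothetical split (65/27/8) chosen from inside the suggested bands; the names `PLAN` and `allocate` are illustrative.

```python
BUDGET = 5000  # USD per month, from the scenario

# Monthly cost ranges per application from this hint (USD)
PATTERN_COST = {
    "hot": (3000, 4000),
    "warm": (800, 1500),
    "cold": (100, 300),
    "backup": (50, 150),
}

# HYPOTHETICAL split inside the suggested bands; percentages must total 100
PLAN = {
    "Tier 1": ("hot", 65),
    "Tier 2": ("warm", 27),
    "Tier 3": ("cold", 8),
}


def allocate(budget, plan):
    """Return {tier: (pattern, dollars, covers_low_estimate, covers_high_estimate)}."""
    if sum(pct for _, pct in plan.values()) != 100:
        raise ValueError("shares must total 100%")
    result = {}
    for tier, (pattern, pct) in plan.items():
        dollars = budget * pct // 100
        low, high = PATTERN_COST[pattern]
        result[tier] = (pattern, dollars, dollars >= low, dollars >= high)
    return result


allocation = allocate(BUDGET, PLAN)
for tier, (pattern, dollars, covers_low, covers_high) in allocation.items():
    print(f"{tier}: ${dollars} for {pattern} standby "
          f"(covers low estimate: {covers_low}, high estimate: {covers_high})")
```

Keep in mind that the cost ranges describe a single 3-tier application, while each Mercy tier contains several workloads, so a tier's share may have to stretch across more than one application.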
Hint 3: Azure Services by Recovery Speed
Sub-minute RTO (Tier 1):
- Azure SQL Database with failover groups (automatic failover)
- Availability Zones for VMs (zone-redundant)
- Azure Front Door (anycast HTTP routing with health probes) or Traffic Manager (DNS-based failover, bounded by DNS TTL)
- Cosmos DB with multi-region writes
1-hour RTO (Tier 2):
- Azure Site Recovery (15-minute RPO, minutes to failover)
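- Azure SQL active geo-replication with manual failover (geo-restore alone typically has an RPO of about 1 hour and an RTO of up to 12 hours, so it suits Tier 3 better)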
- VM redeployment from managed images
24-hour RTO (Tier 3):
- Azure Backup with restore
- Redeploy from ARM/Bicep templates
- Cold storage backups with manual restore
Hint 4: Uptime Percentage Calculation
To convert RTO to minimum uptime percentage:
- Minutes in a year: 525,600
- RTO 1 min: (525,600 - 1) / 525,600 = 99.99981% (but this assumes only ONE outage per year)
- More realistically, consider monthly SLA targets (based on a 30-day month):
  - 99.99% = 4.32 min downtime/month
  - 99.95% = 21.6 min downtime/month
  - 99.9% = 43.2 min downtime/month
  - 99% = 7.2 hours downtime/month
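The conversions in this hint can be reproduced with a short script. This is a sketch with illustrative function names; it assumes a 365-day year and a 30-day month, which matches the figures in this hint.

```python
MINUTES_PER_YEAR = 525_600        # 365 days
MINUTES_PER_MONTH = 30 * 24 * 60  # 30-day month


def allowed_downtime_minutes(sla_percent, period_minutes):
    """Downtime budget that an uptime percentage leaves in a period."""
    return (1 - sla_percent / 100) * period_minutes


def uptime_percent_for_single_outage(outage_minutes, period_minutes):
    """Uptime percentage if exactly one outage of the given length occurs."""
    return (period_minutes - outage_minutes) / period_minutes * 100


tier1_yearly = uptime_percent_for_single_outage(1, MINUTES_PER_YEAR)
print(f"One 1-minute outage per year: {tier1_yearly:.5f}%")

for sla in (99.99, 99.95, 99.9, 99.0):
    minutes = allowed_downtime_minutes(sla, MINUTES_PER_MONTH)
    print(f"{sla}% -> {minutes:.2f} min/month")
```

The same two functions can be reused for the Tier 2 and Tier 3 questions in Part 1 by passing 60 or 1,440 minutes as the outage length.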
Learning resources
- Business continuity and disaster recovery - Cloud Adoption Framework
- Azure Well-Architected Framework - Reliability pillar
- Backup and disaster recovery for Azure applications
- SLA summary for Azure services
- Composite SLA calculation
Knowledge check
1. A workload has a composite SLA of 99.9% but requires 99.99% availability. What architectural change most effectively closes this gap?
Add multi-region redundancy with automatic failover. When a single-region deployment cannot achieve the target SLA through component multiplication alone, deploying to a second region and using a global load balancer (Azure Front Door or Traffic Manager) creates parallel availability paths. The formula becomes: 1 - (1 - 0.999)^2 = 0.999999 (99.9999%), which exceeds the requirement. The trade-off is increased cost and complexity of data synchronization.
2. Why would you choose warm standby over hot standby for a Tier 2 workload with 1-hour RTO?
Warm standby costs 30-50% of production versus 80-100% for hot standby, and the 1-hour RTO provides sufficient time to scale up resources. Hot standby maintains a full-capacity replica running at all times, which is unnecessary when you have 60 minutes to detect failure, trigger failover, and scale up a minimal replica. Warm standby keeps a scaled-down version running (e.g., smaller VM SKUs, lower DTU databases) that can be scaled to production capacity within the RTO window.
3. A hospital's EHR system depends on four Azure services, each with 99.99% SLA. What is the composite SLA, and does it meet a 99.99% target?
The composite SLA is 0.9999^4 = 99.96%, which does NOT meet the 99.99% target. When multiple services are chained in series, the composite SLA is always lower than the weakest individual SLA. Each additional dependency reduces the overall availability. To meet 99.99% with four dependencies, you need either higher individual SLAs (e.g., Business Critical tier at 99.995%) or redundancy at one or more layers to compensate for the multiplicative effect.
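The arithmetic behind this answer, and the per-service availability a four-link chain would need, can be sketched as follows. The function names are illustrative, and the calculation assumes four identical, independent services in series.

```python
def composite(sla, dependencies):
    """Composite availability of identical services chained in series."""
    return sla ** dependencies


def required_per_service(target, dependencies):
    """Per-service availability needed for a chain to reach the target."""
    return target ** (1 / dependencies)


chain = composite(0.9999, 4)
needed = required_per_service(0.9999, 4)

print(f"Four services at 99.99%: {chain:.4%}")
print(f"Per-service SLA needed for a 99.99% chain: {needed:.5%}")
```

Each service would need roughly 99.9975% availability, which is above both the 99.99% and 99.995% figures used in this challenge, so raising individual SLAs alone is unlikely to close the gap without redundancy.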
4. What is the key difference between RTO and MTD (Maximum Tolerable Downtime)?
RTO is the target recovery time for IT systems; MTD is the absolute maximum time before the business itself is threatened. RTO should always be shorter than MTD to provide a safety margin. For example, a hospital's EHR system might have an RTO of 1 minute (target to restore service) but an MTD of 15 minutes (beyond which patient safety is at risk and regulatory violations occur). The gap between RTO and MTD is your safety buffer for unexpected recovery complications.
Validation lab
Deploy a minimal proof-of-concept to validate your design:
- Create a resource group for this lab:
az group create --name rg-az305-challenge25 --location eastus
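- Deploy two VMs in different availability zones to see how zone placement relates to the VM SLA:
az vm create \
  --resource-group rg-az305-challenge25 \
  --name vm-zone1 \
  --image Ubuntu2204 \
  --size Standard_B1s \
  --zone 1 \
  --admin-username azureuser \
  --generate-ssh-keys \
  --no-wait
az vm create \
  --resource-group rg-az305-challenge25 \
  --name vm-zone2 \
  --image Ubuntu2204 \
  --size Standard_B1s \
  --zone 2 \
  --admin-username azureuser \
  --generate-ssh-keys \
  --no-wait
- Wait for both deployments to finish (the create commands return immediately because of --no-wait):
az vm wait \
  --resource-group rg-az305-challenge25 \
  --name vm-zone1 \
  --created
az vm wait \
  --resource-group rg-az305-challenge25 \
  --name vm-zone2 \
  --created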
- Verify zone placement for each VM:
az vm show \
--resource-group rg-az305-challenge25 \
--name vm-zone1 \
--query "{name:name, zone:zones[0]}" -o table
az vm show \
--resource-group rg-az305-challenge25 \
--name vm-zone2 \
--query "{name:name, zone:zones[0]}" -o table
- List the availability zone assignment of every VM in the resource group:
az vm list \
  --resource-group rg-az305-challenge25 \
  --query "[].{Name:name, Zone:zones[0]}" -o table
- Count the distinct zones in use. A result of 2 confirms that the VMs run in separate zones (two or more instances across two or more zones qualify for the 99.99% VM SLA). JMESPath has no built-in unique function, so the de-duplication is done in the shell:
az vm list \
  --resource-group rg-az305-challenge25 \
  --query "[].zones[0]" -o tsv | sort -u | wc -l
This mini-deployment validates your design decisions with real Azure resources. It is optional but recommended.
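If you prefer to script the zone check, the JSON output of the list command can be inspected in Python. This is a minimal sketch: `distinct_zones` is an illustrative helper, and the sample below is hand-written in the shape of `az vm list -o json` output, trimmed to the two fields the check needs.

```python
import json


def distinct_zones(vm_list_json):
    """Return the set of availability zones used by the VMs in the JSON text."""
    zones = set()
    for vm in json.loads(vm_list_json):
        # "zones" is a list such as ["1"], or null for a VM without a zone
        for zone in vm.get("zones") or []:
            zones.add(zone)
    return zones


# Trimmed, hand-written sample (an assumption, not captured output)
sample = '''
[
  {"name": "vm-zone1", "zones": ["1"]},
  {"name": "vm-zone2", "zones": ["2"]}
]
'''

zones = distinct_zones(sample)
print(f"{len(zones)} distinct zone(s): {sorted(zones)}")
```

To run it against the lab, replace `sample` with the text produced by `az vm list --resource-group rg-az305-challenge25 -o json`.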
Cleanup
az group delete --name rg-az305-challenge25 --yes --no-wait