Challenge 30: Design High Availability for Compute

Estimated Time and Cost

60-90 min | Estimated cost: $10-20 | Exam Weight: 15-20%

Introduction

The Federal Benefits Portal (FedBenefits) is a government-operated web application that allows 10 million citizens to manage their retirement benefits, healthcare enrollment, and tax information. The portal is subject to strict uptime requirements mandated by the Government Accountability Office: 99.99% availability (maximum 52.6 minutes of downtime per year). Any outage longer than 5 minutes triggers a congressional reporting requirement and potential audit.

The current architecture runs on 8 VMs behind an Azure Load Balancer in a single availability set within East US 2. Last month, an unplanned Azure platform maintenance event took down the entire availability set for 12 minutes, breaching the SLA. The infrastructure team has been tasked with redesigning the compute layer to survive a full availability zone failure without any user-visible impact.

Some components of FedBenefits are legacy .NET Framework applications that cannot be easily containerized, while newer microservices run on .NET 8 and could leverage PaaS offerings. The architecture must accommodate both IaaS (VMs for legacy) and PaaS (App Service for modern) workloads while achieving the 99.99% composite SLA target. Budget allows for upgrading to zone-redundant infrastructure but not for a full multi-region active-active deployment.

Exam skills covered

Recommend a high availability solution for compute

Design tasks

Part 1: availability sets vs. availability zones

Analyze why the current availability set deployment failed to meet the 99.99% target:
- What SLA does an availability set provide? (99.95%)
- What scenarios can cause ALL VMs in an availability set to be impacted simultaneously?
- What is the difference between fault domains and update domains?
Design the migration from availability sets to availability zones:

Aspect	Availability Set	Availability Zones
SLA	99.95%	99.99%
Failure isolation	Rack-level (fault domain)	Datacenter-level (zone)
Update domain protection	Yes (staged rollouts)	Yes (zone-sequential updates)
Zone failure survival	No	Yes
Cost impact	None	None for inter-zone data transfer within a region
VM SKU requirements	Any	Must support AZ

Determine which Azure regions support availability zones and confirm East US 2 support. Document any VM SKU restrictions for zone deployments.

Part 2: zone-redundant load balancing

Design the load balancing architecture for zone-redundant VMs:
- Standard Load Balancer (zone-redundant frontend) distributing across 3 zones
- Health probe configuration (what endpoint, what interval, what threshold)
- Backend pool with VMs spread across Zone 1, 2, and 3
Deploy zone-redundant VMs with a Standard Load Balancer:

# Create zone-redundant load balancer
az network lb create \
  --resource-group rg-fedbenefits \
  --name lb-fedbenefits \
  --sku Standard \
  --frontend-ip-name frontend-ip \
  --backend-pool-name backend-pool \
  --location eastus2

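# Create VMs across zones (minimum 2 per zone for zone-level redundancy)
# Assumes ADMIN_PASSWORD is already set in the shell; Windows images need
# admin credentials, and az vm create would otherwise prompt for them.
for zone in 1 2 3; do
  for i in 1 2; do
    az vm create \
      --resource-group rg-fedbenefits \
      --name "vm-web-z${zone}-${i}" \
      --image Win2022Datacenter \
      --size Standard_D4s_v5 \
      --zone "$zone" \
      --admin-username azureuser \
      --admin-password "$ADMIN_PASSWORD" \
      --nsg "" \
      --public-ip-address "" \
      --no-wait
  done
done
# The VMs are not yet members of backend-pool; attach each NIC afterwards,
# for example with az network nic ip-config address-pool add.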

Configure health probes that detect application-level failures (not just TCP port availability):
- HTTP health probe to /health endpoint
- Probe interval: 5 seconds
- Unhealthy threshold: 2 consecutive failures
- Calculate: How quickly is a failed VM removed from rotation?

Part 3: Virtual Machine Scale Sets (VMSS)

Evaluate whether VMSS would be more appropriate than individual VMs for the web tier:

VMSS Deployment Comparison

Fill in each cell yourself before checking the hints. Think about your answer first!

	Individual VMs	VMSS Uniform	VMSS Flexible
Auto-scaling	?	?	?
Zone spreading	?	?	?
Rolling updates	?	?	?
Individual VM access	?	?	?
Load balancer integration	?	?	?
Use case fit for FedBenefits	?	?	?

Design a VMSS Flexible configuration for the legacy .NET Framework web tier:
- Spreading across 3 availability zones
- Minimum 6 instances (2 per zone)
- Maximum 18 instances (6 per zone) for peak periods (open enrollment)
- Autoscale rules based on CPU and request count
Configure the autoscale profile:
- Scale-out: Add 2 instances when average CPU > 70% for 5 minutes
- Scale-in: Remove 1 instance when average CPU < 30% for 10 minutes
- Schedule-based scaling: Pre-scale to 12 instances during open enrollment (January)
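
The scale-out and scale-in rules above can be sketched as a small decision function. This is only an illustration of the rule logic, not how Azure Autoscale is implemented: the helper name next_instance_count and the MIN_INSTANCES/MAX_INSTANCES variables are hypothetical, CPU is taken as a whole-number percentage, and the 5- and 10-minute evaluation windows are not modeled.

```shell
# Hypothetical sketch of the CPU rules: +2 above 70%, -1 below 30%,
# always clamped to the configured minimum and maximum.
MIN_INSTANCES=6
MAX_INSTANCES=18

next_instance_count() {
  cpu="$1"       # average CPU as an integer percentage (assumption)
  current="$2"   # current instance count
  if [ "$cpu" -gt 70 ]; then
    target=$(( current + 2 ))
  elif [ "$cpu" -lt 30 ]; then
    target=$(( current - 1 ))
  else
    target="$current"
  fi
  if [ "$target" -gt "$MAX_INSTANCES" ]; then target="$MAX_INSTANCES"; fi
  if [ "$target" -lt "$MIN_INSTANCES" ]; then target="$MIN_INSTANCES"; fi
  echo "$target"
}

echo "CPU 85% at 6 instances  -> $(next_instance_count 85 6)"
echo "CPU 85% at 17 instances -> $(next_instance_count 85 17)"
echo "CPU 20% at 6 instances  -> $(next_instance_count 20 6)"
```

The clamp is the part worth noticing: scale-in never drops below 2 instances per zone, and a scale-out near the ceiling adds fewer than 2 instances.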

Part 4: PaaS high availability (App Service)

Design the deployment for the .NET 8 microservices using Azure App Service:
- Which App Service plan tier supports availability zones? (Premium v3 or above)
- How does zone-redundant App Service work? (minimum 3 instances, spread across zones)
- What happens if a zone fails? (remaining instances handle traffic)
Configure a zone-redundant App Service plan:

# Zone-redundant App Service requires Premium v3 and minimum 3 instances
az appservice plan create \
  --resource-group rg-fedbenefits \
  --name asp-fedbenefits \
  --location eastus2 \
  --sku P1V3 \
  --number-of-workers 3 \
  --zone-redundant

Compare the composite SLA of the two approaches:
- Legacy tier: Zone-redundant VMs (99.99%) + Standard LB (99.99%) = ?
- Modern tier: Zone-redundant App Service (99.99%) = ?
- Combined application SLA (both tiers must be available): ?

Success criteria

⬜ Availability Zones selected over Availability Sets with documented justification
⬜ Zone-redundant Standard Load Balancer configured with health probes
⬜ VMSS Flexible or zone-spread VMs deployed across 3 availability zones
⬜ Autoscaling configured with CPU-based and schedule-based rules
⬜ Zone-redundant App Service plan deployed for PaaS workloads
⬜ Composite SLA calculated and validated against 99.99% target

Hints

Hint 1: Availability Zones SLA Details

Availability Zone SLA guarantees for VMs:

Two or more VMs across 2+ zones: 99.99% SLA (52.6 min/year downtime)
Two or more VMs in an availability set: 99.95% SLA (4.38 hours/year)
Single VM with Premium SSD: 99.9% SLA (8.76 hours/year)

The 99.99% SLA means Azure guarantees connectivity to at least one VM instance across zones 99.99% of the time. Key requirement: you need at least 2 VMs in at least 2 different zones.

For FedBenefits: Deploy at least 2 VMs per zone (6 minimum total) to maintain service even if one VM in any zone fails AND an entire zone fails simultaneously.
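
The downtime figures quoted for each SLA tier follow from simple arithmetic. A minimal sketch, assuming a 365-day year (525,600 minutes); the function name downtime_minutes_per_year is illustrative only:

```shell
# Convert an SLA percentage into the allowed downtime in minutes per year.
# Assumes a 365-day year, which matches the figures quoted in this hint.
downtime_minutes_per_year() {
  awk -v sla="$1" 'BEGIN { printf "%.1f\n", (100 - sla) / 100 * 365 * 24 * 60 }'
}

for sla in 99.9 99.95 99.99; do
  echo "SLA $sla percent -> $(downtime_minutes_per_year "$sla") minutes/year"
done
```

Dividing the results by 60 gives the hour figures used for the 99.9% and 99.95% tiers.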

Hint 2: Standard vs Basic Load Balancer

Only Standard Load Balancer supports availability zones:

Standard LB: Zone-redundant frontend, cross-zone backend pools, 99.99% SLA
Basic LB: No zone support, no SLA, being retired

Standard LB with zone-redundant frontend:

Frontend IP survives any single zone failure
Backend pool can contain VMs from any/all zones
Health probes route traffic only to healthy instances in healthy zones
No cross-zone data transfer charges within the same region

# Verify zone-redundant frontend
az network lb frontend-ip show \
  --resource-group rg-fedbenefits \
  --lb-name lb-fedbenefits \
  --name frontend-ip \
  --query zones

Hint 3: VMSS Flexible vs Uniform

For FedBenefits legacy .NET Framework apps:

VMSS Uniform: All VMs are identical, limited individual VM management, no support for attaching existing VMs. Best for truly identical stateless workloads.
VMSS Flexible: Supports mixed VM sizes, individual VM access via SSH/RDP, can attach existing VMs, supports availability zones. Best for legacy workloads migrating from individual VMs.

VMSS Flexible is recommended for FedBenefits because:

Legacy apps may need individual VM troubleshooting (RDP access)
Mixed VM sizes allow cost optimization (smaller VMs for baseline, larger for burst)
Supports the same deployment patterns as individual VMs but adds autoscaling
Zone spreading is automatic (balances across configured zones)

Hint 4: Zone-Redundant App Service Requirements

Zone-redundant App Service requirements:

Minimum plan tier: Premium v3 (P1v3 or higher) or Isolated v2
Minimum instance count: 3 (one per zone)
Must be configured at plan creation time (cannot enable on existing plan)
Supported regions: Most regions with availability zones
Zone spreading is automatic and not configurable (Azure distributes evenly)

Cost impact: You pay for minimum 3 instances at all times (no scaling below 3). At P1v3 pricing (~$140/month per instance), minimum cost is ~$420/month for zone redundancy.

If a zone fails, the remaining 2 instances handle all traffic. Ensure your application can handle the load with 2/3 of capacity.

Hint 5: Health Probe Best Practices

Health probe configuration for maximum availability:

Endpoint: Custom /health endpoint that checks database connectivity, cache availability, and disk space (not just TCP port check)
Protocol: HTTP/HTTPS (Layer 7) rather than TCP (Layer 4) for application-aware health
Interval: 5-15 seconds (shorter = faster detection but more overhead)
Unhealthy threshold: 2-3 failures (shorter = faster removal but more false positives)

Time to remove unhealthy VM = interval x threshold = 5s x 2 = 10 seconds
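
To compare the trade-off across the suggested ranges, the same formula can be evaluated for each combination. A small sketch; probe_detection_seconds is a hypothetical helper name, and the result is the nominal detection time, ignoring probe timeouts and network latency:

```shell
# Nominal time before a failed instance stops receiving new connections:
# probe interval (seconds) multiplied by the unhealthy threshold.
probe_detection_seconds() {
  echo $(( $1 * $2 ))
}

for interval in 5 15; do
  for threshold in 2 3; do
    echo "interval=${interval}s threshold=${threshold} -> $(probe_detection_seconds "$interval" "$threshold")s"
  done
done
```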

Custom health endpoint example:

app.MapGet("/health", async (DbContext db, IConnectionMultiplexer redis) =>
{
    var dbHealthy = await db.Database.CanConnectAsync();
    var redisHealthy = redis.IsConnected;
    return dbHealthy && redisHealthy ? Results.Ok() : Results.StatusCode(503);
});

Learning resources

Knowledge check

1. A government portal requires 99.99% uptime. The current deployment uses an availability set with 4 VMs. Why is this insufficient, and what change is needed?

Availability sets only provide a 99.95% SLA, which allows up to 4.38 hours of downtime per year - far exceeding the 52.6-minute budget for 99.99%. Availability sets protect against rack-level failures (fault domains) and platform updates (update domains) but cannot survive a full datacenter/zone failure. The fix is to migrate to availability zones, deploying VMs across at least 2 zones. This provides a 99.99% SLA because Azure guarantees the zones are physically separate datacenters with independent power, cooling, and networking.

2. A VMSS Flexible orchestration is configured across 3 availability zones with autoscale min=6. During a zone failure, how many instances remain, and is the service still available?

4 instances remain (6 spread evenly across 3 zones = 2 per zone; losing 1 zone = 4 remaining). The service remains available because the Standard Load Balancer's health probes detect the failed zone's instances as unhealthy and route all traffic to the 4 healthy instances in the remaining 2 zones. Autoscale may trigger to add instances in the healthy zones if the reduced capacity causes high CPU. For critical workloads, consider min=9 (3 per zone) so a zone failure leaves 6 instances - enough capacity without autoscale intervention.
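
The instance arithmetic in this answer can be checked with a short sketch. The helper name surviving_instances is hypothetical, and it assumes the worst case where the failed zone held the largest share of instances:

```shell
# Instances left after losing one zone, assuming the lost zone held
# ceil(total / zones) instances (worst case for an uneven spread).
surviving_instances() {
  total="$1"
  zones="$2"
  lost=$(( (total + zones - 1) / zones ))
  echo $(( total - lost ))
}

echo "min=6 across 3 zones -> $(surviving_instances 6 3) survive"
echo "min=9 across 3 zones -> $(surviving_instances 9 3) survive"
```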

3. An App Service plan is zone-redundant with 3 instances. Can you scale down to 1 instance during off-peak hours to save cost?

No. Zone-redundant App Service plans require a minimum of 3 instances at all times. This is a hard constraint because Azure needs at least one instance per zone to maintain zone redundancy. If you scale below 3, zone-redundancy is lost. For cost optimization with zone redundancy, use the smallest SKU that can handle your off-peak load with 3 instances (e.g., P1v3 instead of P2v3). Alternatively, if off-peak traffic is very low, consider whether you truly need zone redundancy 24/7 or only during business hours.

4. What is the composite SLA for an application that requires both a zone-redundant VM tier (99.99%) behind a Standard Load Balancer (99.99%) AND a zone-redundant App Service (99.99%)?

If both tiers must function for the application to work (serial dependency): 0.9999 x 0.9999 x 0.9999 = 0.9997, or about 99.97%. This is below the 99.99% target. To meet 99.99%, you need to either eliminate one dependency (use App Service for everything) or add redundancy. If two components are independent and either one can serve users, the parallel formula applies: 1 - (0.0001 x 0.0001) = 99.999999%. In practice, most applications have serial dependencies, so minimizing the number of chained services is critical for achieving 99.99%.
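
Both formulas are easy to evaluate for any list of component SLAs. A minimal sketch using awk for the floating-point math; serial_sla and parallel_sla are illustrative names, and the inputs are SLA percentages:

```shell
# Serial composite: every component must be up, so availabilities multiply.
serial_sla() {
  awk 'BEGIN { p = 1; for (i = 1; i < ARGC; i++) p *= ARGV[i] / 100; printf "%.4f\n", p * 100 }' "$@"
}

# Parallel composite: the system is down only if every component is down.
parallel_sla() {
  awk 'BEGIN { q = 1; for (i = 1; i < ARGC; i++) q *= 1 - ARGV[i] / 100; printf "%.6f\n", (1 - q) * 100 }' "$@"
}

echo "Serial (VMs, LB, App Service): $(serial_sla 99.99 99.99 99.99) percent"
echo "Parallel (two 99.99 components): $(parallel_sla 99.99 99.99) percent"
```

Adding a fourth serial dependency to the first call shows how quickly each extra chained service erodes the composite figure.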

Validation lab

This lab proves that zone-redundant VMSS with a Standard Load Balancer survives a full availability zone failure without manual intervention. You will observe traffic rerouting in real time.

Step 1: deploy zone-redundant VMSS with 6 instances

az group create --name rg-az305-challenge30 --location eastus

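# Uniform orchestration is set explicitly because the later steps rely on
# per-instance IDs (az vmss update-instances, --instance-ids).
az vmss create \
  --resource-group rg-az305-challenge30 \
  --name vmss-ha-lab \
  --orchestration-mode Uniform \
  --image Ubuntu2204 \
  --vm-sku Standard_B1s \
  --instance-count 6 \
  --zones 1 2 3 \
  --admin-username azureuser \
  --generate-ssh-keys \
  --load-balancer lb-ha-lab \
  --upgrade-policy-mode automatic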

Before continuing, confirm that the load balancer has an HTTP health probe and a load-balancing rule for port 80 (for example with az network lb probe list and az network lb rule list), and that a network security group allows inbound traffic on port 80. az vmss create does not necessarily create these, and Steps 4 to 7 depend on them.

Step 2: install nginx on all instances to serve hostname

az vmss extension set \
  --resource-group rg-az305-challenge30 \
  --vmss-name vmss-ha-lab \
  --name customScript \
  --publisher Microsoft.Azure.Extensions \
  --version 2.1 \
  --settings '{"commandToExecute":"apt-get update && apt-get install -y nginx && hostname > /var/www/html/index.html"}'

az vmss update-instances \
  --resource-group rg-az305-challenge30 \
  --name vmss-ha-lab \
  --instance-ids "*"

Step 3: verify zone distribution

az vmss list-instances \
  --resource-group rg-az305-challenge30 \
  --name vmss-ha-lab \
  --query "[].{Instance:instanceId, Zone:zones[0]}" \
  -o table

You should see 2 instances per zone (6 total across zones 1, 2, and 3).

Architect Insight

Zone-balanced distribution is critical for capacity planning. With 6 instances across 3 zones, losing one zone leaves 4 instances (67% capacity). Your minimum instance count must be calculated as: (instances needed at peak) * 3/2, rounded up, so that N-1 zones still handle full load.
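
The sizing rule can be expressed as a short calculation. This sketch rounds up per zone so the spread stays even, which can give a slightly higher number than peak x 3/2 rounded up; min_instances_for_peak is a hypothetical helper name:

```shell
# Minimum total instances so that (zones - 1) zones can still carry peak load.
# Each zone is sized to ceil(peak / (zones - 1)) to keep the spread balanced.
min_instances_for_peak() {
  peak="$1"
  zones="$2"
  per_zone=$(( (peak + zones - 2) / (zones - 1) ))
  echo $(( per_zone * zones ))
}

for peak in 4 5 12; do
  echo "peak needs $peak instances -> deploy at least $(min_instances_for_peak "$peak" 3) across 3 zones"
done
```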

Step 4: observe traffic distribution across zones

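# Look up the load balancer's public IP without assuming its generated name
# (this lab's resource group contains a single public IP).
LB_IP=$(az network public-ip list \
  --resource-group rg-az305-challenge30 \
  --query "[0].ipAddress" -o tsv)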

echo "Load Balancer IP: $LB_IP"

for i in $(seq 1 12); do
  curl -s --max-time 5 http://$LB_IP
  echo ""
done

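You should see responses from instances across all 3 zones. Azure Load Balancer distributes flows using a hash of the connection tuple rather than strict round-robin, so the order will look uneven; each curl call opens a new connection and can land on a different instance.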
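
Reading a dozen hostnames by eye is error-prone, so it can help to tally the responses per instance. A minimal sketch: tally_hosts is a hypothetical helper, and the hostnames in the demo are made-up sample data standing in for real curl output.

```shell
# Count how many responses each hostname returned (reads one hostname per
# line on stdin, skipping blank lines).
tally_hosts() {
  grep -v '^$' | sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Demo with made-up sample data. In the lab, pipe the curl loop instead:
#   for i in $(seq 1 12); do curl -s --max-time 5 "http://$LB_IP"; echo; done | tally_hosts
printf 'vmss-ha-lab000001\nvmss-ha-lab000003\n\nvmss-ha-lab000001\n' | tally_hosts
```

Running the same tally again after Step 5 makes it obvious which instances have dropped out of rotation.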

Step 5: simulate a zone failure

Identify and deallocate all instances in Zone 1:

ZONE1_INSTANCES=$(az vmss list-instances \
  --resource-group rg-az305-challenge30 \
  --name vmss-ha-lab \
  --query "[?zones[0]=='1'].instanceId" -o tsv)

echo "Deallocating Zone 1 instances: $ZONE1_INSTANCES"

for id in $ZONE1_INSTANCES; do
  az vmss deallocate \
    --resource-group rg-az305-challenge30 \
    --name vmss-ha-lab \
    --instance-ids $id \
    --no-wait
done

Step 6: wait for health probes to detect the failure

echo "Waiting 20 seconds for health probes to mark Zone 1 instances as unhealthy..."
sleep 20
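
A fixed sleep works for a lab, but a bounded polling loop is a common alternative when detection time varies. The sketch below is generic and makes no Azure calls; wait_until is a hypothetical helper, and the condition passed to it is whatever check you choose (for example, a script that confirms no Zone 1 hostname is returned).

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    n=$(( n + 1 ))
  done
  return 1
}

# Demo with trivial conditions; replace "true" with a real check in the lab.
if wait_until 3 0 true; then
  echo "condition met"
fi
```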

Step 7: verify traffic routes only to surviving zones

echo "Traffic after Zone 1 failure:"
for i in $(seq 1 10); do
  curl -s --max-time 5 http://$LB_IP
  echo ""
done

Only hostnames from Zone 2 and Zone 3 instances should appear. Zone 1 instances are gone from rotation.

Architect Insight

The Standard Load Balancer with health probes automatically removed failed instances from rotation. No manual intervention, no DNS changes, no application-level failover logic. This is the behavior that justifies the 99.99% SLA: the system self-heals at the infrastructure layer. For the AZ-305 exam, remember that rerouting across zones depends on the Standard SKU; the Basic SKU does not support availability zones or zone-redundant frontends.

Step 8: confirm remaining capacity

# Power state is only returned when the instance view is expanded
az vmss list-instances \
  --resource-group rg-az305-challenge30 \
  --name vmss-ha-lab \
  --expand instanceView \
  --query "[?instanceView.statuses[?code=='PowerState/running']].{Instance:instanceId, Zone:zones[0]}" \
  -o table

echo "Remaining running instances:"
az vmss list-instances \
  --resource-group rg-az305-challenge30 \
  --name vmss-ha-lab \
  --expand instanceView \
  --query "length([?instanceView.statuses[?code=='PowerState/running']])"

You should see 4 running instances across zones 2 and 3 only.

Step 9: restore zone 1 and verify full recovery

for id in $ZONE1_INSTANCES; do
  az vmss start \
    --resource-group rg-az305-challenge30 \
    --name vmss-ha-lab \
    --instance-ids $id \
    --no-wait
done

echo "Waiting 30 seconds for instances to restart and pass health probes..."
sleep 30

echo "Traffic after Zone 1 recovery:"
for i in $(seq 1 12); do
  curl -s --max-time 5 http://$LB_IP
  echo ""
done

All 3 zones should now appear in the responses again, confirming full recovery.

Architect Insight

Recovery is also automatic. Once Zone 1 instances pass health probes again, the load balancer adds them back to rotation. Size each zone to carry half of peak load, so the minimum instance count is 3x that per-zone figure (1.5x the peak requirement); losing any one zone then still leaves sufficient capacity without waiting for autoscale to react.

Design Validation

This lab proved three critical design properties: (1) Zone-redundant VMSS survives a complete availability zone failure with zero manual intervention. (2) The Standard Load Balancer plus health probes handle all traffic rerouting automatically -- no application changes needed. (3) Capacity planning must account for N-1 zones carrying full production load, meaning your minimum instance count should be at least 1.5x your peak requirement.

Cleanup

az group delete --name rg-az305-challenge30 --yes --no-wait

Next: Challenge 31: Design High Availability for Relational Data
