Challenge 39: Design an event-driven architecture
60-90 min | Estimated cost: $5-15 | Exam Weight: 30-35%
Introduction
SmartSpace Technologies operates a smart building platform that monitors 10,000 IoT sensors deployed across 50 commercial buildings. Sensors report temperature, humidity, occupancy, and energy consumption data every 5 seconds, generating approximately 120,000 events per minute (2,000 events per second sustained, with spikes to 5,000/second during building opening hours). The platform must handle four distinct event processing scenarios with different latency and durability requirements.
- Scenario 1 (Real-time alerts): When a temperature sensor exceeds 35C or occupancy exceeds fire safety limits, an alert must reach building management within 2 seconds.
- Scenario 2 (Near-real-time dashboards): Building operations dashboards must update within 10-15 seconds to show current conditions across all floors.
- Scenario 3 (Event archive): All sensor events must be archived for 7 years to support ML model training and regulatory compliance audits.
- Scenario 4 (Automated responses): When specific conditions are met (e.g., occupancy drops to zero AND energy consumption exceeds a threshold), the platform must trigger automated actions (adjust HVAC setpoints, dim lights, send maintenance notifications).
The challenge is designing an event-driven architecture that routes events to the appropriate processing pipeline based on the latency and durability requirements of each scenario.
Exam skills covered
- Recommend an event-driven architecture
Design tasks
Part 1: Event ingestion service selection
- Compare Azure event ingestion services for 2,000-5,000 events/second:
| Feature | Event Hubs | Event Grid | IoT Hub |
|---|---|---|---|
| Throughput | Millions/second | 10M events/second | Hundreds of thousands/second |
| Protocol | AMQP, Kafka, HTTPS | HTTP, MQTT (for IoT) | MQTT, AMQP, HTTPS |
| Consumer model | Pull (consumer groups) | Push (subscriptions) | Pull (consumer groups) + routing |
| Message retention | 1-7 days (Standard), up to 90 days (Premium/Dedicated) | 24 hours (retry) | 1-7 days |
| Ordering | Per-partition | No guarantee | Per-device |
| Device management | No | No | Yes (device twin, C2D) |
| Cost model | Per throughput unit, plus ingress events | Per operation | Per hub unit (the tier sets the daily message quota) |
- Determine the primary ingestion service:
  - Should IoT sensors connect directly to Event Hubs, or should IoT Hub be the entry point?
  - If IoT Hub: How does message routing direct events to downstream services?
  - What role does Event Grid play in this architecture (event distribution vs event ingestion)?
- Design the Event Hubs configuration:
  - How many partitions are needed for 5,000 events/second peak throughput?
  - How many throughput units (Standard) or processing units (Premium)?
  - What is the partition key strategy? (Building ID? Sensor type? Floor?)
Part 2: Real-time alert processing (Scenario 1)
- Design the real-time alert pipeline (2-second latency requirement):
  - Event source: Event Hubs (or IoT Hub route)
  - Processing: Evaluate threshold rules in-stream
  - Output: Push notification to building management
- Evaluate processing options for threshold detection:
  - Azure Stream Analytics: SQL-like queries, windowed aggregations, reference data joins
  - Azure Functions (Event Hub trigger): Custom code, per-event processing
  - Spark Structured Streaming (Databricks): Complex analytics, ML inference
- For the 2-second SLA, which processing option provides the lowest latency? Design the alert rule logic (e.g., temperature > 35C for 3 consecutive readings within 30 seconds, to avoid false positives from sensor noise).
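The example rule above (three consecutive readings over 35C within 30 seconds) can be sketched in plain Python to make the logic concrete before it is expressed in a stream processor. This is only an illustration: the `Reading` and `ConsecutiveThresholdRule` names are hypothetical, and firing once per excursion is an assumed design choice rather than something the scenario specifies.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

THRESHOLD_C = 35.0          # alert threshold from the scenario
REQUIRED_CONSECUTIVE = 3    # readings required before alerting
WINDOW_SECONDS = 30.0       # readings must fall inside this span


@dataclass
class Reading:
    sensor_id: str
    timestamp: float        # seconds since an arbitrary epoch
    temperature_c: float


class ConsecutiveThresholdRule:
    """Fires once when a sensor stays above the threshold for N readings."""

    def __init__(self):
        self._streaks = defaultdict(deque)  # sensor_id -> timestamps in the streak
        self._alerted = set()               # sensors with an open excursion

    def evaluate(self, reading):
        streak = self._streaks[reading.sensor_id]
        if reading.temperature_c <= THRESHOLD_C:
            # A normal reading ends the streak and re-arms the rule.
            streak.clear()
            self._alerted.discard(reading.sensor_id)
            return False
        streak.append(reading.timestamp)
        # Drop over-threshold readings that fell out of the 30-second window.
        while reading.timestamp - streak[0] > WINDOW_SECONDS:
            streak.popleft()
        if len(streak) >= REQUIRED_CONSECUTIVE and reading.sensor_id not in self._alerted:
            self._alerted.add(reading.sensor_id)
            return True
        return False


rule = ConsecutiveThresholdRule()
sustained = [rule.evaluate(Reading("t-1", t, 36.0)) for t in (0, 5, 10, 15)]
spike = [rule.evaluate(Reading("t-2", t, c)) for t, c in ((0, 36.0), (5, 30.0), (10, 36.0), (15, 36.0))]
print("sustained:", sustained)
print("spike:", spike)
```

The sustained sequence alerts on the third reading only, while the interrupted sequence never alerts.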
Part 3: Event distribution with Event Grid (Scenario 4)
- Design the automated response system using Event Grid:
  - Define custom events for building conditions (OccupancyZero, EnergyAnomaly, TemperatureExceedance)
  - Create Event Grid topics for each building or a single topic with subject filtering
  - Design subscriptions that trigger different Azure Functions based on event type
- Configure Event Grid filtering:
  - Subject prefix filter: /buildings/building-42/floors/3/
  - Advanced filter: data.temperature > 35 AND data.sensorType == 'ambient'
  - Determine which filtering level (subject vs advanced) is appropriate for each scenario
- Design the fan-out pattern:
  - One event (e.g., OccupancyZero) must trigger multiple actions simultaneously:
    - Adjust HVAC setpoints (call Building Management API)
    - Dim lights (call Lighting Control API)
    - Log to audit trail (write to Cosmos DB)
  - How does Event Grid guarantee delivery to all subscribers?
  - What happens if one subscriber is temporarily unavailable?
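To compare the two filtering levels in this part, it can help to evaluate them locally against sample events. The sketch below mimics a subject prefix filter plus two advanced filters that together express the condition quoted earlier (Event Grid combines a subscription's advanced filters with AND). The filter dictionaries follow the operatorType/key/value shape used by Event Grid, but the evaluation code, the sample events, and the case-insensitive matching are an illustrative approximation, not the service's implementation.

```python
SUBJECT_PREFIX = "/buildings/building-42/floors/3/"

# Two advanced filters, ANDed together, mirror:
#   data.temperature > 35 AND data.sensorType == 'ambient'
ADVANCED_FILTERS = [
    {"operatorType": "NumberGreaterThan", "key": "data.temperature", "value": 35},
    {"operatorType": "StringIn", "key": "data.sensorType", "values": ["ambient"]},
]


def lookup(event, dotted_key):
    """Resolve a key such as 'data.temperature' against a nested event dict."""
    value = event
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def passes_advanced_filter(event, flt):
    value = lookup(event, flt["key"])
    if flt["operatorType"] == "NumberGreaterThan":
        return isinstance(value, (int, float)) and value > flt["value"]
    if flt["operatorType"] == "StringIn":
        # Assumption: string comparison is treated as case-insensitive here.
        return isinstance(value, str) and value.lower() in {v.lower() for v in flt["values"]}
    raise ValueError(f"unsupported operator: {flt['operatorType']}")


def matches(event):
    """True when the event passes the subject prefix and every advanced filter."""
    if not event["subject"].lower().startswith(SUBJECT_PREFIX.lower()):
        return False
    return all(passes_advanced_filter(event, flt) for flt in ADVANCED_FILTERS)


# Hypothetical sample events for illustration.
hot_ambient = {"subject": "/buildings/building-42/floors/3/sensors/t-17",
               "data": {"temperature": 36.5, "sensorType": "ambient"}}
cool_ambient = {"subject": "/buildings/building-42/floors/3/sensors/t-17",
                "data": {"temperature": 34.0, "sensorType": "ambient"}}
other_floor = {"subject": "/buildings/building-42/floors/4/sensors/t-02",
               "data": {"temperature": 40.0, "sensorType": "ambient"}}

for name, event in [("hot_ambient", hot_ambient), ("cool_ambient", cool_ambient), ("other_floor", other_floor)]:
    print(name, matches(event))
```

Subject filters suit coarse routing by location in the hierarchy, while advanced filters inspect payload values.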
Part 4: Event archive and long-term storage (Scenario 3)
- Design the event archival strategy for 7-year retention:
  - Event Hubs Capture: Automatically writes events to Azure Storage or Data Lake in Avro format
  - Configure the capture window: a time limit (e.g., 5 minutes) and a size limit (e.g., 256 MB); Capture writes a file when either limit is reached first
  - Design the folder structure: {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}
- Calculate storage requirements:
  - 120,000 events/minute x 60 x 24 x 365 x 7 years
  - Average event size: 500 bytes
  - Total raw storage: estimate and identify the appropriate storage tier
  - When should data move from Hot to Cool to Archive tier?
- Design the data lake structure for ML model training:
  - Raw events in Avro (immutable, append-only)
  - Curated datasets in Parquet (aggregated, optimized for analytics)
  - How does the ML team query 7 years of sensor data efficiently?
Part 5: Consumer group strategy
- Design the consumer group allocation for Event Hubs:
  - Consumer Group 1: Real-time alert processor (Stream Analytics)
  - Consumer Group 2: Dashboard update service (Functions)
  - Consumer Group 3: Archival reader (needed only for a custom archiver; the built-in Event Hubs Capture feature does not read through a consumer group)
  - Consumer Group 4: ML feature pipeline (Databricks)
- Explain why each consumer needs its own consumer group and what happens if two different applications share a consumer group.
Success criteria
- ⬜Correct ingestion service selected (IoT Hub or Event Hubs) with device management rationale
- ⬜Real-time alert pipeline meets 2-second latency SLA with false-positive mitigation
- ⬜Event Grid configured for automated responses with filtering and fan-out patterns
- ⬜Event Hubs Capture configured for 7-year archival with storage tier lifecycle policy
- ⬜Consumer group strategy allocates independent readers for each processing scenario
- ⬜Architecture clearly separates concerns: ingestion, processing, distribution, and storage
Hints
Hint 1: IoT Hub vs Event Hubs as Entry Point
Use IoT Hub when you need:
- Per-device identity and authentication (X.509 certificates, SAS tokens per device)
- Device twin management (desired/reported properties)
- Cloud-to-device (C2D) commands (e.g., telling a sensor to recalibrate)
- Message routing rules that direct events to different endpoints based on message properties
Use Event Hubs directly when:
- Devices are managed by another system (e.g., a gateway that aggregates sensor data)
- You only need high-throughput event ingestion without device management
- The Kafka protocol compatibility is needed
For 10,000 IoT sensors, IoT Hub is recommended because you need per-device identity, firmware updates, and the ability to send commands back to sensors. IoT Hub has a built-in Event Hub-compatible endpoint for downstream processing.
Hint 2: Event Hubs Partition Strategy
Partitions determine parallelism:
- Each partition supports up to 1 MB/second ingress (Standard) or 20 MB/second (Premium)
- 5,000 events/second at 500 bytes each = 2.5 MB/second ingress
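- Standard tier: Need at least 3 partitions (1 MB/s each) and at least 5 throughput units, because each throughput unit also caps ingress at 1,000 events/second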
- Recommended: 8-16 partitions for headroom and parallel consumers
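The sizing arithmetic above can be checked with a few lines of Python. This sketch assumes the Standard-tier figures of 1 MB/s or 1,000 events/s of ingress per throughput unit and roughly 1 MB/s per partition, and it uses decimal megabytes to match the 2.5 MB/s figure; confirm the limits against the current quota documentation before relying on them.

```python
import math

PEAK_EVENTS_PER_SEC = 5_000   # peak rate from the scenario
AVG_EVENT_BYTES = 500         # average event size from the scenario
MB = 1_000_000                # decimal megabyte

# Assumed Standard-tier limits (verify against current Azure quotas).
TU_INGRESS_BYTES_PER_SEC = 1 * MB
TU_INGRESS_EVENTS_PER_SEC = 1_000
PARTITION_INGRESS_BYTES_PER_SEC = 1 * MB

ingress_bytes_per_sec = PEAK_EVENTS_PER_SEC * AVG_EVENT_BYTES

# Partitions are bounded by bytes per second.
min_partitions = math.ceil(ingress_bytes_per_sec / PARTITION_INGRESS_BYTES_PER_SEC)

# Throughput units are bounded by whichever limit is hit first: bytes or events.
tus_for_bytes = math.ceil(ingress_bytes_per_sec / TU_INGRESS_BYTES_PER_SEC)
tus_for_events = math.ceil(PEAK_EVENTS_PER_SEC / TU_INGRESS_EVENTS_PER_SEC)
min_throughput_units = max(tus_for_bytes, tus_for_events)

print(f"Peak ingress: {ingress_bytes_per_sec / MB:.1f} MB/s")
print(f"Minimum partitions (byte-bound): {min_partitions}")
print(f"Minimum throughput units: {min_throughput_units}")
```

With small events, the events-per-second limit is the binding constraint for throughput units, not the byte rate.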
Partition key strategy:
- Building ID: All events from one building go to the same partition (good for per-building processing order)
- Sensor ID: Even distribution but no building-level ordering
- Random (null key): Best throughput distribution, no ordering guarantees
For this scenario, Building ID as partition key ensures all events from a building are processed in order by the alert system, enabling multi-sensor correlation within a building.
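One trade-off of keying by Building ID is that only 50 distinct keys are spread over the partitions, so load can be uneven. The sketch below illustrates the idea with a stable hash; SHA-256 is a stand-in chosen for illustration (Event Hubs uses its own internal hash), and the 16-partition count and building names are assumptions.

```python
import hashlib
from collections import Counter

PARTITION_COUNT = 16  # assumed value from the recommended 8-16 range


def partition_for(key, partition_count=PARTITION_COUNT):
    """Map a partition key to a partition using a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


building_ids = [f"building-{n:02d}" for n in range(1, 51)]  # hypothetical IDs
assignment = {building: partition_for(building) for building in building_ids}
load = Counter(assignment.values())

# The same key always maps to the same partition, which is what preserves
# per-building ordering.
assert partition_for("building-42") == assignment["building-42"]

for partition in range(PARTITION_COUNT):
    print(f"partition {partition:2d}: {load.get(partition, 0)} buildings")
```

If some buildings have far more sensors than others, the skew in event volume can be larger than the skew in building count shown here.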
Hint 3: Event Grid Delivery Guarantees
Event Grid provides at-least-once delivery with retry:
- Default retry policy: 30 attempts over 24 hours with exponential backoff
- If a subscriber fails all retries, events go to a dead-letter container (must be configured)
- Each subscription delivers independently (fan-out is parallel)
- Subscriber unavailability does not block delivery to other subscribers
Configure dead-letter destination (Azure Blob Storage) for each subscription to capture undeliverable events. Set up Azure Monitor alerts on dead-letter count > 0.
Event Grid also supports batching (up to 5,000 events per delivery) and output schema customization (Event Grid schema, CloudEvents schema, or custom input schema).
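The independence of subscriptions described in this hint can be illustrated with a small simulation. It is a simplified model, not Event Grid's implementation: handlers are plain Python callables, retries happen immediately instead of with exponential backoff over 24 hours, and the subscriber names are hypothetical.

```python
MAX_ATTEMPTS = 30  # default maximum delivery attempts cited above


def deliver(event, subscriptions):
    """Deliver one event to every subscription independently.

    `subscriptions` maps a name to a handler that raises on failure.
    Returns (delivered, dead_lettered) lists of subscription names.
    """
    delivered, dead_lettered = [], []
    for name, handler in subscriptions.items():
        for _attempt in range(MAX_ATTEMPTS):
            try:
                handler(event)
                delivered.append(name)
                break
            except ConnectionError:
                continue  # the real service waits (backoff) before retrying
        else:
            # Every attempt failed: the event goes to the dead-letter container.
            dead_lettered.append(name)
    return delivered, dead_lettered


def make_flaky(failures_before_success):
    """Build a handler that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def handler(event):
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise ConnectionError("subscriber temporarily unavailable")

    return handler


def always_fail(event):
    raise ConnectionError("subscriber down")


subscriptions = {
    "hvac": lambda event: None,                      # healthy subscriber
    "lighting": always_fail,                         # never recovers
    "audit": make_flaky(failures_before_success=2),  # recovers on the third try
}
delivered, dead_lettered = deliver({"eventType": "OccupancyZero"}, subscriptions)
print("delivered:", delivered)
print("dead-lettered:", dead_lettered)
```

The failing lighting subscriber does not prevent delivery to the other two, and its event ends up dead-lettered only after all attempts are exhausted.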
Hint 4: Storage Calculation for 7-Year Archive
Calculation:
- 120,000 events/minute x 60 minutes x 24 hours x 365 days = ~63 billion events/year
- 63B events x 500 bytes = ~31.5 TB/year raw
- 7 years = ~220 TB raw (before compression)
- Avro with compression: ~50-70% reduction = ~66-110 TB actual storage
Storage tier strategy:
- Last 30 days: Hot tier ($0.018/GB/month) for active dashboards
- 30 days to 1 year: Cool tier ($0.01/GB/month) for ad-hoc analysis
- 1-7 years: Archive tier ($0.002/GB/month) for compliance retention
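The tier boundaries listed above could be expressed as an Azure Storage lifecycle management policy. The sketch below builds the policy document as a Python dictionary and prints it as JSON; the rule name is hypothetical, the 30-day and 365-day thresholds come from the list above, and the policy should be validated against the current lifecycle management schema before use.

```python
import json

lifecycle_policy = {
    "rules": [
        {
            "enabled": True,
            "name": "tier-sensor-archive",  # hypothetical rule name
            "type": "Lifecycle",
            "definition": {
                # Capture output is written as block blobs.
                "filters": {"blobTypes": ["blockBlob"]},
                "actions": {
                    "baseBlob": {
                        # Hot -> Cool after 30 days, Cool -> Archive after 1 year.
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 365},
                    }
                },
            },
        }
    ]
}

policy_json = json.dumps(lifecycle_policy, indent=2)
print(policy_json)
```

Because captured files are written once and never modified, days since modification is effectively the age of the data.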
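A lifecycle management policy automates the tier transitions. At the prices above and with 50-70% compression, the steady-state storage charge for the full 7-year archive is roughly $200-400/month, excluding transaction, retrieval, and early-deletion charges.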
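The arithmetic in this hint can be checked with a short script. The sketch assumes decimal units (1 TB = 10^12 bytes), the per-GB prices quoted above, and a steady state in which each year of data spends 30 days in Hot, the rest of its first year in Cool, and six years in Archive. It ignores transaction, retrieval, and early-deletion charges, and the result depends heavily on the compression ratio, so treat the cost as an order-of-magnitude check.

```python
EVENTS_PER_MINUTE = 120_000
AVG_EVENT_BYTES = 500
RETENTION_YEARS = 7
TB = 1_000_000_000_000   # decimal terabyte
GB_PER_TB = 1_000

events_per_year = EVENTS_PER_MINUTE * 60 * 24 * 365
raw_tb_per_year = events_per_year * AVG_EVENT_BYTES / TB
raw_tb_total = raw_tb_per_year * RETENTION_YEARS

# Prices (USD per GB per month) and time spent in each tier, from the hint.
PRICE_PER_GB_MONTH = {"hot": 0.018, "cool": 0.01, "archive": 0.002}
YEARS_IN_TIER = {"hot": 30 / 365, "cool": 335 / 365, "archive": RETENTION_YEARS - 1}


def monthly_cost(stored_fraction):
    """Steady-state monthly cost when data shrinks to `stored_fraction` of raw size."""
    tb_per_year = raw_tb_per_year * stored_fraction
    return sum(
        tb_per_year * YEARS_IN_TIER[tier] * GB_PER_TB * PRICE_PER_GB_MONTH[tier]
        for tier in PRICE_PER_GB_MONTH
    )


print(f"Events per year: {events_per_year:,}")
print(f"Raw storage: {raw_tb_per_year:.1f} TB/year, {raw_tb_total:.1f} TB over {RETENTION_YEARS} years")
for reduction in (0.7, 0.5):
    stored = 1 - reduction
    print(f"{reduction:.0%} compression: {raw_tb_total * stored:.0f} TB stored, "
          f"about ${monthly_cost(stored):.0f}/month")
```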
Learning resources
- Azure Event Hubs overview
- Azure Event Grid overview
- Event Hubs Capture
- IoT Hub message routing
- Choose between Azure messaging services
Knowledge check
1. Two applications read from the same Event Hub using the same consumer group. What problem occurs?
The applications compete for partitions and each receives only a subset of events. Within a consumer group, each partition is assigned to at most one consumer instance. If Application A and Application B share a consumer group across 8 partitions, they split ownership (e.g., A gets partitions 0-3, B gets partitions 4-7). Neither application sees all events. To allow both applications to independently read all events, they must use separate consumer groups. Each consumer group maintains its own read position (offset) per partition.
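The split described in this answer can be modeled in a few lines. This is a deliberately simplified picture: real partition ownership is negotiated dynamically by the client library's load balancer, whereas the sketch just divides partitions into contiguous blocks, and the application names are hypothetical.

```python
def split_ownership(partitions, consumers):
    """Divide partitions into contiguous blocks, one block per consumer in a group."""
    share = -(-len(partitions) // len(consumers))  # ceiling division
    return {
        consumer: partitions[i * share:(i + 1) * share]
        for i, consumer in enumerate(consumers)
    }


PARTITIONS = list(range(8))

# One shared consumer group: the two applications split the partitions.
shared_group = split_ownership(PARTITIONS, ["app-a", "app-b"])

# Separate consumer groups: each application is the only reader in its group.
separate_groups = {
    app: split_ownership(PARTITIONS, [app])[app] for app in ("app-a", "app-b")
}

print("shared group:   ", shared_group)
print("separate groups:", separate_groups)
```

In the shared case each application sees half the partitions; with separate groups both see all eight.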
2. Why is Event Grid better than Event Hubs for the automated response scenario (fan-out to multiple subscribers)?
Event Grid uses push-based delivery to multiple subscribers simultaneously, while Event Hubs requires each subscriber to pull and maintain its own offset. For automated responses where one event must trigger 3-5 different actions (HVAC, lighting, audit), Event Grid natively supports multiple subscriptions per topic, each receiving the event independently with its own retry policy and dead-letter configuration. With Event Hubs, you would need each action handler to poll the hub, maintain checkpoints, and process all events even when only a subset are relevant. Event Grid's server-side filtering reduces unnecessary processing.
3. Event Hubs Capture writes events to Blob Storage every 5 minutes or every 256 MB, whichever comes first. Why not capture every 1 second for lower archival latency?
Frequent capture creates excessive small files that degrade downstream query performance and increase storage transaction costs. Each capture window creates a separate Avro file, so capturing every second would produce 86,400 files per partition per day. (Capture does not allow this in any case: the time window is configurable only from 60 to 900 seconds.) Analytic engines (Spark, Synapse) perform poorly when scanning millions of tiny files rather than fewer larger files. The 5-minute window balances archival latency (a delay of at most 5 minutes) against file size. If near-real-time archival is needed, use a dedicated consumer group writing to Data Lake via a custom process with file compaction.
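The file counts behind this answer follow from simple division. The sketch assumes the time window always triggers before the size limit and that every window produces one file per partition; the 16-partition figure is an assumption taken from the recommended range in Hint 2.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PARTITIONS = 16  # assumed partition count


def files_per_partition_per_day(window_seconds):
    """Files written per partition per day if every time window yields one file."""
    return SECONDS_PER_DAY // window_seconds


for label, window in [("1 second", 1), ("5 minutes", 300)]:
    per_partition = files_per_partition_per_day(window)
    print(f"{label:>9}: {per_partition:,} files/partition/day, "
          f"{per_partition * PARTITIONS:,} files/day across {PARTITIONS} partitions")
```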
4. A temperature alert fires when a single sensor reading exceeds 35C. The building manager reports too many false alarms from brief sensor spikes. How do you reduce false positives?
Use a tumbling or hopping window in Stream Analytics to require multiple consecutive readings above the threshold before firing an alert. Instead of alerting on a single reading, configure the rule to require 3 consecutive readings above 35C within a 30-second window (temporal pattern matching). Stream Analytics supports the LAG() function and windowed aggregations for this purpose. Alternatively, use a sliding window average: alert only when the 60-second moving average exceeds 34C. Both approaches filter transient sensor noise while still detecting genuine temperature excursions. They delay detection by design (three readings at 5-second intervals span at least 10 seconds), so the 2-second SLA applies to delivering the alert once the rule is satisfied, not to the time since the first high reading.
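The moving-average alternative can be sketched as follows. The class name is hypothetical, and the minimum sample count of 12 (a full 60-second window at one reading every 5 seconds) is an assumption added so that a single early spike cannot trigger an alert on its own.

```python
from collections import deque


class MovingAverageRule:
    """Alerts when the average over a time window exceeds a threshold."""

    def __init__(self, window_seconds=60.0, threshold_c=34.0, min_samples=12):
        self.window_seconds = window_seconds
        self.threshold_c = threshold_c
        self.min_samples = min_samples
        self._readings = deque()  # (timestamp, temperature) pairs in the window

    def add(self, timestamp, temperature_c):
        self._readings.append((timestamp, temperature_c))
        # Evict readings that are a full window (or more) older than the newest one.
        while timestamp - self._readings[0][0] >= self.window_seconds:
            self._readings.popleft()
        if len(self._readings) < self.min_samples:
            return False
        average = sum(temp for _, temp in self._readings) / len(self._readings)
        return average > self.threshold_c


# One brief spike among normal readings: the average stays low.
spike_rule = MovingAverageRule()
spike_results = [spike_rule.add(i * 5, 45.0 if i == 6 else 30.0) for i in range(12)]

# Sustained high temperature: the rule fires once the window is full.
sustained_rule = MovingAverageRule()
sustained_results = [sustained_rule.add(i * 5, 36.0) for i in range(12)]

print("spike:", any(spike_results))
print("sustained:", sustained_results[-1])
```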
Validation lab
This lab validates Event Grid's reactive architecture by observing events flow from resource changes to a storage queue in near-real-time. You will prove that events are pushed without polling, that subscription filters reduce noise, and that event-driven decoupling works at the platform level.
Part A - Deploy Event Grid infrastructure
- Create the resource group:
az group create \
--name rg-az305-challenge39 \
--location eastus
- Create a Storage Account with a queue to act as the event handler:
az storage account create \
--resource-group rg-az305-challenge39 \
--name stgevents39$RANDOM \
--sku Standard_LRS \
--location eastus
ST_NAME=$(az storage account list \
--resource-group rg-az305-challenge39 \
--query "[0].name" -o tsv)
echo "Storage account: $ST_NAME"
az storage queue create \
--name event-sink \
--account-name $ST_NAME
- Create an Event Grid system topic on the resource group to monitor resource-level events:
RG_ID=$(az group show \
--name rg-az305-challenge39 \
--query "id" -o tsv)
az eventgrid system-topic create \
--resource-group rg-az305-challenge39 \
--name systopic-rg-events \
--topic-type Microsoft.Resources.ResourceGroups \
--source "$RG_ID" \
--location eastus
Part B - Subscribe to resource creation events
- Create an event subscription that routes resource write (creation) events to the storage queue:
ST_ID=$(az storage account show \
--resource-group rg-az305-challenge39 \
--name $ST_NAME \
--query "id" -o tsv)
az eventgrid system-topic event-subscription create \
--resource-group rg-az305-challenge39 \
--system-topic-name systopic-rg-events \
--name sub-all-writes \
--endpoint-type storagequeue \
--endpoint "$ST_ID/queueservices/default/queues/event-sink" \
--included-event-types Microsoft.Resources.ResourceWriteSuccess
Event Grid enables reactive architectures without polling. The storage queue receives events pushed by the platform within seconds of the resource change occurring. No consumer needs to poll Azure Resource Manager asking "did anything change?" -- the platform notifies subscribers proactively. This reduces latency and eliminates wasted API calls.
Part C - Trigger an event and verify delivery
- Trigger an event by creating a simple resource (a second storage account) in the resource group:
az storage account create \
--resource-group rg-az305-challenge39 \
--name sttrigger39$RANDOM \
--sku Standard_LRS \
--location eastus
- Wait 15-30 seconds for Event Grid to deliver the event, then check the storage queue:
az storage message peek \
--queue-name event-sink \
--account-name $ST_NAME \
--num-messages 5
You should see one or more messages in the queue. Each message contains the Event Grid event payload.
- Examine the event structure. The payload includes:
  - eventType: "Microsoft.Resources.ResourceWriteSuccess"
  - subject: the resource ID of the created storage account
  - data: contains resource group, resource provider, and operation details
  - eventTime: timestamp proving near-real-time delivery
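If the peeked message content appears as an opaque Base64 string rather than JSON, it can be decoded to inspect the fields listed above. The sketch assumes the content is Base64-encoded UTF-8 JSON; the sample event is fabricated for illustration, with placeholder values in angle brackets, and is not output from the lab.

```python
import base64
import json


def decode_queue_message(content):
    """Decode a Base64 queue message body into the event dictionary it carries."""
    return json.loads(base64.b64decode(content).decode("utf-8"))


# Hypothetical event used only to exercise the decoder.
sample_event = {
    "id": "00000000-0000-0000-0000-000000000000",
    "eventType": "Microsoft.Resources.ResourceWriteSuccess",
    "subject": "/subscriptions/<subscription-id>/resourceGroups/rg-az305-challenge39"
               "/providers/Microsoft.Storage/storageAccounts/<account-name>",
    "eventTime": "2024-01-01T00:00:00Z",
    "data": {"resourceProvider": "Microsoft.Storage"},
}
encoded = base64.b64encode(json.dumps(sample_event).encode("utf-8")).decode("ascii")

event = decode_queue_message(encoded)
print(event["eventType"])
print(event["subject"])
print(event["eventTime"])
```

In the lab, the string to pass to `decode_queue_message` would be the content field of a message returned by the peek command.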
Filtering at the subscription level reduces unnecessary processing. By specifying --included-event-types, only resource write events reach the queue. Without filtering, delete events, action events, and other noise would also arrive, forcing the consumer to discard irrelevant messages. This filtering happens at the Event Grid platform level -- the messages never even reach the queue endpoint.
Part D - Filtered subscription (selective routing)
- Create a second queue and a filtered subscription that only captures storage account events:
az storage queue create \
--name storage-events-only \
--account-name $ST_NAME
az eventgrid system-topic event-subscription create \
--resource-group rg-az305-challenge39 \
--system-topic-name systopic-rg-events \
--name sub-storage-only \
--endpoint-type storagequeue \
--endpoint "$ST_ID/queueservices/default/queues/storage-events-only" \
--included-event-types Microsoft.Resources.ResourceWriteSuccess \
--subject-begins-with "/subscriptions" \
--advanced-filter data.resourceProvider StringContains Microsoft.Storage
- Now trigger a non-storage event by creating a different resource type (a network security group):
az network nsg create \
--resource-group rg-az305-challenge39 \
--name nsg-filter-test
- Wait 15-30 seconds, then verify the filtered subscription did NOT receive the NSG event:
az storage message peek \
--queue-name storage-events-only \
--account-name $ST_NAME \
--num-messages 5
The storage-events-only queue should contain no message for the NSG creation, because the event's resource provider (Microsoft.Network) does not match the Microsoft.Storage filter. The unfiltered event-sink queue will have received the NSG event:
az storage message peek \
--queue-name event-sink \
--account-name $ST_NAME \
--num-messages 10
At-least-once delivery means consumers must be idempotent. Event Grid guarantees delivery but may deliver the same event more than once (during retries or infrastructure recovery). Consumers must handle duplicate events gracefully -- typically by checking whether the action was already performed before executing it again. This is a fundamental design principle for any event-driven architecture on the AZ-305 exam.
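A minimal way to make a handler idempotent is to remember which event IDs have already been processed. The sketch below keeps the IDs in memory purely for illustration; a real consumer would need a durable store shared across instances, and the class and event names are hypothetical.

```python
class IdempotentHandler:
    """Wraps an action so that a redelivered event is processed only once."""

    def __init__(self, action):
        self._action = action
        self._seen_ids = set()  # in-memory only; use a durable store in practice

    def handle(self, event):
        """Run the action for new events; return False for duplicates."""
        event_id = event["id"]
        if event_id in self._seen_ids:
            return False
        self._action(event)
        # Record the ID only after the action succeeds, so a failed attempt
        # can still be retried on redelivery.
        self._seen_ids.add(event_id)
        return True


calls = []
handler = IdempotentHandler(lambda event: calls.append(event["id"]))
event = {"id": "evt-001", "eventType": "OccupancyZero"}

first = handler.handle(event)
second = handler.handle(event)  # simulated duplicate delivery
print(first, second, calls)
```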
This lab validated three architectural principles: (1) Event Grid pushes events in near-real-time without any consumer polling -- the storage queue received the event within seconds. (2) Subscription filters reduce noise at the platform level -- the filtered subscription rejected non-storage events before they reached the endpoint. (3) Event-driven architectures decouple producers from consumers -- the resource creation had no knowledge of the event subscriptions, yet the events were delivered automatically.
Cleanup
az eventgrid system-topic delete \
--resource-group rg-az305-challenge39 \
--name systopic-rg-events \
--yes
az group delete --name rg-az305-challenge39 --yes --no-wait