
Challenge 07: Monitor Azure AI Resources

Estimated Time

45-60 min | Cost: ~$0.50 (Log Analytics ingestion) | Domain: Plan & Manage AI Solutions (20-25%)

Exam skills covered

  • Monitor an Azure AI resource
  • Configure diagnostic settings for Azure AI services
  • Query metrics and logs using Azure Monitor and KQL

Overview

Monitoring Azure AI resources is essential for maintaining reliability, tracking usage patterns, and detecting issues before they impact users. Azure Monitor provides a unified platform for collecting metrics, logs, and traces from AI services, including signals such as latency, request counts, error rates, and token consumption.

In this challenge, you'll configure diagnostic settings to route logs and metrics to a Log Analytics workspace, write KQL queries to analyze service behavior, and set up alert rules for critical thresholds. You'll work with key metrics like TotalCalls, TotalErrors, Latency, and TokenTransaction.

Understanding the monitoring pipeline — from diagnostic settings through Log Analytics to alerts — is a core skill for managing production AI deployments at scale.

Architecture

Diagnostic settings route metrics and logs from Azure AI services to Log Analytics, enabling KQL queries and alert rules.

Challenge 07 topology

Prerequisites

  • Azure subscription with an Azure AI services resource
  • Log Analytics workspace (or create one in Task 1)
  • Azure CLI installed
  • Contributor role on the resource group

Implementation

Task 1: Create Log Analytics Workspace and Enable Diagnostic Settings

from azure.identity import DefaultAzureCredential
from azure.mgmt.loganalytics import LogAnalyticsManagementClient
from azure.mgmt.loganalytics.models import Workspace, WorkspaceSku
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    DiagnosticSettingsResource,
    LogSettings,
    MetricSettings,
    RetentionPolicy,
)

credential = DefaultAzureCredential()
subscription_id = "<your-subscription-id>"
resource_group = "rg-ai102-challenge07"

# Create Log Analytics workspace.
# The typed model is used instead of a nested dict so that the SKU and
# retention values are serialized under the correct property names.
la_client = LogAnalyticsManagementClient(credential, subscription_id)
workspace = la_client.workspaces.begin_create_or_update(
    resource_group_name=resource_group,
    workspace_name="law-ai102-monitor",
    parameters=Workspace(
        location="eastus",
        sku=WorkspaceSku(name="PerGB2018"),
        retention_in_days=30,
    ),
).result()
print(f"Workspace created: {workspace.id}")

# Enable diagnostic settings on AI services resource
monitor_client = MonitorManagementClient(credential, subscription_id)
ai_resource_id = (
    f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
    f"/providers/Microsoft.CognitiveServices/accounts/ai-monitor-demo"
)

# Note: retention_policy only affects storage account destinations.
# Data sent to Log Analytics follows the workspace retention set above.
diagnostic_settings = monitor_client.diagnostic_settings.create_or_update(
    resource_uri=ai_resource_id,
    name="ai-diagnostics",
    parameters=DiagnosticSettingsResource(
        workspace_id=workspace.id,
        logs=[
            LogSettings(
                category="Audit",
                enabled=True,
                retention_policy=RetentionPolicy(enabled=True, days=30),
            ),
            LogSettings(
                category="RequestResponse",
                enabled=True,
                retention_policy=RetentionPolicy(enabled=True, days=30),
            ),
        ],
        metrics=[
            MetricSettings(
                category="AllMetrics",
                enabled=True,
                retention_policy=RetentionPolicy(enabled=True, days=30),
            )
        ],
    ),
)
print(f"Diagnostic settings created: {diagnostic_settings.name}")
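
The same resource ID string is assembled by hand in Tasks 1, 2, and 4. A small helper can keep the three copies in sync. The function name `cognitive_services_resource_id` is hypothetical and not part of any Azure SDK; this is only a sketch of the ID format already used in this challenge.

```python
def cognitive_services_resource_id(
    subscription_id: str, resource_group: str, account_name: str
) -> str:
    """Build the ARM resource ID for an Azure AI services account.

    Hypothetical helper: it only formats the string and does not call Azure.
    """
    parts = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "account_name": account_name,
    }
    for label, value in parts.items():
        # An empty segment or a stray "/" would produce a malformed ID.
        if not value or "/" in value:
            raise ValueError(f"{label} must be a non-empty string without '/'")
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.CognitiveServices/accounts/{account_name}"
    )


# Usage with the names from this challenge
print(cognitive_services_resource_id(
    "<your-subscription-id>", "rg-ai102-challenge07", "ai-monitor-demo"
))
```

Building the ID in one place makes a typo in the provider namespace show up everywhere at once, rather than in a single task.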

Task 2: Query Metrics via Azure Monitor REST API

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from datetime import datetime, timedelta, timezone

credential = DefaultAzureCredential()
subscription_id = "<your-subscription-id>"
monitor_client = MonitorManagementClient(credential, subscription_id)

resource_id = (
    f"/subscriptions/{subscription_id}/resourceGroups/rg-ai102-challenge07"
    f"/providers/Microsoft.CognitiveServices/accounts/ai-monitor-demo"
)

# Query metrics for the last 24 hours.
# Timezone-aware UTC timestamps replace the deprecated datetime.utcnow().
end_time = datetime.now(timezone.utc)
start_time = end_time - timedelta(hours=24)
iso_format = "%Y-%m-%dT%H:%M:%SZ"
timespan = f"{start_time.strftime(iso_format)}/{end_time.strftime(iso_format)}"

# Get call, error, latency, and token metrics in hourly buckets
metrics_response = monitor_client.metrics.list(
    resource_uri=resource_id,
    timespan=timespan,
    interval="PT1H",
    metricnames="TotalCalls,TotalErrors,Latency,TokenTransaction",
    aggregation="Total,Average",
)

for metric in metrics_response.value:
    # Print the unit reported by the service instead of assuming milliseconds
    print(f"\n=== {metric.name.value} ({metric.unit}) ===")
    for timeseries in metric.timeseries:
        for data_point in timeseries.data:
            if data_point.total is not None:
                print(f"  {data_point.time_stamp}: Total={data_point.total}")
            if data_point.average is not None:
                print(f"  {data_point.time_stamp}: Avg={data_point.average:.2f}")
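
TotalCalls and TotalErrors are more useful together than apart. The sketch below combines two hourly series into an error rate in percent. It assumes the data points have already been copied into plain dictionaries keyed by timestamp. The function name and the sample values are hypothetical and not taken from a real resource.

```python
def hourly_error_rate(total_calls: dict, total_errors: dict) -> dict:
    """Return {timestamp: error rate in percent} for hours that had traffic.

    Both inputs map a timestamp to the 'Total' aggregation for that hour.
    Missing or None error values are treated as zero errors.
    """
    rates = {}
    for timestamp, calls in sorted(total_calls.items()):
        if not calls:
            # Skip hours with no calls (None or 0) to avoid dividing by zero.
            continue
        errors = total_errors.get(timestamp) or 0
        rates[timestamp] = round(100.0 * errors / calls, 2)
    return rates


# Hypothetical sample values for illustration only
calls = {
    "2024-01-01T00:00:00Z": 200,
    "2024-01-01T01:00:00Z": 0,
    "2024-01-01T02:00:00Z": 50,
}
errors = {
    "2024-01-01T00:00:00Z": 10,
    "2024-01-01T02:00:00Z": None,
}
for timestamp, rate in hourly_error_rate(calls, errors).items():
    print(f"{timestamp}: {rate}% errors")
```

A percentage like this is easier to compare across quiet and busy hours than a raw error count.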

Task 3: Write KQL Queries for AI Service Logs

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient
from datetime import timedelta

credential = DefaultAzureCredential()
logs_client = LogsQueryClient(credential)

# Workspace ID (the GUID shown on the workspace overview), not the ARM resource ID
workspace_id = "<your-workspace-id>"

# KQL: Top operations by count and average duration
kql_operations = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| summarize
    RequestCount = count(),
    AvgDuration = avg(DurationMs),
    P95Duration = percentile(DurationMs, 95),
    ErrorCount = countif(ResultType == "Failed")
    by OperationName
| sort by RequestCount desc
"""

response = logs_client.query_workspace(
    workspace_id=workspace_id,
    query=kql_operations,
    timespan=timedelta(days=1),
)

print("=== Operations Summary ===")
for row in response.tables[0].rows:
    print(f"  {row[0]}: {row[1]} calls, Avg: {row[2]:.0f}ms, P95: {row[3]:.0f}ms, Errors: {row[4]}")

# KQL: Error analysis
kql_errors = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where ResultType == "Failed"
| where TimeGenerated > ago(24h)
| summarize ErrorCount = count() by ResultSignature, OperationName
| sort by ErrorCount desc
| take 10
"""

error_response = logs_client.query_workspace(
    workspace_id=workspace_id,
    query=kql_errors,
    timespan=timedelta(days=1),
)

# Columns are ResultSignature, OperationName, ErrorCount
print("\n=== Error Analysis ===")
for row in error_response.tables[0].rows:
    print(f"  {row[1]} - {row[0]}: {row[2]} errors")

# KQL: Token usage over time (for Azure OpenAI)
# properties_s is a string column, so it must be parsed before reading fields
kql_tokens = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where Category == "RequestResponse"
| where TimeGenerated > ago(24h)
| extend props = parse_json(properties_s)
| extend promptTokens = toint(props.promptTokens)
| extend completionTokens = toint(props.completionTokens)
| summarize
    TotalPromptTokens = sum(promptTokens),
    TotalCompletionTokens = sum(completionTokens),
    TotalTokens = sum(promptTokens) + sum(completionTokens)
    by bin(TimeGenerated, 1h)
| sort by TimeGenerated asc
"""

token_response = logs_client.query_workspace(
    workspace_id=workspace_id,
    query=kql_tokens,
    timespan=timedelta(days=1),
)

print("\n=== Token Usage (Hourly) ===")
for row in token_response.tables[0].rows:
    print(f"  {row[0]}: Prompt={row[1]}, Completion={row[2]}, Total={row[3]}")
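
Indexing rows by position breaks if the query's column order changes, and a format such as `:.0f` raises a `TypeError` when an aggregate comes back null. The sketch below shows one way to guard against both. It assumes the table exposes its column names as a list of strings, as `LogsTable.columns` does in recent `azure-monitor-query` releases. The helper names are hypothetical, and the second sample row is made up to show the null case.

```python
def rows_to_dicts(columns, rows):
    """Pair each row's values with the column names so fields are read by name."""
    return [dict(zip(columns, row)) for row in rows]


def format_ms(value):
    """Format a duration in milliseconds, tolerating null aggregates."""
    return "n/a" if value is None else f"{value:.0f}ms"


# Column names match the operations summary query in Task 3.
columns = ["OperationName", "RequestCount", "AvgDuration", "P95Duration", "ErrorCount"]
rows = [
    ["TextAnalytics.Analyze", 1247, 342.0, 890.0, 3],
    ["Hypothetical.Operation", 5, None, None, 0],  # illustrative row with no DurationMs
]

for record in rows_to_dicts(columns, rows):
    print(
        f"{record['OperationName']}: {record['RequestCount']} calls, "
        f"Avg: {format_ms(record['AvgDuration'])}, "
        f"P95: {format_ms(record['P95Duration'])}, "
        f"Errors: {record['ErrorCount']}"
    )
```

With a real response, `columns` and `rows` would come from `response.tables[0]` rather than from literals.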

Task 4: Create Alert Rule for High Latency

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertResource,
    MetricAlertSingleResourceMultipleMetricCriteria,
    MetricCriteria,
    MetricAlertAction,
)

credential = DefaultAzureCredential()
subscription_id = "<your-subscription-id>"
monitor_client = MonitorManagementClient(credential, subscription_id)

resource_id = (
    f"/subscriptions/{subscription_id}/resourceGroups/rg-ai102-challenge07"
    f"/providers/Microsoft.CognitiveServices/accounts/ai-monitor-demo"
)

# Create metric alert for high latency (> 2000ms average)
alert = monitor_client.metric_alerts.create_or_update(
    resource_group_name="rg-ai102-challenge07",
    rule_name="high-latency-alert",
    parameters=MetricAlertResource(
        location="global",
        description="Alert when average latency exceeds 2000ms",
        severity=2,
        enabled=True,
        scopes=[resource_id],
        evaluation_frequency="PT5M",  # evaluate the condition every 5 minutes
        window_size="PT15M",          # over the previous 15 minutes of data
        criteria=MetricAlertSingleResourceMultipleMetricCriteria(
            all_of=[
                MetricCriteria(
                    name="HighLatency",
                    metric_name="Latency",
                    metric_namespace="Microsoft.CognitiveServices/accounts",
                    operator="GreaterThan",
                    threshold=2000,
                    time_aggregation="Average",
                )
            ]
        ),
        actions=[
            MetricAlertAction(
                action_group_id=(
                    f"/subscriptions/{subscription_id}/resourceGroups/rg-ai102-challenge07"
                    f"/providers/Microsoft.Insights/actionGroups/ai-ops-team"
                )
            )
        ],
    ),
)
print(f"Alert rule created: {alert.name}")

# Create alert for a high error count (> 10 errors in a 5-minute window).
# This is an absolute count, not a percentage of total calls.
error_alert = monitor_client.metric_alerts.create_or_update(
    resource_group_name="rg-ai102-challenge07",
    rule_name="high-error-rate-alert",
    parameters=MetricAlertResource(
        location="global",
        description="Alert when more than 10 errors occur within 5 minutes",
        severity=1,
        enabled=True,
        scopes=[resource_id],
        evaluation_frequency="PT5M",
        window_size="PT5M",
        criteria=MetricAlertSingleResourceMultipleMetricCriteria(
            all_of=[
                MetricCriteria(
                    name="HighErrors",
                    metric_name="TotalErrors",
                    metric_namespace="Microsoft.CognitiveServices/accounts",
                    operator="GreaterThan",
                    threshold=10,
                    time_aggregation="Total",
                )
            ]
        ),
        actions=[],  # no action group attached; the rule records alerts without notifying
    ),
)
print(f"Error alert created: {error_alert.name}")
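
To see how the evaluation frequency and window size interact, the simulation below re-checks an average-latency condition every 5 minutes over the preceding 15 minutes. It is a simplified local model, not Azure Monitor's actual evaluation engine. The function name and the per-minute latency samples are hypothetical.

```python
from statistics import mean


def evaluate_alert(samples, threshold, window_min=15, frequency_min=5):
    """Return the evaluation minutes at which the windowed average exceeds threshold.

    samples is a list of (minute, latency_ms) pairs. At each multiple of
    frequency_min, the average of samples in the preceding window_min
    minutes is compared with the threshold.
    """
    if not samples:
        return []
    last_minute = max(minute for minute, _ in samples)
    breached = []
    for t in range(frequency_min, last_minute + 1, frequency_min):
        window = [value for minute, value in samples if t - window_min < minute <= t]
        if window and mean(window) > threshold:
            breached.append(t)
    return breached


# Hypothetical traffic: 1000 ms baseline with a 5-minute spike to 5000 ms.
samples = [(m, 5000 if 11 <= m <= 15 else 1000) for m in range(1, 31)]

print(evaluate_alert(samples, threshold=2000))                # 15-minute window
print(evaluate_alert(samples, threshold=2000, window_min=5))  # 5-minute window
```

In this model the 15-minute window keeps the condition breached for three consecutive evaluations after one 5-minute spike, while a 5-minute window breaches only once. A longer window smooths short spikes but also holds the condition true for longer.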

Expected Output

=== Operations Summary ===
TextAnalytics.Analyze: 1247 calls, Avg: 342ms, P95: 890ms, Errors: 3
OpenAI.ChatCompletions: 856 calls, Avg: 1205ms, P95: 3400ms, Errors: 12
TextAnalytics.DetectLanguage: 432 calls, Avg: 156ms, P95: 340ms, Errors: 0

=== Error Analysis ===
OpenAI.ChatCompletions - 429: 8 errors
OpenAI.ChatCompletions - 500: 4 errors
TextAnalytics.Analyze - 400: 3 errors

=== Alert Rules ===
Name                   Severity  Enabled  Condition
high-latency-alert     2         True     avg Latency > 2000
high-error-rate-alert  1         True     total TotalErrors > 10
token-spike-alert      3         True     total TokenTransaction > 100000

Break & fix

| Scenario | Symptom | Root Cause | Fix |
|---|---|---|---|
| No logs appearing in Log Analytics | KQL queries return empty results | Diagnostic settings not enabled or recent (ingestion delay 5-15 min) | Verify diagnostic settings exist; wait for ingestion delay |
| Metric alert never fires | No alert notifications despite high latency | Wrong metric namespace or aggregation type | Verify Microsoft.CognitiveServices/accounts namespace and correct aggregation |
| "No access" error on Log Analytics query | 403 when querying workspace | Missing Log Analytics Reader role on workspace | Assign Log Analytics Reader role to the querying identity |
| Incomplete metrics data | Some metrics show gaps | Resource SKU doesn't emit all metrics | Verify S0 tier; free tier has limited metrics emission |
| Alert fires too frequently | Alert noise/fatigue | Window size too small or threshold too low | Increase window-size or adjust threshold to reduce false positives |
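
Because of the ingestion delay in the first scenario, a script that queries right after enabling diagnostic settings can retry, so that an empty result is not mistaken for a failure. The sketch below is one possible approach. `wait_for_rows` is a hypothetical helper, and `run_query` stands in for any callable that returns the rows of a KQL query.

```python
import time
from typing import Callable, Sequence


def wait_for_rows(
    run_query: Callable[[], Sequence],
    attempts: int = 6,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Sequence:
    """Call run_query until it returns rows or the attempts run out.

    Returns the first non-empty result, or an empty list if every attempt
    came back empty. The sleep function is injectable so the helper can be
    exercised without real waiting.
    """
    for attempt in range(1, attempts + 1):
        rows = run_query()
        if rows:
            return rows
        if attempt < attempts:
            sleep(delay_seconds)
    return []


# Usage with a stand-in query that returns rows on the third call.
# The row contents are illustrative only.
responses = iter([[], [], [["example-operation", 1]]])
rows = wait_for_rows(lambda: next(responses), sleep=lambda seconds: None)
print(rows)
```

With the defaults, the helper waits up to five minutes in total. The delay and attempt count can be raised to cover the 5-15 minute ingestion range noted in the table.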

Knowledge Check

1. Which Azure Monitor metric tracks the total number of tokens processed by an Azure OpenAI resource?

2. What is the typical ingestion delay for logs appearing in a Log Analytics workspace after diagnostic settings are enabled?

3. Which KQL table contains diagnostic logs from Azure Cognitive Services resources?

4. When creating a metric alert rule, what does the 'window size' parameter control?

5. Which log category must be enabled in diagnostic settings to capture API request and response details for Azure AI services?

Cleanup

az group delete --name rg-ai102-challenge07 --yes --no-wait

Learn More