Challenge 50: Performance analysis

Exam skills covered

Inspect infrastructure performance indicators (CPU, memory, disk, network)
Analyze metrics using collected telemetry (usage, application performance)
Inspect distributed tracing using Application Insights

Scenario

After the latest deployment at 14:30, users report that the Contoso web application is "slow". The support team sees a 40% increase in complaint tickets within one hour. Use Azure Monitor, Application Insights, and distributed tracing to identify the root cause, correlate it with the specific deployment, and decide whether to roll back or ship a hotfix.

Prerequisites

Azure subscription with an Application Insights resource collecting telemetry
Azure App Service or AKS cluster with deployed services
Application Insights with distributed tracing enabled across services
Log Analytics workspace with VM or container metrics
Azure CLI installed

Tasks

Task 1: Inspect the Application Insights Performance blade

// Overall performance summary: request duration distribution
requests
| where timestamp > ago(4h)
| summarize
    requestCount = count(),
    avgDuration = avg(duration),
    p50 = percentile(duration, 50),
    p90 = percentile(duration, 90),
    p95 = percentile(duration, 95),
    p99 = percentile(duration, 99),
    maxDuration = max(duration)
| project
    requestCount,
    avgDuration = round(avgDuration, 0),
    p50 = round(p50, 0),
    p90 = round(p90, 0),
    p95 = round(p95, 0),
    p99 = round(p99, 0),
    maxDuration = round(maxDuration, 0)
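
To see why the query reports percentiles next to the average, here is a small Python sketch over made-up durations. The sample values and the `nearest_rank_percentile` helper are illustrative assumptions; KQL's `percentile()` uses its own approximation and may return slightly different numbers on real data.

```python
import math

def nearest_rank_percentile(values, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Nearest-rank: the smallest value with at least pct% of the data at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request durations in milliseconds: mostly fast, a few very slow.
durations_ms = [120] * 90 + [150] * 5 + [4000] * 5

average = sum(durations_ms) / len(durations_ms)
p50 = nearest_rank_percentile(durations_ms, 50)
p95 = nearest_rank_percentile(durations_ms, 95)
p99 = nearest_rank_percentile(durations_ms, 99)

print(f"avg={average} p50={p50} p95={p95} p99={p99}")
```

In this sample the average sits well above the median while only the p99 exposes the 4-second outliers, which is why the summary query projects all of them.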

// Slowest requests (keep operation_Id to look up their dependencies later)
requests
| where timestamp > ago(2h)
| where duration > 3000  // Requests over 3 seconds
| project operation_Id, name, duration, timestamp, resultCode
| order by duration desc
| take 20

// Slowest dependencies (databases, external APIs, caches)
dependencies
| where timestamp > ago(2h)
| summarize
    avgDuration = avg(duration),
    p95Duration = percentile(duration, 95),
    callCount = count(),
    failRate = round(countif(success == false) * 100.0 / count(), 1)
    by target, type, name
| order by p95Duration desc
| take 15

// Performance comparison: before vs after deployment
let deployTime = datetime(2024-11-15T14:30:00Z);
requests
| where timestamp between ((deployTime - 2h) .. (deployTime + 2h))
| extend period = iff(timestamp < deployTime, "Before", "After")
| summarize
    avgDuration = round(avg(duration), 0),
    p95Duration = round(percentile(duration, 95), 0),
    errorRate = round(countif(success == false) * 100.0 / count(), 2),
    requestCount = count()
    by period
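
The before/after split in the query above can be sketched in Python as follows. The four request rows are invented for illustration, and `summarize` is a hypothetical helper that mirrors the `avgDuration`, `errorRate`, and `requestCount` columns; only the deployment timestamp comes from the query.

```python
from datetime import datetime, timezone

DEPLOY_TIME = datetime(2024, 11, 15, 14, 30, tzinfo=timezone.utc)

# Hypothetical request rows: (timestamp, duration in ms, success flag).
request_rows = [
    (datetime(2024, 11, 15, 14, 10, tzinfo=timezone.utc), 180, True),
    (datetime(2024, 11, 15, 14, 20, tzinfo=timezone.utc), 220, True),
    (datetime(2024, 11, 15, 14, 35, tzinfo=timezone.utc), 1400, True),
    (datetime(2024, 11, 15, 14, 50, tzinfo=timezone.utc), 1600, False),
]

def summarize(rows):
    """Average duration and error rate (%) for a non-empty list of request rows."""
    count = len(rows)
    avg_ms = sum(duration for _, duration, _ in rows) / count
    error_rate = 100.0 * sum(1 for _, _, ok in rows if not ok) / count
    return {
        "requestCount": count,
        "avgDuration": round(avg_ms),
        "errorRate": round(error_rate, 2),
    }

# Same rule as the iff(): anything strictly earlier than the deployment is "Before".
before = summarize([row for row in request_rows if row[0] < DEPLOY_TIME])
after = summarize([row for row in request_rows if row[0] >= DEPLOY_TIME])

print("Before:", before)
print("After: ", after)
```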

Using the Azure CLI to query performance:

# Get overall performance metrics
az monitor app-insights metrics show \
  --app ai-contoso-webapp \
  --resource-group rg-contoso-prod \
  --metrics "requests/duration" \
  --aggregation avg,max \
  --interval PT5M \
  --start-time "2024-11-15T12:00:00Z" \
  --end-time "2024-11-15T18:00:00Z"

# Get failed request rate
az monitor app-insights metrics show \
  --app ai-contoso-webapp \
  --resource-group rg-contoso-prod \
  --metrics "requests/failed" \
  --aggregation count \
  --interval PT5M

Task 2: Use distributed tracing to find the bottleneck service

// Find the slowest operation and trace it end-to-end
requests
| where timestamp > ago(1h)
| where duration > 5000  // Very slow requests (>5s)
| top 1 by duration desc  // take 1 would return an arbitrary row, not the slowest
| project operation_Id, name, duration, timestamp

// Now trace all dependencies in that operation
let slowOperationId = "abc123-operation-id";
union requests, dependencies, exceptions
| where operation_Id == slowOperationId
| project
    timestamp,
    itemType = itemType,
    name,
    duration,
    success,
    target = coalesce(target, ""),
    resultCode = coalesce(resultCode, ""),
    type = coalesce(type, "")
| order by timestamp asc

// Find the service causing the most latency
dependencies
| where timestamp > ago(1h)
| where duration > 2000  // Slow dependencies
| summarize
    avgDuration = avg(duration),
    slowCallCount = count(),
    failedCount = countif(success == false)
    by target, name, type
| extend impact = avgDuration * slowCallCount
| order by impact desc
| take 10
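
The `impact` column multiplies average duration by the number of slow calls, so a moderately slow dependency that is called often can outrank a very slow one that is called rarely. A minimal Python sketch of that ranking, using invented targets and numbers:

```python
# Hypothetical slow-dependency summary rows: (target, avgDuration in ms, slowCallCount).
summary_rows = [
    ("sql-contoso-orders", 2500.0, 400),
    ("api.partner.example", 9000.0, 20),
    ("redis-contoso-cache", 2100.0, 50),
]

def rank_by_impact(rows):
    """Sort rows by avgDuration * slowCallCount, largest first."""
    scored = [(target, avg_ms * calls) for target, avg_ms, calls in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)

ranking = rank_by_impact(summary_rows)
for target, impact in ranking:
    print(f"{target}: {impact:.0f} ms of accumulated slow-call time")
```

Here the SQL target ranks first even though the partner API has the highest average duration.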

// End-to-end transaction view for a specific request
let operationId = "specific-operation-id";
union
    (requests | where operation_Id == operationId | extend itemType = "request"),
    (dependencies | where operation_Id == operationId | extend itemType = "dependency"),
    (exceptions | where operation_Id == operationId | extend itemType = "exception"),
    (traces | where operation_Id == operationId | extend itemType = "trace")
| project timestamp, itemType, name, duration, success, target, resultCode
| order by timestamp asc
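
Once the rows for a single operation are on screen, the question is which dependency accounts for most of the request time. The Python sketch below shows that reasoning on invented rows. It assumes the dependency calls ran sequentially; with parallel calls the durations overlap and the subtraction no longer holds.

```python
# Hypothetical telemetry items for one operation_Id: (itemType, name, duration in ms).
items = [
    ("request", "POST /checkout", 5200),
    ("dependency", "GET /inventory", 150),
    ("dependency", "SQL: SELECT orders", 4300),
    ("dependency", "POST /payment", 600),
]

def slowest_dependency(telemetry):
    """Return the dependency item with the longest duration."""
    dependencies = [item for item in telemetry if item[0] == "dependency"]
    return max(dependencies, key=lambda item: item[2])

def unattributed_ms(telemetry):
    """Request time not covered by dependencies (assumes sequential calls)."""
    request_ms = sum(d for kind, _, d in telemetry if kind == "request")
    dependency_ms = sum(d for kind, _, d in telemetry if kind == "dependency")
    return request_ms - dependency_ms

print("Slowest dependency:", slowest_dependency(items))
print("Time spent in the service's own code (ms):", unattributed_ms(items))
```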

Navigate Application Insights transaction search in the portal:

Application Insights > Transaction search
Filter by: duration > 5000 ms, last 2 hours
Select a slow request
View the end-to-end transaction timeline showing all dependencies
Identify the dependency that consumes the most time (highlighted in the waterfall view)

Task 3: Correlate with deployment annotations

// Find the last deployment and compare metrics before/after
let lastDeployment = customEvents
| where name == "Deployment" or name == "DeploymentAnnotation"
| where timestamp > ago(24h)
| top 1 by timestamp desc
| project deployTime = timestamp;
let deployTime = toscalar(lastDeployment);
requests
| where timestamp between ((deployTime - 1h) .. (deployTime + 1h))
| extend relativeMinutes = datetime_diff('minute', timestamp, deployTime)
| extend period = iff(relativeMinutes < 0, "Before", "After")
| summarize
    avgDuration = round(avg(duration), 0),
    p95Duration = round(percentile(duration, 95), 0),
    errorRate = round(countif(success == false) * 100.0 / count(), 2),
    requestCount = count()
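    by relativeMinutes = bin(relativeMinutes, 5), period
| order by relativeMinutes asc
// The x-axis is minutes relative to the deployment, not a datetime, so use linechart
| render linechart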

// Identify which specific endpoints degraded after deployment
let deployTime = datetime(2024-11-15T14:30:00Z);
let beforePerf = requests
| where timestamp between ((deployTime - 1h) .. deployTime)
| summarize beforeAvg = avg(duration), beforeP95 = percentile(duration, 95) by name;
let afterPerf = requests
| where timestamp between (deployTime .. (deployTime + 1h))
| summarize afterAvg = avg(duration), afterP95 = percentile(duration, 95) by name;
beforePerf
| join kind=inner afterPerf on name
| extend degradationPct = round(((afterAvg - beforeAvg) / beforeAvg) * 100, 1)
| where degradationPct > 50  // Endpoints that got 50%+ slower
| order by degradationPct desc
| project name, beforeAvg = round(beforeAvg, 0), afterAvg = round(afterAvg, 0), degradationPct, beforeP95 = round(beforeP95, 0), afterP95 = round(afterP95, 0)
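
The degradation percentage in this query is a plain relative change. The Python sketch below reproduces it on invented endpoint averages and adds a guard for a zero baseline, which the KQL version does not have. Note also that the inner join drops endpoints that only received traffic after the deployment. The `degradation_pct` helper and the sample endpoints are assumptions for illustration.

```python
def degradation_pct(before_avg, after_avg):
    """Relative change in average duration, in percent; None if there is no baseline."""
    if before_avg == 0:
        return None
    return round((after_avg - before_avg) / before_avg * 100, 1)

# Hypothetical per-endpoint averages in ms: name -> (before, after).
endpoint_averages = {
    "GET /orders": (200.0, 1500.0),
    "GET /health": (20.0, 22.0),
    "POST /checkout": (400.0, 500.0),
}

degraded = {}
for name, (before_avg, after_avg) in endpoint_averages.items():
    change = degradation_pct(before_avg, after_avg)
    # Same cut-off as the query: keep endpoints that got more than 50% slower.
    if change is not None and change > 50:
        degraded[name] = change

print(degraded)
```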

Task 4: Analyze infrastructure metrics

// CPU usage over time (from VM Insights or Container Insights)
Perf
| where TimeGenerated > ago(4h)
| where Computer == "vm-contoso-orders"
| where CounterName == "% Processor Time"
| where InstanceName == "_Total"
| summarize avgCPU = avg(CounterValue), maxCPU = max(CounterValue) by bin(TimeGenerated, 5m)
| render timechart

// Memory usage trend
Perf
| where TimeGenerated > ago(4h)
| where Computer == "vm-contoso-orders"
| where CounterName == "Available MBytes"
| summarize avgAvailableMB = avg(CounterValue) by bin(TimeGenerated, 5m)
| render timechart

// Container resource consumption (for AKS)
ContainerInventory
| where TimeGenerated > ago(2h)
| where ContainerHostname contains "payment-service"
| project TimeGenerated, Computer, ContainerHostname, ContainerState

// Container CPU and memory from Container Insights
Perf
| where TimeGenerated > ago(2h)
| where ObjectName == "K8SContainer"
| where CounterName == "cpuUsageNanoCores"
| where InstanceName contains "payment-service"
| summarize avgCpuMillicores = avg(CounterValue / 1000000.0) by bin(TimeGenerated, 1m)  // nanocores -> millicores
| render timechart
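
Dividing `cpuUsageNanoCores` by 1,000,000 yields millicores, the unit Kubernetes uses for CPU requests and limits. A short Python sketch of the unit conversion follows; the 500m limit is a made-up value, and `pct_of_limit` is a hypothetical helper, not something Container Insights provides.

```python
NANOCORES_PER_MILLICORE = 1_000_000
NANOCORES_PER_CORE = 1_000_000_000

def nanocores_to_millicores(nanocores):
    return nanocores / NANOCORES_PER_MILLICORE

def nanocores_to_cores(nanocores):
    return nanocores / NANOCORES_PER_CORE

def pct_of_limit(nanocores, limit_millicores):
    """CPU usage as a percentage of a container's CPU limit (given in millicores)."""
    return round(100 * nanocores_to_millicores(nanocores) / limit_millicores, 1)

sample = 250_000_000  # hypothetical cpuUsageNanoCores reading
print(nanocores_to_millicores(sample), "millicores")
print(nanocores_to_cores(sample), "cores")
print(pct_of_limit(sample, 500), "% of an assumed 500m limit")
```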

// Disk I/O (potential bottleneck for database VMs)
Perf
| where TimeGenerated > ago(4h)
| where Computer == "vm-contoso-orders"
| where CounterName == "Disk Reads/sec" or CounterName == "Disk Writes/sec"
| summarize avgIOPS = avg(CounterValue) by bin(TimeGenerated, 5m), CounterName
| render timechart

// Network throughput
Perf
| where TimeGenerated > ago(4h)
| where Computer == "vm-contoso-orders"
| where CounterName == "Bytes Sent/sec" or CounterName == "Bytes Received/sec"
| summarize avgBytesPerSec = avg(CounterValue) by bin(TimeGenerated, 5m), CounterName
| render timechart

Using the Azure CLI for infrastructure metrics:

# Get CPU metrics for the App Service plan
# CpuPercentage is a plan-level metric, so the resource is the plan; replace <plan-name>
az monitor metrics list \
  --resource "/subscriptions/<sub-id>/resourceGroups/rg-contoso-prod/providers/Microsoft.Web/serverfarms/<plan-name>" \
  --metric "CpuPercentage" \
  --interval PT5M \
  --start-time "2024-11-15T12:00:00Z" \
  --end-time "2024-11-15T18:00:00Z" \
  --aggregation Average Maximum

# Get memory metrics
az monitor metrics list \
  --resource "/subscriptions/<sub-id>/resourceGroups/rg-contoso-prod/providers/Microsoft.Web/sites/app-contoso-web" \
  --metric "MemoryWorkingSet" \
  --interval PT5M \
  --aggregation Average,Maximum

# Get HTTP response code breakdown
az monitor metrics list \
  --resource "/subscriptions/<sub-id>/resourceGroups/rg-contoso-prod/providers/Microsoft.Web/sites/app-contoso-web" \
  --metric Http5xx Http4xx Http2xx \
  --interval PT5M \
  --aggregation Total

Task 5: Create an Application Insights workbook for deployment impact analysis

Build a parameterized workbook that compares metrics before and after any deployment:

// Workbook Query 1: Deployment selector (parameter)
customEvents
| where name == "Deployment"
| where timestamp > ago(30d)
| project deployTime = timestamp, version = tostring(customDimensions.BuildNumber)
| order by deployTime desc

// Workbook Query 2: Request volume and error rate (time chart)
// Uses {DeploymentTime} parameter
let deployTime = todatetime('{DeploymentTime}');
requests
| where timestamp between ((deployTime - 2h) .. (deployTime + 2h))
| summarize
    totalRequests = count(),
    failedRequests = countif(success == false)
    by bin(timestamp, 5m)
| extend errorRate = round((failedRequests * 100.0) / totalRequests, 2)
| project timestamp, totalRequests, errorRate
| render timechart

// Workbook Query 3: Before/After summary table
let deployTime = todatetime('{DeploymentTime}');
requests
| where timestamp between ((deployTime - 1h) .. (deployTime + 1h))
| extend period = iff(timestamp < deployTime, "1-Before", "2-After")
| summarize
    requestCount = count(),
    avgDuration = round(avg(duration), 0),
    p95Duration = round(percentile(duration, 95), 0),
    errorRate = round(countif(success == false) * 100.0 / count(), 2)
    by period

// Workbook Query 4: Most impacted endpoints
let deployTime = todatetime('{DeploymentTime}');
let before = requests | where timestamp between ((deployTime - 1h) .. deployTime)
    | summarize beforeAvg = avg(duration) by name;
let after = requests | where timestamp between (deployTime .. (deployTime + 1h))
    | summarize afterAvg = avg(duration) by name;
before | join after on name
| extend change = round(((afterAvg - beforeAvg) / beforeAvg) * 100, 1)
| project name, beforeMs = round(beforeAvg, 0), afterMs = round(afterAvg, 0), changePct = change
| order by changePct desc
| take 10

// Workbook Query 5: Infrastructure metrics during deployment
let deployTime = todatetime('{DeploymentTime}');
Perf
| where TimeGenerated between ((deployTime - 1h) .. (deployTime + 1h))
| where CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize avgCPU = avg(CounterValue) by bin(TimeGenerated, 5m)
| render timechart

Task 6: Configure smart detection alerts

Smart detection automatically finds anomalies in application performance:

# Smart detection is enabled by default in Application Insights
# Confirm the Application Insights resource exists; the smart detection
# rules themselves are reviewed in the portal (see below)
az monitor app-insights component show \
  --app ai-contoso-webapp \
  --resource-group rg-contoso-prod

# Configure smart detection notification recipients
# Azure Portal > Application Insights > Smart Detection > Settings
# - Failure Anomalies: enabled (sends to subscription owners by default)
# - Slow page load time: enabled
# - Slow server response time: enabled
# - Long dependency duration: enabled

# Create a custom metric alert for specific degradation patterns
az monitor metrics alert create \
  --name "alert-response-time-degradation" \
  --resource-group rg-contoso-prod \
  --scopes "/subscriptions/<sub-id>/resourceGroups/rg-contoso-prod/providers/microsoft.insights/components/ai-contoso-webapp" \
  --condition "avg requests/duration > 3000" \
  --window-size 10m \
  --evaluation-frequency 5m \
  --action "/subscriptions/<sub-id>/resourceGroups/rg-contoso-prod/providers/microsoft.insights/actionGroups/ag-sre-team" \
  --description "Average response time exceeds 3 seconds" \
  --severity 2

# Dynamic threshold alert (automatically learns baseline)
az monitor metrics alert create \
  --name "alert-dynamic-response-time" \
  --resource-group rg-contoso-prod \
  --scopes "/subscriptions/<sub-id>/resourceGroups/rg-contoso-prod/providers/microsoft.insights/components/ai-contoso-webapp" \
  --condition "avg requests/duration > dynamic medium 3 of 5" \
  --window-size 5m \
  --evaluation-frequency 5m \
  --action "/subscriptions/<sub-id>/resourceGroups/rg-contoso-prod/providers/microsoft.insights/actionGroups/ag-sre-team" \
  --description "Response time anomaly detected (dynamic threshold)" \
  --severity 3
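
The `3 of 5` part of the dynamic condition means the alert fires only when at least 3 of the last 5 evaluations violate the threshold. The Python sketch below illustrates just that counting rule. It is not Azure's baseline-learning algorithm: the fixed `upper_bound` stands in for the learned threshold, and `should_fire` is a hypothetical helper.

```python
def should_fire(values, upper_bound, violations_needed=3, evaluations=5):
    """True when enough of the most recent evaluations exceed the bound."""
    recent = values[-evaluations:]
    violations = sum(1 for value in recent if value > upper_bound)
    return violations >= violations_needed

# Hypothetical average response times (ms), one value per evaluation period.
sustained = [900, 3200, 950, 3100, 3300]   # 3 of the last 5 above 3000 ms
transient = [900, 3200, 950, 980, 3300]    # only 2 of the last 5 above 3000 ms

print(should_fire(sustained, upper_bound=3000))
print(should_fire(transient, upper_bound=3000))
```

Requiring several violations trades a slower reaction for fewer alerts on short spikes.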

Task 7: Implement SLI/SLO tracking

Define Service Level Indicators and track them with KQL:

// SLI: Availability (percentage of successful requests)
let sloTarget = 99.9;
requests
| where timestamp > ago(30d)
| summarize
    totalRequests = count(),
    successfulRequests = countif(success == true and resultCode !startswith "5")
| extend
    availability = round((successfulRequests * 100.0) / totalRequests, 3),
    sloTarget = sloTarget,
    errorBudgetTotal = round(totalRequests * (1 - sloTarget / 100), 0),
    errorBudgetUsed = totalRequests - successfulRequests
| extend
    errorBudgetRemaining = errorBudgetTotal - errorBudgetUsed,
    errorBudgetPct = round((errorBudgetUsed * 100.0) / errorBudgetTotal, 1)
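
The error budget arithmetic in this query can be checked by hand. The Python sketch below works through it for an invented month of 1,000,000 requests with 400 failures. The `error_budget` helper is hypothetical, and `errorBudgetUsedPct` corresponds to the query's `errorBudgetPct` column.

```python
def error_budget(total_requests, failed_requests, slo_target_pct):
    """Availability and error budget figures for one SLO window."""
    # Failures the SLO tolerates: 0.1% of requests for a 99.9% target.
    allowed_failures = round(total_requests * (1 - slo_target_pct / 100))
    availability = round((total_requests - failed_requests) * 100.0 / total_requests, 3)
    used_pct = round(failed_requests * 100.0 / allowed_failures, 1)
    return {
        "availability": availability,
        "errorBudgetTotal": allowed_failures,
        "errorBudgetRemaining": allowed_failures - failed_requests,
        "errorBudgetUsedPct": used_pct,
    }

budget = error_budget(total_requests=1_000_000, failed_requests=400, slo_target_pct=99.9)
print(budget)
```

With very low traffic the allowed failure count can round to zero, in which case the percentage is undefined and the window is too small to evaluate the SLO.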

// SLI: Latency (percentage of requests under threshold)
let latencyTarget = 99.0;  // 99% of requests under 1 second
let latencyThreshold = 1000;  // milliseconds
requests
| where timestamp > ago(30d)
| summarize
    totalRequests = count(),
    fastRequests = countif(duration < latencyThreshold)
| extend
    latencyCompliance = round((fastRequests * 100.0) / totalRequests, 2),
    sloTarget = latencyTarget,
    withinBudget = (fastRequests * 100.0 / totalRequests) >= latencyTarget

// SLI tracking over time (daily burn rate)
requests
| where timestamp > ago(30d)
| summarize
    totalReq = count(),
    failedReq = countif(success == false or resultCode startswith "5")
    by bin(timestamp, 1d)
| extend
    dailyErrorRate = round((failedReq * 100.0) / totalReq, 3),
    dailyAvailability = round(((totalReq - failedReq) * 100.0) / totalReq, 3)
| render timechart

// Error budget burn rate (are we burning budget too fast?)
let sloTarget = 99.9;
let windowDays = 30;
requests
| where timestamp > ago(windowDays * 1d)
| summarize
    totalReq = count(),
    failedReq = countif(success == false)
    by bin(timestamp, 1d)
| extend
    dailyErrorBudget = totalReq * (1 - sloTarget / 100.0),
    dailyBudgetUsed = failedReq,
    burnRate = round(failedReq / (totalReq * (1 - sloTarget / 100.0)), 2)
| extend isBurningFast = burnRate > 1.0
| render timechart

Create an SLO dashboard query for the workbook:

// Multi-window burn rate alert (Google SRE book pattern)
// Fast burn: 14.4x budget consumption in 1 hour
// Slow burn: 6x budget consumption in 6 hours
let sloTarget = 99.9;
let monthlyBudget = 43.2;  // minutes of downtime per month for 99.9%
let fastWindow = 1h;
let slowWindow = 6h;
let fastBurn = requests
| where timestamp > ago(fastWindow)
| summarize errorRate = countif(success == false) * 100.0 / count()
| extend burnRate = errorRate / (100 - sloTarget);
let slowBurn = requests
| where timestamp > ago(slowWindow)
| summarize errorRate = countif(success == false) * 100.0 / count()
| extend burnRate = errorRate / (100 - sloTarget);
union
    (fastBurn | extend window = "1h", threshold = 14.4),
    (slowBurn | extend window = "6h", threshold = 6.0)
| extend alert = burnRate > threshold
| project window, errorRate = round(errorRate, 3), burnRate = round(burnRate, 2), threshold, alert
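
The 14.4 and 6.0 thresholds follow from how much of a 30-day budget each window is allowed to consume. The Python sketch below works through that arithmetic. The helper names are hypothetical, and the 30-day SLO window (720 hours) is taken from the earlier queries.

```python
SLO_WINDOW_HOURS = 30 * 24  # 720 hours in a 30-day SLO window

def burn_rate(error_rate_pct, slo_target_pct):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate_pct / (100 - slo_target_pct)

def budget_consumed_pct(rate, window_hours):
    """Share of the whole SLO-window budget spent if `rate` holds for `window_hours`."""
    return rate * window_hours / SLO_WINDOW_HOURS * 100

def hours_to_exhaustion(rate):
    """Hours until the full budget is gone at a sustained burn rate."""
    return SLO_WINDOW_HOURS / rate

print(round(burn_rate(1.44, 99.9), 2))          # a 1.44% error rate against a 99.9% SLO
print(round(budget_consumed_pct(14.4, 1), 2))   # fast-burn window
print(round(budget_consumed_pct(6.0, 6), 2))    # slow-burn window
print(round(hours_to_exhaustion(14.4), 1))
```

A 14.4x burn for one hour spends 2% of the monthly budget and a 6x burn for six hours spends 5%, which is why the two windows use different thresholds.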

Break-and-fix exercises

Break scenario 1: Distributed tracing shows missing spans

The end-to-end transaction view shows a gap where the web frontend calls the payment service, and the payment service's request appears as a separate trace with no correlation.

Cause: The payment service is not propagating the W3C Trace Context headers. The traceparent header from the incoming request is not being forwarded to downstream calls.

Diagnosis:

// Check if operation_Id matches between services
requests
| where timestamp > ago(1h)
| where cloud_RoleName == "payment-service"
| summarize distinctOperations = dcount(operation_Id)
| project distinctOperations

// Compare with dependencies from the calling service
dependencies
| where timestamp > ago(1h)
| where target contains "payment"
| project operation_Id, name, duration
| join kind=leftanti (
    requests
    | where cloud_RoleName == "payment-service"
    | project operation_Id
) on operation_Id
| count  // Number of unmatched traces

Fix: Make sure the payment service's SDK is configured for W3C Trace Context propagation. For Node.js:

// Ensure Application Insights is initialized BEFORE other imports
const appInsights = require('applicationinsights');
appInsights.setup(process.env.APPLICATIONINSIGHTS_CONNECTION_STRING)
  .setDistributedTracingMode(appInsights.DistributedTracingModes.AI_AND_W3C)
  .start();
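
For reference, the `traceparent` header that must survive each hop has four dash-separated lowercase hex fields: version, trace-id (32 characters), parent-id (16 characters), and flags. The Python sketch below parses one and builds the header a service would send downstream. The helper names and the new span ID are illustrative; in practice the SDK does this for you.

```python
import re

TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header):
    """Split a traceparent header into its fields, or return None if malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are not valid trace or span identifiers.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"version": version, "trace_id": trace_id, "parent_id": parent_id, "flags": flags}

def child_traceparent(incoming_header, new_span_id):
    """Header for a downstream call: same trace-id, this service's span as parent-id."""
    context = parse_traceparent(incoming_header)
    if context is None:
        return None
    return f"{context['version']}-{context['trace_id']}-{new_span_id}-{context['flags']}"

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming, new_span_id="b7ad6b7169203331")
print(outgoing)
```

If a service drops the header or starts a fresh trace-id, its telemetry can no longer be joined to the caller's, which matches the symptom in this scenario.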

Break scenario 2: CPU spike identified, but the responsible process cannot be determined

Infrastructure metrics show a CPU spike on the VM at deployment time, but it is unclear which application process is responsible.

Diagnosis:

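// Per-process CPU from the Perf table.
// Assumes the "Process" performance counters are being collected for this VM
// (VMProcess holds process inventory, not CPU usage).
Perf
| where TimeGenerated > ago(2h)
| where Computer == "vm-contoso-orders"
| where ObjectName == "Process" and CounterName == "% Processor Time"
| where InstanceName !in ("_Total", "Idle")
| summarize avgCPU = avg(CounterValue) by ProcessName = InstanceName, bin(TimeGenerated, 5m)
| where avgCPU > 10
| render timechart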

Fix: Once the process is identified, correlate it with the deployment to determine whether the new code version introduced a CPU regression. Check for missing database indexes, inefficient loops, or configuration changes.

Knowledge check

1. After a deployment, the Application Insights Performance blade shows that the average response time increased from 200 ms to 1500 ms. The dependency list shows that SQL database calls went from an average of 50 ms to 1200 ms. What should you investigate first?

2. A distributed trace in Application Insights shows: Frontend (50ms) -> API Gateway (30ms) -> Order Service (4500ms) -> Payment Service (timeout). Which service should the SRE team investigate?

3. Contoso's SLO is 99.9% availability over a 30-day window. After 15 days, the team has consumed 80% of the error budget. What action should the SRE team take?

4. Which Application Insights feature automatically detects performance anomalies without requiring manual alert rule configuration?

Cleanup

# Delete alert rules
az monitor metrics alert delete --name "alert-response-time-degradation" --resource-group rg-contoso-prod
az monitor metrics alert delete --name "alert-dynamic-response-time" --resource-group rg-contoso-prod

# Delete workbooks (via Azure Portal > Application Insights > Workbooks > delete)

# No other infrastructure to clean up - this challenge uses existing monitoring resources
