Challenge 08: Cost Management for AI Services

Estimated Time

45-60 min | Cost: ~$0 (analysis only) | Domain: Plan & Manage AI Solutions (20-25%)

Exam skills covered

  • Manage costs for Microsoft Foundry Services
  • Plan capacity using pricing models (pay-per-call vs provisioned throughput)
  • Implement cost optimization strategies for AI workloads

Overview

Managing costs for Azure AI services requires understanding multiple pricing models: pay-per-call for standard deployments, token-based billing for language models, and Provisioned Throughput Units (PTU) for guaranteed capacity. Without careful planning, AI workloads can generate unexpected costs, especially with high-volume generative AI applications.

In this challenge, you'll learn to estimate token costs using the tiktoken library, query Azure Cost Management for AI spending analysis, create budget alerts to prevent overspending, and implement caching strategies to reduce redundant API calls. These skills are critical for operating AI solutions at scale within budget constraints.

Understanding the trade-offs between pay-as-you-go and PTU pricing helps architects choose the right model — PTU provides predictable costs and guaranteed throughput for sustained workloads, while pay-per-call is more economical for bursty or low-volume scenarios.
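
One rough way to tell a sustained workload from a bursty one is its load factor: average demand divided by peak demand. Provisioned capacity generally has to be sized for the peak, so a low load factor suggests that much of it would sit idle. The sketch below is illustrative only; the function name `load_factor` and the traffic samples are assumptions, not measured data or part of any Azure SDK.

```python
def load_factor(tokens_per_minute: list[float]) -> float:
    """Average demand divided by peak demand (1.0 means perfectly steady)."""
    if not tokens_per_minute:
        raise ValueError("need at least one sample")
    peak = max(tokens_per_minute)
    if peak == 0:
        return 0.0
    average = sum(tokens_per_minute) / len(tokens_per_minute)
    return average / peak


# Hypothetical traffic samples in tokens per minute
steady = [90_000, 95_000, 100_000, 92_000]
bursty = [5_000, 2_000, 100_000, 3_000]

print(f"steady load factor: {load_factor(steady):.3f}")
print(f"bursty load factor: {load_factor(bursty):.3f}")
```

Both profiles share the same peak, so they would need similar provisioned capacity, but the bursty profile uses only a fraction of it on average.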

Architecture

Cost management combines Azure Cost Management APIs, budget alerts, and application-level caching to optimize AI spending.

Challenge 08 topology

Prerequisites

  • Azure subscription with Cost Management access (Reader role minimum)
  • An Azure OpenAI resource with a deployed model (for token estimation)
  • Python with tiktoken package installed
  • Azure CLI installed

Implementation

Task 1: Estimate Token Costs with tiktoken

import tiktoken

# Encoding to use for each model family:
#   cl100k_base: GPT-4, GPT-3.5-turbo, text-embedding-ada-002
#   o200k_base:  GPT-4o, GPT-4o-mini
ENCODINGS = {
    "gpt-4o": "o200k_base",
    "gpt-4o-mini": "o200k_base",
    "gpt-35-turbo": "cl100k_base",
}

# Pricing per 1K tokens (example: GPT-4o as of 2024)
PRICING = {
    "gpt-4o": {"prompt": 0.005, "completion": 0.015},  # per 1K tokens
    "gpt-4o-mini": {"prompt": 0.00015, "completion": 0.0006},
    "gpt-35-turbo": {"prompt": 0.0005, "completion": 0.0015},
}

def count_tokens(text: str, model_encoding: str = "o200k_base") -> int:
    """Count tokens in a text string."""
    enc = tiktoken.get_encoding(model_encoding)
    return len(enc.encode(text))

def estimate_chat_cost(messages: list[dict], model: str = "gpt-4o",
                       expected_completion_tokens: int = 500) -> dict:
    """Estimate cost for a chat completion request."""
    # Count prompt tokens (simplified - actual includes message formatting overhead)
    prompt_text = ""
    for msg in messages:
        prompt_text += msg["role"] + msg["content"]

    # Use the encoding that matches the model being priced
    prompt_tokens = count_tokens(prompt_text, ENCODINGS[model])
    # Add ~4 tokens per message for formatting overhead
    prompt_tokens += len(messages) * 4

    pricing = PRICING[model]
    prompt_cost = (prompt_tokens / 1000) * pricing["prompt"]
    completion_cost = (expected_completion_tokens / 1000) * pricing["completion"]

    return {
        "model": model,
        "prompt_tokens": prompt_tokens,
        "estimated_completion_tokens": expected_completion_tokens,
        "total_tokens": prompt_tokens + expected_completion_tokens,
        "prompt_cost": prompt_cost,
        "completion_cost": completion_cost,
        "total_cost": prompt_cost + completion_cost,
    }

# Example: Estimate costs for a batch of requests
messages = [
    {"role": "system", "content": "You are a helpful assistant that summarizes documents."},
    {"role": "user", "content": "Summarize the following quarterly report in 3 bullet points: " + "x" * 2000},
]

estimate = estimate_chat_cost(messages, model="gpt-4o", expected_completion_tokens=200)
print("=== Single Request Estimate ===")
print(f" Prompt tokens: {estimate['prompt_tokens']}")
print(f" Completion tokens: {estimate['estimated_completion_tokens']}")
print(f" Cost: ${estimate['total_cost']:.6f}")

# Batch estimation
daily_requests = 10000
daily_cost = daily_requests * estimate["total_cost"]
monthly_cost = daily_cost * 30
print("\n=== Monthly Projection ===")
print(f" Daily requests: {daily_requests:,}")
print(f" Daily cost: ${daily_cost:,.2f}")
print(f" Monthly cost: ${monthly_cost:,.2f}")

# Compare PTU vs pay-as-you-go
# Both PTU figures are illustrative placeholders, not quoted prices or limits
PTU_MONTHLY_COST = 2000  # Example: 1 PTU at ~$2000/month
PTU_TOKENS_PER_MINUTE = 100000  # Approximate tokens/min per PTU
print("\n=== PTU Comparison ===")
print(f" Pay-as-you-go monthly: ${monthly_cost:,.2f}")
print(f" 1 PTU monthly: ${PTU_MONTHLY_COST:,.2f}")
print(f" PTU is cheaper: {monthly_cost > PTU_MONTHLY_COST}")
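
The PTU comparison in Task 1 only reports whether PTU is cheaper at one request volume. A break-even volume can be more useful for planning. The sketch below is a minimal illustration under the same simplifying assumptions as Task 1: a flat fixed monthly cost, a constant per-request cost, and a 30-day month. The helper name `break_even_daily_requests` is hypothetical, and the inputs are the example PTU figure from Task 1 and the per-request cost from the Expected Output section.

```python
import math


def break_even_daily_requests(fixed_monthly_cost: float,
                              cost_per_request: float,
                              days_per_month: int = 30) -> int:
    """Smallest daily request count at which pay-as-you-go spend
    reaches or exceeds the fixed monthly cost."""
    if cost_per_request <= 0:
        raise ValueError("cost_per_request must be positive")
    return math.ceil(fixed_monthly_cost / (cost_per_request * days_per_month))


# Example inputs: $2000/month fixed cost, $0.005935 per request
print(break_even_daily_requests(2000, 0.005935))  # prints 11233
```

At 10,000 requests per day the workload sits below this break-even point, which is consistent with the "PTU is cheaper: False" result.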

Task 2: Query Azure Cost Management for AI Spending

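from collections import defaultdict
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.costmanagement import CostManagementClient
from azure.mgmt.costmanagement.models import (
    ExportType,
    QueryAggregation,
    QueryComparisonExpression,
    QueryDataset,
    QueryDefinition,
    QueryFilter,
    QueryGrouping,
    QueryTimePeriod,
    TimeframeType,
)

credential = DefaultAzureCredential()
cost_client = CostManagementClient(credential)

subscription_id = "<your-subscription-id>"
scope = f"/subscriptions/{subscription_id}"

# Query AI services costs for the last 30 days
end_date = datetime.now(timezone.utc)
start_date = end_date - timedelta(days=30)

query = QueryDefinition(
    type=ExportType.ACTUAL_COST,
    timeframe=TimeframeType.CUSTOM,
    time_period=QueryTimePeriod(from_property=start_date, to=end_date),
    dataset=QueryDataset(
        granularity="Daily",
        aggregation={
            "totalCost": QueryAggregation(name="Cost", function="Sum"),
            "totalQuantity": QueryAggregation(name="UsageQuantity", function="Sum"),
        },
        grouping=[
            QueryGrouping(type="Dimension", name="ServiceName"),
            QueryGrouping(type="Dimension", name="MeterCategory"),
        ],
        # Values must match the ServiceName dimension in your billing data
        # exactly; adjust this list if the query returns nothing.
        filter=QueryFilter(
            dimensions=QueryComparisonExpression(
                name="ServiceName",
                operator="In",
                values=[
                    "Cognitive Services",
                    "Azure OpenAI Service",
                    "Azure AI Search",
                ],
            )
        ),
    ),
)

result = cost_client.query.usage(scope=scope, parameters=query)

# Daily granularity adds a date column to every row, so look columns up by
# name rather than by fixed position. This assumes columns are named after
# the aggregated fields and grouped dimensions; print result.columns to confirm.
col = {c.name: i for i, c in enumerate(result.columns)}

# Sum the daily rows into one total per (service, meter category)
totals = defaultdict(lambda: [0.0, 0.0])  # [cost, quantity]
for row in result.rows:
    key = (row[col["ServiceName"]], row[col["MeterCategory"]])
    totals[key][0] += row[col["Cost"]]
    totals[key][1] += row[col["UsageQuantity"]]

print("=== AI Services Cost Breakdown (Last 30 Days) ===")
total_cost = 0.0
for (service, meter), (cost, quantity) in totals.items():
    total_cost += cost
    if cost > 0:
        print(f" {service} ({meter}): ${cost:,.2f} ({quantity:,.0f} units)")

print(f"\n Total AI spending: ${total_cost:,.2f}")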

Task 3: Create Budget Alert for AI Spending

from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.consumption import ConsumptionManagementClient
from azure.mgmt.consumption.models import (
    Budget,
    BudgetComparisonExpression,
    BudgetFilter,
    BudgetTimePeriod,
    Notification,
)

credential = DefaultAzureCredential()
subscription_id = "<your-subscription-id>"
consumption_client = ConsumptionManagementClient(credential, subscription_id)

scope = f"/subscriptions/{subscription_id}"

# Budgets start on the first day of a month, and a start date far in the
# past may be rejected, so begin with the current month and run for a year.
start = datetime.now(timezone.utc).replace(day=1, hour=0, minute=0, second=0, microsecond=0)
end = start.replace(year=start.year + 1)

# Create a monthly budget for AI services
budget = Budget(
    category="Cost",
    amount=500,  # $500 monthly budget for AI services
    time_grain="Monthly",
    time_period=BudgetTimePeriod(start_date=start, end_date=end),
    # Service names must match Cost Management dimension values exactly
    filter=BudgetFilter(
        dimensions=BudgetComparisonExpression(
            name="ServiceName",
            operator="In",
            values=["Cognitive Services", "Azure OpenAI Service"],
        )
    ),
    notifications={
        "warning_at_80_percent": Notification(
            enabled=True,
            operator="GreaterThanOrEqualTo",
            threshold=80,
            contact_emails=["ai-team@contoso.com"],
            threshold_type="Actual",
        ),
        "critical_at_100_percent": Notification(
            enabled=True,
            operator="GreaterThanOrEqualTo",
            threshold=100,
            contact_emails=["ai-team@contoso.com", "finance@contoso.com"],
            threshold_type="Actual",
        ),
        # Forecasted alerts fire on projected spend, before the money is spent
        "forecast_at_120_percent": Notification(
            enabled=True,
            operator="GreaterThanOrEqualTo",
            threshold=120,
            contact_emails=["ai-team@contoso.com", "finance@contoso.com"],
            threshold_type="Forecasted",
        ),
    },
)

result = consumption_client.budgets.create_or_update(
    scope=scope,
    budget_name="ai-services-monthly-budget",
    parameters=budget,
)
print(f"Budget created: {result.name}")
print(f" Amount: ${result.amount:,.0f}/month")
print(" Alerts: 80% actual, 100% actual, 120% forecasted")
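
To reason about which of the three notifications would trigger for a given spend level, the threshold arithmetic can be simulated locally. This is only a sketch of the percentage comparison, assuming every rule uses the `GreaterThanOrEqualTo` operator as in the budget above; it does not call Azure and is not a description of how the service evaluates budgets internally. The function name `fired_alerts` and the spend figures are hypothetical.

```python
def fired_alerts(budget_amount: float, actual_spend: float,
                 forecast_spend: float, notifications: dict) -> list[str]:
    """Return names of notifications whose threshold (percent of budget)
    is met by actual or forecasted spend, depending on the rule type."""
    fired = []
    for name, rule in notifications.items():
        if rule["threshold_type"] == "Forecasted":
            spend = forecast_spend
        else:
            spend = actual_spend
        # Compare in currency units to avoid rounding in the percent value
        if spend >= budget_amount * rule["threshold"] / 100:
            fired.append(name)
    return fired


# Same thresholds as the budget definition above
rules = {
    "warning_at_80_percent": {"threshold": 80, "threshold_type": "Actual"},
    "critical_at_100_percent": {"threshold": 100, "threshold_type": "Actual"},
    "forecast_at_120_percent": {"threshold": 120, "threshold_type": "Forecasted"},
}

# Hypothetical mid-month position: $420 spent, $610 projected
print(fired_alerts(500, actual_spend=420, forecast_spend=610, notifications=rules))
```

With these inputs the 80% actual and 120% forecasted rules are met while the 100% actual rule is not, which shows how a forecasted alert can arrive before the budget is actually exceeded.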

Task 4: Implement Response Caching to Reduce API Calls

import hashlib
import json
from functools import lru_cache

import redis

# Strategy 1: In-memory LRU cache for identical requests
@lru_cache(maxsize=1000)
def cached_completion(prompt_hash: str, model: str, temperature: float):
    """Cache completions by prompt hash. Only works for deterministic (temp=0) requests."""
    # Placeholder: this would call the actual API and return its response
    pass

def get_prompt_hash(messages: list[dict]) -> str:
    """Generate deterministic hash for a set of messages."""
    content = json.dumps(messages, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()

# Strategy 2: Redis cache for distributed applications
class AIResponseCache:
    def __init__(self, redis_url: str, default_ttl: int = 3600):
        self.redis = redis.from_url(redis_url)
        self.default_ttl = default_ttl
        self.hits = 0
        self.misses = 0

    def get_cached_response(self, messages: list[dict], model: str) -> dict | None:
        """Check cache for existing response."""
        cache_key = self._make_key(messages, model)
        cached = self.redis.get(cache_key)
        if cached:
            self.hits += 1
            return json.loads(cached)
        self.misses += 1
        return None

    def cache_response(self, messages: list[dict], model: str,
                       response: dict, ttl: int | None = None):
        """Store response in cache."""
        cache_key = self._make_key(messages, model)
        self.redis.setex(
            cache_key,
            ttl or self.default_ttl,
            json.dumps(response),
        )

    def _make_key(self, messages: list[dict], model: str) -> str:
        content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
        return f"ai:completion:{hashlib.sha256(content.encode()).hexdigest()}"

    def get_stats(self) -> dict:
        total = self.hits + self.misses
        hit_rate = (self.hits / total * 100) if total > 0 else 0
        return {
            "hits": self.hits,
            "misses": self.misses,
            "hit_rate": f"{hit_rate:.1f}%",
            "estimated_savings": f"${self.hits * 0.01:.2f}",  # Rough estimate
        }

# Usage example
cache = AIResponseCache("redis://localhost:6379")

messages = [{"role": "user", "content": "What is the capital of France?"}]
model = "gpt-4o"

# Check cache first
cached = cache.get_cached_response(messages, model)
if cached:
    print(f"Cache HIT: {cached}")
else:
    # Call API (simulated)
    response = {"content": "The capital of France is Paris.", "tokens": 15}
    cache.cache_response(messages, model, response)
    print("Cache MISS - stored response")

print(f"\nCache stats: {cache.get_stats()}")
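
The Redis example needs a running server. To try the same hit/miss flow without one, a dictionary-backed stand-in can help. The class below is a hypothetical sketch, not part of the challenge or of the redis library: `InMemoryResponseCache` and its injectable `clock` parameter are assumptions made here so that expiry can be exercised without waiting. It mirrors the method names of `AIResponseCache` but lives in a single process only.

```python
import hashlib
import json
import time


class InMemoryResponseCache:
    """Dict-backed stand-in that mirrors AIResponseCache's method names."""

    def __init__(self, default_ttl: int = 3600, clock=time.monotonic):
        self._store = {}  # key -> (expires_at, serialized response)
        self._clock = clock
        self.default_ttl = default_ttl
        self.hits = 0
        self.misses = 0

    def _make_key(self, messages, model):
        content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()

    def get_cached_response(self, messages, model):
        key = self._make_key(messages, model)
        entry = self._store.get(key)
        if entry is not None and entry[0] > self._clock():
            self.hits += 1
            return json.loads(entry[1])
        self._store.pop(key, None)  # drop the expired entry, if there was one
        self.misses += 1
        return None

    def cache_response(self, messages, model, response, ttl=None):
        expires_at = self._clock() + (ttl or self.default_ttl)
        self._store[self._make_key(messages, model)] = (expires_at, json.dumps(response))


cache = InMemoryResponseCache()
messages = [{"role": "user", "content": "What is the capital of France?"}]
print(cache.get_cached_response(messages, "gpt-4o"))  # first lookup is a miss
cache.cache_response(messages, "gpt-4o", {"content": "The capital of France is Paris."})
print(cache.get_cached_response(messages, "gpt-4o"))  # second lookup is a hit
```

Because entries are stored per process, this variant suits local experiments and unit tests; a shared store such as Redis is still the better fit when several application instances should benefit from the same cached responses.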

Expected Output

=== Single Request Estimate ===
Prompt tokens: 587
Completion tokens: 200
Cost: $0.005935

=== Monthly Projection ===
Daily requests: 10,000
Daily cost: $59.35
Monthly cost: $1,780.50

=== PTU Comparison ===
Pay-as-you-go monthly: $1,780.50
1 PTU monthly: $2,000.00
PTU is cheaper: False

=== AI Services Cost Breakdown (Last 30 Days) ===
Azure OpenAI Service (GPT-4o): $1,245.67 (2,491,340 units)
Cognitive Services (Text Analytics): $89.50 (179,000 units)
Azure AI Search (Standard): $250.00 (1 units)

Total AI spending: $1,585.17

Budget created: ai-services-monthly-budget
Amount: $500/month
Alerts: 80% actual, 100% actual, 120% forecasted

Break & fix

Scenario | Symptom | Root Cause | Fix
--- | --- | --- | ---
Token count mismatch vs actual billing | Estimated tokens differ from usage report | Using wrong tiktoken encoding for the model | Use o200k_base for GPT-4o, cl100k_base for GPT-4/3.5
Budget alert not firing | No email when threshold exceeded | Budget filter not matching service name exactly | Verify service names match Cost Management dimension values exactly
Cache hit rate too low | Most requests bypass cache | Temperature > 0 produces different outputs for same prompt | Set temperature=0 for cacheable requests, or cache only embeddings
Cost query returns no results | Empty response from Cost Management | Data not available yet (up to 24h delay) | Cost data has 8-24h ingestion delay; query previous day's data
PTU underutilized | Paying for PTU capacity but low usage | Workload is bursty, not sustained | Switch to pay-as-you-go for bursty workloads; PTU suits steady throughput
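
For the ingestion-delay scenario in the table, one option is to end the query window at the most recent UTC midnight so that the partially ingested current day is left out. The helper below is a minimal sketch of that date arithmetic; the name `settled_query_window` and the `extra_lag_days` parameter are hypothetical, and whether this lag is enough for a given subscription is an assumption to verify against your own data.

```python
from datetime import datetime, timedelta, timezone


def settled_query_window(now: datetime, days: int = 30, extra_lag_days: int = 0):
    """Return (start, end) covering `days` full days that end at the most
    recent midnight before `now`, shifted back by `extra_lag_days`."""
    midnight_today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    end = midnight_today - timedelta(days=extra_lag_days)
    start = end - timedelta(days=days)
    return start, end


start_date, end_date = settled_query_window(datetime.now(timezone.utc))
print(f"Query from {start_date.isoformat()} to {end_date.isoformat()}")
```

The returned pair can be passed as the `from_property` and `to` values of the `QueryTimePeriod` used in Task 2.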

Knowledge Check

1. When should you choose Provisioned Throughput Units (PTU) over pay-per-call pricing for Azure OpenAI?

2. Which Python library is used to count tokens for Azure OpenAI models before sending requests?

3. What is the primary benefit of implementing response caching for Azure OpenAI API calls?

4. How long is the typical delay before Azure Cost Management data is available for querying?

5. Which budget notification threshold type alerts you BEFORE you actually exceed your budget?

Cleanup

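# No billable Azure resources are created (analysis only)
# If you created the budget in Task 3, delete it:
az consumption budget delete --budget-name ai-services-monthly-budget
# If you created a Redis cache for testing:
az group delete --name rg-ai102-challenge08 --yes --no-wait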
