
Challenge 16: Azure OpenAI: Provisioning and Configuration

Estimated Time

45-60 min | Cost: ~$1.00 (estimated) | Domain: Generative AI Solutions (15-20%)

Exam skills covered

  • Provision an Azure OpenAI resource
  • Select and deploy an Azure OpenAI model
  • Configure rate limits and manage deployment types

Overview

Azure OpenAI Service provides REST API access to OpenAI's language models, including GPT-4o and GPT-4o-mini, as well as embedding models. Provisioning requires selecting the SKU (S0 for standard consumption) and understanding the available deployment types: Standard (shared infrastructure, pay-per-token), Global Standard (optimized routing across regions), and Provisioned, which reserves guaranteed capacity measured in Provisioned Throughput Units (PTU).

Each deployment is subject to rate limits measured in Tokens Per Minute (TPM) and Requests Per Minute (RPM). When limits are exceeded, the service returns HTTP 429 responses with Retry-After headers. Production applications must implement retry strategies with exponential backoff to handle throttling gracefully.
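The retry strategy described above can be reduced to a small delay calculation. The sketch below is illustrative only: the function name `backoff_delay` and its parameters (`max_delay`, `jitter`) are assumptions, not part of any SDK. It prefers a server-supplied Retry-After value and otherwise doubles the wait on each attempt, up to a cap.

```python
import random
from typing import Optional


def backoff_delay(attempt: int,
                  retry_after: Optional[float] = None,
                  base_delay: float = 1.0,
                  max_delay: float = 60.0,
                  jitter: float = 0.0) -> float:
    """Return the seconds to wait before retry number `attempt` (0-based).

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    if retry_after is not None and retry_after > 0:
        # The server's Retry-After hint takes priority over the local schedule.
        return retry_after
    # The delay doubles on every attempt (base, 2x, 4x, ...) up to max_delay.
    delay = min(base_delay * (2 ** attempt), max_delay)
    # Optional random jitter spreads out clients that were throttled together.
    return delay + random.uniform(0, jitter * delay)


print([backoff_delay(a) for a in range(5)])  # [1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_delay(3, retry_after=2.0))     # 2.0
```

Setting `jitter` above zero is a common choice when many clients share one deployment, because it keeps them from retrying in lockstep.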

API versions use the format YYYY-MM-DD, with a -preview suffix for pre-GA features (e.g., 2024-08-01-preview). Applications should target stable API versions (e.g., 2024-10-21) and plan for version retirement, which is announced at least 90 days in advance.
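To make the version format concrete, here is a minimal sketch that parses an API version string and picks the newest stable one from a list. The helper names (`parse_api_version`, `latest_stable`) and the candidate list are assumptions for illustration; the service does not provide them.

```python
import re
from datetime import date
from typing import Iterable, NamedTuple


class ApiVersion(NamedTuple):
    released: date
    is_preview: bool


# YYYY-MM-DD, optionally followed by "-preview".
_PATTERN = re.compile(r"^(\d{4})-(\d{2})-(\d{2})(-preview)?$")


def parse_api_version(value: str) -> ApiVersion:
    """Split an API version string into its date and preview flag."""
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"Unrecognized API version: {value!r}")
    year, month, day, preview = match.groups()
    return ApiVersion(date(int(year), int(month), int(day)), preview is not None)


def latest_stable(versions: Iterable[str]) -> str:
    """Return the newest non-preview version from the given strings."""
    stable = [v for v in versions if not parse_api_version(v).is_preview]
    return max(stable, key=lambda v: parse_api_version(v).released)


# Illustrative candidate list, not an authoritative catalogue of versions.
print(latest_stable(["2024-06-01", "2024-10-21", "2025-01-01-preview"]))  # 2024-10-21
```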

Architecture

This challenge provisions an Azure OpenAI resource, deploys models with specific capacity configurations, and tests rate-limiting behavior and retry strategies.

[Figure: Challenge 16 topology]

Prerequisites

  • Azure subscription with Azure OpenAI access approved
  • Azure CLI 2.60+ installed
  • Python 3.9+ with openai and azure-identity packages
  • .NET 8 SDK with Azure.AI.OpenAI NuGet package

Implementation

Task 1: Provision Azure OpenAI Resource

Create an Azure OpenAI resource with the S0 SKU in a supported region.
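The provisioning step itself runs through the Azure CLI. As a rough sketch, the helper below assembles the `az cognitiveservices account create` command as an argument list, so it can be inspected before running. The account name `aoai-challenge16` and the function name are hypothetical; the resource group matches the one used in Cleanup and is assumed to exist already. The `--custom-domain` flag is included on the assumption that Microsoft Entra ID authentication will be used, which needs a custom subdomain.

```python
import shlex
from typing import List


def build_create_account_command(name: str,
                                 resource_group: str,
                                 location: str = "eastus",
                                 sku: str = "S0") -> List[str]:
    """Assemble the Azure CLI call that creates an Azure OpenAI resource."""
    return [
        "az", "cognitiveservices", "account", "create",
        "--name", name,
        "--resource-group", resource_group,
        "--kind", "OpenAI",        # selects Azure OpenAI among Cognitive Services kinds
        "--sku", sku,              # S0 is the standard consumption SKU
        "--location", location,
        "--custom-domain", name,   # assumption: reuse the account name as subdomain
    ]


# "aoai-challenge16" is a placeholder account name.
command = build_create_account_command("aoai-challenge16", "rg-ai102-challenge16")
print(shlex.join(command))
# To execute for real: subprocess.run(command, check=True)
```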

# Provisioning is done via Azure CLI or ARM; this script uses the resource with the Python SDK
import os
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Option 1: API Key authentication
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

# Option 2: Microsoft Entra ID authentication (recommended)
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21"
)

# Verify connectivity
response = client.chat.completions.create(
    model="gpt-4o",  # This is the deployment name
    messages=[{"role": "user", "content": "Hello, confirm connection."}],
    max_tokens=10
)
print(f"Connected successfully: {response.choices[0].message.content}")

Task 2: Deploy GPT-4o with Specific Capacity

Deploy a GPT-4o model with Standard deployment type and configure TPM capacity.
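As a hedged illustration of the deployment step, the sketch below builds the `az cognitiveservices account deployment create` command and converts capacity units into TPM. Standard deployment capacity is expressed in units of 1,000 TPM. The account name, the model version `2024-08-06`, and the capacity of 10 units are assumptions chosen for the example; check the model availability matrix for your region before using them.

```python
from typing import List


def capacity_to_tpm(capacity_units: int) -> int:
    """Standard deployment capacity is expressed in units of 1,000 TPM."""
    return capacity_units * 1000


def build_deployment_command(account: str,
                             resource_group: str,
                             deployment_name: str = "gpt-4o",
                             model_name: str = "gpt-4o",
                             model_version: str = "2024-08-06",  # assumed version
                             sku_name: str = "Standard",
                             capacity_units: int = 10) -> List[str]:
    """Assemble the Azure CLI call that deploys a model to the resource."""
    return [
        "az", "cognitiveservices", "account", "deployment", "create",
        "--name", account,
        "--resource-group", resource_group,
        "--deployment-name", deployment_name,
        "--model-name", model_name,
        "--model-version", model_version,
        "--model-format", "OpenAI",
        "--sku-name", sku_name,                  # deployment type
        "--sku-capacity", str(capacity_units),   # thousands of TPM
    ]


# "aoai-challenge16" is a placeholder account name.
deploy = build_deployment_command("aoai-challenge16", "rg-ai102-challenge16")
print(" ".join(deploy))
print(f"Capacity: {capacity_to_tpm(10)} TPM")  # Capacity: 10000 TPM
```

Passing `sku_name="GlobalStandard"` would request the Global Standard deployment type instead.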

# Model deployment is managed via Azure CLI or REST management API
# After deployment, test with the Python SDK
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

# Test the deployed model
response = client.chat.completions.create(
    model="gpt-4o",  # deployment name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Azure OpenAI deployment types in one sentence."}
    ],
    max_tokens=100
)

print(f"Response: {response.choices[0].message.content}")
print(f"Tokens used - Prompt: {response.usage.prompt_tokens}, "
      f"Completion: {response.usage.completion_tokens}")

Task 3: Test Rate Limits and Implement Exponential Backoff

Send requests to observe rate limiting behavior and implement proper retry logic.

import os
import time
from openai import AzureOpenAI, RateLimitError

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21",
    max_retries=0  # disable the SDK's built-in retries so the backoff below is observable
)

def call_with_exponential_backoff(messages, max_retries=5, base_delay=1.0):
    """Call the deployment, retrying with exponential backoff on HTTP 429."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",  # deployment name
                messages=messages,
                max_tokens=50
            )
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Prefer the Retry-After header (in seconds) when the service sends it;
            # otherwise fall back to exponential backoff.
            retry_after = e.response.headers.get("retry-after")
            delay = float(retry_after) if retry_after else base_delay * (2 ** attempt)
            print(f"Rate limited. Retrying in {delay:.1f}s (attempt {attempt + 1})")
            time.sleep(delay)

# Simulate high-volume requests to trigger rate limiting
results = []
for i in range(20):
    try:
        response = call_with_exponential_backoff(
            [{"role": "user", "content": f"Say the number {i}"}]
        )
        results.append(response.choices[0].message.content)
        print(f"Request {i}: Success")
    except RateLimitError:
        print(f"Request {i}: Exhausted retries")

print(f"\nCompleted {len(results)}/20 requests")

Task 4: Compare Standard vs Global Standard Deployments

import os
import time
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

def measure_latency(deployment_name, num_requests=5):
    """Measure average, minimum, and maximum latency for a deployment."""
    latencies = []
    for _ in range(num_requests):
        # perf_counter is a monotonic clock suited to measuring elapsed time
        start = time.perf_counter()
        client.chat.completions.create(
            model=deployment_name,
            messages=[{"role": "user", "content": "Respond with OK."}],
            max_tokens=5
        )
        latencies.append(time.perf_counter() - start)
    return {
        "deployment": deployment_name,
        "avg_latency_ms": sum(latencies) / len(latencies) * 1000,
        "min_latency_ms": min(latencies) * 1000,
        "max_latency_ms": max(latencies) * 1000
    }

# Compare Standard vs Global Standard deployments.
# Note: the two deployments below also use different models, so the latency
# difference reflects both the model and the deployment type.
standard_results = measure_latency("gpt-4o")  # Standard deployment
global_results = measure_latency("gpt-4o-mini")  # Global Standard deployment

print("Standard Deployment:")
print(f" Avg: {standard_results['avg_latency_ms']:.0f}ms | "
      f"Min: {standard_results['min_latency_ms']:.0f}ms | "
      f"Max: {standard_results['max_latency_ms']:.0f}ms")

print("\nGlobal Standard Deployment:")
print(f" Avg: {global_results['avg_latency_ms']:.0f}ms | "
      f"Min: {global_results['min_latency_ms']:.0f}ms | "
      f"Max: {global_results['max_latency_ms']:.0f}ms")

Expected Output

Connected successfully: Hello! Connection confirmed.
Response: Standard uses shared compute with pay-per-token, Global Standard optimizes
routing across regions, and Provisioned (PTU) guarantees dedicated throughput capacity.
Tokens used - Prompt: 22, Completion: 31

Rate limited. Retrying in 1.0s (attempt 1)
Request 0: Success
...
Completed 18/20 requests

Standard Deployment:
 Avg: 450ms | Min: 320ms | Max: 680ms

Global Standard Deployment:
 Avg: 380ms | Min: 280ms | Max: 520ms

Break & fix

| Scenario | Symptom | Root Cause | Fix |
| --- | --- | --- | --- |
| Resource creation fails | InvalidApiProperties error | Region doesn't support Azure OpenAI | Use supported region (eastus, eastus2, westus, etc.) |
| Deployment fails | ModelNotAvailable | Model not available in selected region | Check model availability matrix or change region |
| API returns 401 | Access denied due to invalid subscription key | Wrong key or endpoint mismatch | Verify key matches resource; check endpoint URL |
| API returns 429 | Rate limit is exceeded | Exceeded TPM or RPM limit | Implement exponential backoff; increase capacity |
| API returns 404 | Resource not found | Wrong deployment name in request | Verify deployment name matches exactly |
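The API rows of the table suggest a simple rule: only throttling (429) is worth retrying, while 401 and 404 point at configuration that a retry will not fix. The lookup below is a minimal sketch of that rule. The `triage` function and the action labels are illustrative assumptions, and the hints paraphrase the table.

```python
from typing import Dict, Tuple

# Hints paraphrase the troubleshooting table; action labels are illustrative.
TRIAGE: Dict[int, Tuple[str, str]] = {
    401: ("fix-config", "Verify the key matches the resource and check the endpoint URL"),
    404: ("fix-config", "Verify the deployment name matches exactly"),
    429: ("retry", "Back off and retry; increase capacity if throttling persists"),
}


def triage(status_code: int) -> Tuple[str, str]:
    """Map an HTTP status code to a suggested action and a hint."""
    return TRIAGE.get(status_code, ("investigate", "Not covered by the table"))


for code in (401, 404, 429, 500):
    action, hint = triage(code)
    print(f"{code}: {action} - {hint}")
```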

Knowledge Check

1. Which SKU is required when creating an Azure OpenAI resource via Azure CLI?

2. What deployment type provides guaranteed throughput capacity with a fixed monthly cost?

3. When Azure OpenAI returns HTTP 429, which header indicates how long to wait before retrying?

4. What is the capacity unit for Standard deployments when configuring rate limits?

5. Which API version format does Azure OpenAI use, and what happens when a version is retired?

Cleanup

az group delete --name rg-ai102-challenge16 --yes --no-wait
