Challenge 16: Azure OpenAI: Provisioning and Configuration

Estimated Time

45-60 min | Cost: ~$1.00 (estimated) | Domain: Generative AI Solutions (15-20%)

Exam skills covered

Provision an Azure OpenAI resource
Select and deploy an Azure OpenAI model
Configure rate limits and manage deployment types

Overview

Azure OpenAI Service provides REST API access to OpenAI language models, including GPT-4o, GPT-4o-mini, and embedding models. Provisioning requires selecting a SKU (S0 for standard consumption) and choosing among the available deployment types: Standard (shared infrastructure, pay-per-token), Global Standard (routing across regions), and Provisioned Throughput Units (PTU) for guaranteed capacity.

Each deployment is subject to rate limits measured in Tokens Per Minute (TPM) and Requests Per Minute (RPM). When limits are exceeded, the service returns HTTP 429 responses with Retry-After headers. Production applications must implement retry strategies with exponential backoff to handle throttling gracefully.
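
The retry behavior described above can be sketched as a small delay calculator. This is a minimal illustration, not the algorithm used by the service or by any SDK: the function name `backoff_delay`, the 60-second cap, and the optional jitter are assumptions introduced for the example.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base_delay: float = 1.0,
    max_delay: float = 60.0,  # hypothetical cap, not a service value
    jitter: bool = False,
) -> float:
    """Return the seconds to wait before retry number `attempt` (0-based).

    A server-supplied Retry-After value takes priority; otherwise the delay
    doubles on each attempt and is capped at max_delay.
    """
    if retry_after is not None and retry_after > 0:
        return float(retry_after)
    delay = min(base_delay * (2 ** attempt), max_delay)
    if jitter:
        # "Full jitter": pick a random point in [0, delay] so that many
        # clients throttled at the same moment do not retry in lockstep.
        delay = random.uniform(0, delay)
    return delay


schedule = [backoff_delay(a) for a in range(5)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Capping the delay keeps a long retry chain from stalling a caller indefinitely, and jitter is usually worth enabling when several workers share one deployment.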

API versions follow the format YYYY-MM-DD, with a -preview suffix for pre-GA features. Applications should target stable API versions (e.g., 2024-10-21) and plan for version retirement, which is announced at least 90 days in advance.
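
One way to keep preview versions out of production configuration is to validate the api-version string at startup. The sketch below assumes the preview suffix is the literal text `-preview`; the helper name `parse_api_version` is hypothetical and not part of any SDK.

```python
import re
from datetime import date

# YYYY-MM-DD, optionally followed by "-preview"
_API_VERSION = re.compile(r"^(\d{4})-(\d{2})-(\d{2})(-preview)?$")


def parse_api_version(value: str) -> tuple[date, bool]:
    """Split an api-version string into (release date, is_preview)."""
    match = _API_VERSION.match(value)
    if match is None:
        raise ValueError(f"Unrecognized api-version: {value!r}")
    year, month, day, preview = match.groups()
    return date(int(year), int(month), int(day)), preview is not None


released, is_preview = parse_api_version("2024-10-21")
print(released.isoformat(), is_preview)  # 2024-10-21 False
```

A deployment script could call this once and refuse to start when `is_preview` is true.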

Architecture

This challenge provisions an Azure OpenAI resource, deploys models with specific capacity configurations, and tests rate-limiting behavior and retry strategies.

Figure: Challenge 16 topology

Prerequisites

Azure subscription with Azure OpenAI access approved
Azure CLI 2.60+ installed
Python 3.9+ with openai and azure-identity packages
.NET 8 SDK with Azure.AI.OpenAI NuGet package

Implementation

Task 1: Provision Azure OpenAI Resource

Create an Azure OpenAI resource with the S0 SKU in a supported region.

Python SDK
C# SDK
REST API

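# Provisioning is done via Azure CLI or ARM; this snippet connects to the provisioned resource with the Python SDK.
# Options 1 and 2 are alternatives: use only one of them (as written, Option 2 replaces the Option 1 client).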
import os
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Option 1: API Key authentication
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

# Option 2: Microsoft Entra ID authentication (recommended)
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21"
)

# Verify connectivity
response = client.chat.completions.create(
    model="gpt-4o",  # This is the deployment name
    messages=[{"role": "user", "content": "Hello, confirm connection."}],
    max_tokens=10
)
print(f"Connected successfully: {response.choices[0].message.content}")

using Azure;
using Azure.AI.OpenAI;
using Azure.Identity;
using OpenAI.Chat;

// Option 1: API Key authentication
string endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!;
string apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!;

AzureOpenAIClient azureClient = new(
    new Uri(endpoint),
    new AzureKeyCredential(apiKey));

// Option 2: Microsoft Entra ID authentication (recommended)
AzureOpenAIClient azureClientEntra = new(
    new Uri(endpoint),
    new DefaultAzureCredential());

// Get a ChatClient for a specific deployment
ChatClient chatClient = azureClient.GetChatClient("gpt-4o");

// Verify connectivity
ChatCompletion completion = await chatClient.CompleteChatAsync(
    new ChatMessage[] { new UserChatMessage("Hello, confirm connection.") },
    new ChatCompletionOptions { MaxOutputTokenCount = 10 });

Console.WriteLine($"Connected successfully: {completion.Content[0].Text}");

# Create resource group
az group create --name rg-ai102-challenge16 --location eastus2

# Create Azure OpenAI resource (S0 SKU)
az cognitiveservices account create \
  --name aoai-challenge16 \
  --resource-group rg-ai102-challenge16 \
  --location eastus2 \
  --kind OpenAI \
  --sku S0 \
  --custom-domain aoai-challenge16

# Get the endpoint and keys
az cognitiveservices account show \
  --name aoai-challenge16 \
  --resource-group rg-ai102-challenge16 \
  --query properties.endpoint -o tsv

az cognitiveservices account keys list \
  --name aoai-challenge16 \
  --resource-group rg-ai102-challenge16

# Verify with a direct REST call
curl -X POST "https://aoai-challenge16.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
  -H "Content-Type: application/json" \
  -H "api-key: ${AZURE_OPENAI_KEY}" \
  -d '{
    "messages": [{"role": "user", "content": "Hello, confirm connection."}],
    "max_tokens": 10
  }'
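
The REST call above targets a URL built from three parts: the resource's custom subdomain, the deployment name, and the API version. The following sketch assembles it in Python. It assumes the custom domain equals the resource name, as in the CLI command above, and the helper name `chat_completions_url` is hypothetical.

```python
from urllib.parse import quote, urlencode


def chat_completions_url(resource_name: str, deployment: str, api_version: str) -> str:
    """Build the chat completions URL for one deployment.

    The path segment is the deployment name chosen at deployment time,
    which is not necessarily the same as the model name.
    """
    base = f"https://{resource_name}.openai.azure.com"
    path = f"/openai/deployments/{quote(deployment, safe='')}/chat/completions"
    return f"{base}{path}?{urlencode({'api-version': api_version})}"


url = chat_completions_url("aoai-challenge16", "gpt-4o", "2024-10-21")
print(url)
```

A 404 from this endpoint most often means the deployment segment does not match an existing deployment name.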

Task 2: Deploy GPT-4o with Specific Capacity

Deploy a GPT-4o model with Standard deployment type and configure TPM capacity.

Python SDK
C# SDK
REST API

# Model deployment is managed via Azure CLI or REST management API
# After deployment, test with the Python SDK
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

# Test the deployed model
response = client.chat.completions.create(
    model="gpt-4o",  # deployment name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Azure OpenAI deployment types in one sentence."}
    ],
    max_tokens=100
)

print(f"Response: {response.choices[0].message.content}")
print(f"Tokens used - Prompt: {response.usage.prompt_tokens}, "
      f"Completion: {response.usage.completion_tokens}")

using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;

string endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!;
string apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!;

AzureOpenAIClient azureClient = new(
    new Uri(endpoint),
    new AzureKeyCredential(apiKey));

ChatClient chatClient = azureClient.GetChatClient("gpt-4o");

ChatCompletion completion = await chatClient.CompleteChatAsync(
    new ChatMessage[]
    {
        new SystemChatMessage("You are a helpful assistant."),
        new UserChatMessage("Explain Azure OpenAI deployment types in one sentence.")
    },
    new ChatCompletionOptions { MaxOutputTokenCount = 100 });

Console.WriteLine($"Response: {completion.Content[0].Text}");
Console.WriteLine($"Tokens used - Prompt: {completion.Usage.InputTokenCount}, "
    + $"Completion: {completion.Usage.OutputTokenCount}");

# Deploy GPT-4o with Standard deployment type and 30K TPM capacity
az cognitiveservices account deployment create \
  --name aoai-challenge16 \
  --resource-group rg-ai102-challenge16 \
  --deployment-name gpt-4o \
  --model-name gpt-4o \
  --model-version "2024-08-06" \
  --model-format OpenAI \
  --sku-name "Standard" \
  --sku-capacity 30

# Deploy GPT-4o-mini for cost-efficient workloads
az cognitiveservices account deployment create \
  --name aoai-challenge16 \
  --resource-group rg-ai102-challenge16 \
  --deployment-name gpt-4o-mini \
  --model-name gpt-4o-mini \
  --model-version "2024-07-18" \
  --model-format OpenAI \
  --sku-name "GlobalStandard" \
  --sku-capacity 50

# List deployments to verify
az cognitiveservices account deployment list \
  --name aoai-challenge16 \
  --resource-group rg-ai102-challenge16 \
  -o table
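
The `--sku-capacity` value is expressed in units of 1,000 tokens per minute, which is why 30 corresponds to 30K TPM in the command above. As a rough planning aid, the sketch below converts capacity to a TPM limit and estimates how many requests of a given size fit in that budget. The helper names are hypothetical, and the estimate considers the token budget only; the separate RPM limit may be lower.

```python
TOKENS_PER_CAPACITY_UNIT = 1_000  # from the example above: --sku-capacity 30 -> 30K TPM


def tpm_limit(sku_capacity: int) -> int:
    """Tokens-per-minute limit implied by a capacity value."""
    return sku_capacity * TOKENS_PER_CAPACITY_UNIT


def max_requests_per_minute(sku_capacity: int, tokens_per_request: int) -> int:
    """Upper bound on requests per minute from the token budget alone."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return tpm_limit(sku_capacity) // tokens_per_request


print(tpm_limit(30))                     # 30000
print(max_requests_per_minute(30, 600))  # 50
```

Here `tokens_per_request` should cover both the prompt and the completion allowance, since both count against the limit.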

Task 3: Test Rate Limits and Implement Exponential Backoff

Send requests to observe rate limiting behavior and implement proper retry logic.

Python SDK
C# SDK
REST API

import os
import time
from openai import AzureOpenAI, RateLimitError

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21",
    max_retries=0  # disable the SDK's built-in retries so 429s reach the handler below
)

def call_with_exponential_backoff(messages, max_retries=5, base_delay=1.0):
    """Implement exponential backoff for rate-limited requests."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                max_tokens=50
            )
            return response
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Use the Retry-After header (seconds) if present, otherwise exponential backoff
            retry_after = e.response.headers.get("retry-after")
            try:
                delay = float(retry_after)
            except (TypeError, ValueError):
                delay = base_delay * (2 ** attempt)
            print(f"Rate limited. Retrying in {delay:.1f}s (attempt {attempt + 1})")
            time.sleep(delay)

# Simulate high-volume requests to trigger rate limiting
results = []
for i in range(20):
    try:
        response = call_with_exponential_backoff(
            [{"role": "user", "content": f"Say the number {i}"}]
        )
        results.append(response.choices[0].message.content)
        print(f"Request {i}: Success")
    except RateLimitError:
        print(f"Request {i}: Exhausted retries")

print(f"\nCompleted {len(results)}/20 requests")

using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;
using System.ClientModel;

string endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!;
string apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!;

AzureOpenAIClient azureClient = new(
    new Uri(endpoint),
    new AzureKeyCredential(apiKey));

ChatClient chatClient = azureClient.GetChatClient("gpt-4o");

async Task<ChatCompletion?> CallWithExponentialBackoff(
    ChatMessage[] messages, int maxRetries = 5, double baseDelay = 1.0)
{
    for (int attempt = 0; attempt < maxRetries; attempt++)
    {
        try
        {
            return await chatClient.CompleteChatAsync(
                messages,
                new ChatCompletionOptions { MaxOutputTokenCount = 50 });
        }
        catch (ClientResultException ex) when (ex.Status == 429)
        {
            if (attempt == maxRetries - 1) throw;
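            // Prefer the Retry-After header (in seconds) when the service sends one
            double delay = baseDelay * Math.Pow(2, attempt);
            string? retryAfter = null;
            if (ex.GetRawResponse()?.Headers.TryGetValue("Retry-After", out retryAfter) == true
                && double.TryParse(retryAfter, out double seconds))
            {
                delay = seconds;
            }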
            Console.WriteLine(
                $"Rate limited. Retrying in {delay:F1}s (attempt {attempt + 1})");
            await Task.Delay(TimeSpan.FromSeconds(delay));
        }
    }
    return null;
}

// Simulate high-volume requests
int successCount = 0;
for (int i = 0; i < 20; i++)
{
    try
    {
        var result = await CallWithExponentialBackoff(
            new ChatMessage[] { new UserChatMessage($"Say the number {i}") });
        if (result != null)
        {
            successCount++;
            Console.WriteLine($"Request {i}: Success");
        }
    }
    catch (ClientResultException)
    {
        Console.WriteLine($"Request {i}: Exhausted retries");
    }
}

Console.WriteLine($"\nCompleted {successCount}/20 requests");

# Send rapid requests to observe rate limiting (429 responses)
for i in $(seq 1 20); do
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
    -X POST "https://aoai-challenge16.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
    -H "Content-Type: application/json" \
    -H "api-key: ${AZURE_OPENAI_KEY}" \
    -d "{\"messages\": [{\"role\": \"user\", \"content\": \"Say ${i}\"}], \"max_tokens\": 10}")
  echo "Request $i: HTTP $HTTP_CODE"
done

# Check rate limit headers in response
curl -i -X POST "https://aoai-challenge16.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
  -H "Content-Type: application/json" \
  -H "api-key: ${AZURE_OPENAI_KEY}" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 10
  }' 2>/dev/null | grep -i "x-ratelimit\|retry-after"

# Headers to observe:
# x-ratelimit-remaining-tokens
# x-ratelimit-remaining-requests
# Retry-After (when 429)
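
When reading these headers in application code, note that HTTP allows Retry-After to carry either a number of seconds or an HTTP date. The sketch below handles both forms. It is an illustration under that general HTTP rule rather than a statement of which form Azure OpenAI sends, and the helper name `retry_after_seconds` is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header value to seconds, or None if unusable."""
    if not value:
        return None
    value = value.strip()
    try:
        # Form 1: delay in seconds, e.g. "12"
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(retry_after_seconds("12"))  # 12.0
```

Returning None for a missing or malformed header lets the caller fall back to its own exponential backoff schedule.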

Task 4: Compare Standard vs Global Standard Deployments

Measure request latency against the Standard and Global Standard deployments created in Task 2.

Python SDK
C# SDK
REST API

import os
import time
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

def measure_latency(deployment_name, num_requests=5):
    """Measure average latency for a deployment."""
    latencies = []
    for _ in range(num_requests):
        start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
        response = client.chat.completions.create(
            model=deployment_name,
            messages=[{"role": "user", "content": "Respond with OK."}],
            max_tokens=5
        )
        latencies.append(time.perf_counter() - start)
    return {
        "deployment": deployment_name,
        "avg_latency_ms": sum(latencies) / len(latencies) * 1000,
        "min_latency_ms": min(latencies) * 1000,
        "max_latency_ms": max(latencies) * 1000
    }

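# Compare Standard vs Global Standard deployments.
# Note: the two deployments also use different models (gpt-4o vs gpt-4o-mini),
# so a latency gap cannot be attributed to the deployment type alone.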
standard_results = measure_latency("gpt-4o")          # Standard deployment
global_results = measure_latency("gpt-4o-mini")       # Global Standard deployment

print("Standard Deployment:")
print(f"  Avg: {standard_results['avg_latency_ms']:.0f}ms | "
      f"Min: {standard_results['min_latency_ms']:.0f}ms | "
      f"Max: {standard_results['max_latency_ms']:.0f}ms")

print("\nGlobal Standard Deployment:")
print(f"  Avg: {global_results['avg_latency_ms']:.0f}ms | "
      f"Min: {global_results['min_latency_ms']:.0f}ms | "
      f"Max: {global_results['max_latency_ms']:.0f}ms")

using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;
using System.Diagnostics;

string endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!;
string apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!;

AzureOpenAIClient azureClient = new(
    new Uri(endpoint),
    new AzureKeyCredential(apiKey));

async Task<(double avg, double min, double max)> MeasureLatency(
    string deploymentName, int numRequests = 5)
{
    ChatClient chatClient = azureClient.GetChatClient(deploymentName);
    var latencies = new List<double>();

    for (int i = 0; i < numRequests; i++)
    {
        var sw = Stopwatch.StartNew();
        await chatClient.CompleteChatAsync(
            new ChatMessage[] { new UserChatMessage("Respond with OK.") },
            new ChatCompletionOptions { MaxOutputTokenCount = 5 });
        sw.Stop();
        latencies.Add(sw.Elapsed.TotalMilliseconds);
    }

    return (latencies.Average(), latencies.Min(), latencies.Max());
}

var standard = await MeasureLatency("gpt-4o");
var global = await MeasureLatency("gpt-4o-mini");

Console.WriteLine($"Standard: Avg={standard.avg:F0}ms Min={standard.min:F0}ms Max={standard.max:F0}ms");
Console.WriteLine($"Global Standard: Avg={global.avg:F0}ms Min={global.min:F0}ms Max={global.max:F0}ms");

# Compare latencies between deployment types
# Note: date +%s%N requires GNU date; the BSD date shipped with macOS does not support %N
echo "=== Standard Deployment (gpt-4o) ==="
for i in $(seq 1 5); do
  START=$(date +%s%N)
  curl -s -o /dev/null \
    -X POST "https://aoai-challenge16.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
    -H "Content-Type: application/json" \
    -H "api-key: ${AZURE_OPENAI_KEY}" \
    -d '{"messages": [{"role": "user", "content": "OK"}], "max_tokens": 5}'
  END=$(date +%s%N)
  echo "Request $i: $(( (END - START) / 1000000 ))ms"
done

echo ""
echo "=== Global Standard Deployment (gpt-4o-mini) ==="
for i in $(seq 1 5); do
  START=$(date +%s%N)
  curl -s -o /dev/null \
    -X POST "https://aoai-challenge16.openai.azure.com/openai/deployments/gpt-4o-mini/chat/completions?api-version=2024-10-21" \
    -H "Content-Type: application/json" \
    -H "api-key: ${AZURE_OPENAI_KEY}" \
    -d '{"messages": [{"role": "user", "content": "OK"}], "max_tokens": 5}'
  END=$(date +%s%N)
  echo "Request $i: $(( (END - START) / 1000000 ))ms"
done
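
Five samples per deployment give a noisy picture, so it can help to report the median and a high percentile alongside the average. The sketch below summarizes a list of latencies using a nearest-rank 95th percentile. The function name `summarize_latencies` is hypothetical, and the sample values are illustrative, not measurements.

```python
import math
import statistics


def summarize_latencies(latencies_ms: list[float]) -> dict[str, float]:
    """Return mean, median, nearest-rank p95, and max for latency samples."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: the smallest sample with at least 95% of
    # the data at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "mean_ms": statistics.fmean(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
    }


samples = [320.0, 410.0, 450.0, 390.0, 680.0]  # illustrative values only
summary = summarize_latencies(samples)
print(summary)
```

With so few samples the p95 equals the maximum; increasing the request count makes the percentile meaningful.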

Expected Output

Connected successfully: Hello! Connection confirmed.
Response: Standard uses shared compute with pay-per-token, Global Standard optimizes
routing across regions, and Provisioned (PTU) guarantees dedicated throughput capacity.
Tokens used - Prompt: 22, Completion: 31

Rate limited. Retrying in 1.0s (attempt 1)
Request 0: Success
...
Completed 18/20 requests

Standard Deployment:
  Avg: 450ms | Min: 320ms | Max: 680ms
Global Standard Deployment:
  Avg: 380ms | Min: 280ms | Max: 520ms

Break & fix

Scenario	Symptom	Root Cause	Fix
Resource creation fails	`InvalidApiProperties` error	Region doesn't support Azure OpenAI	Use supported region (eastus, eastus2, westus, etc.)
Deployment fails	`ModelNotAvailable`	Model not available in selected region	Check model availability matrix or change region
API returns 401	`Access denied due to invalid subscription key`	Wrong key or endpoint mismatch	Verify key matches resource; check endpoint URL
API returns 429	`Rate limit is exceeded`	Exceeded TPM or RPM limit	Implement exponential backoff; increase capacity
API returns 404	`Resource not found`	Wrong deployment name in request	Verify deployment name matches exactly
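
The table suggests a simple rule for client code: only the 429 case is worth retrying, while 401 and 404 indicate a configuration problem that a retry will not fix. A minimal sketch of that triage follows. The function name and the returned labels are hypothetical, and status codes not listed in the table are deliberately left as "investigate".

```python
def next_action(status_code: int) -> str:
    """Map an HTTP status code to a coarse follow-up action."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code == 429:
        # Throttled: wait (Retry-After or exponential backoff), then retry
        return "retry"
    if status_code in (401, 404):
        # Wrong key, endpoint, or deployment name: retrying will not help
        return "fix-config"
    return "investigate"


for code in (200, 401, 404, 429):
    print(code, next_action(code))
```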

Knowledge Check

1. Which SKU is required when creating an Azure OpenAI resource via Azure CLI?

2. What deployment type provides guaranteed throughput capacity with a fixed monthly cost?

3. When Azure OpenAI returns HTTP 429, which header indicates how long to wait before retrying?

4. What is the capacity unit for Standard deployments when configuring rate limits?

5. Which API version format does Azure OpenAI use, and what happens when a version is retired?

Cleanup

az group delete --name rg-ai102-challenge16 --yes --no-wait
