Challenge 12: Deploy Generative AI Models
45-60 min | Cost: ~$1.00 (estimated) | Domain: Generative AI Solutions (15-20%)
Exam skills covered
- Deploy appropriate generative AI models for specific use cases
- Configure model deployment parameters including quotas and rate limits
- Compare deployment types (Standard, Global Standard, Provisioned)
Overview
Azure OpenAI Service provides access to a variety of generative AI models through a managed deployment model. Choosing the right model and deployment type is a critical skill for the AI-102 exam. The model catalog includes GPT-4o (multimodal, high capability), GPT-4o-mini (cost-efficient for simpler tasks), and open-source models like Phi-4, Mistral, and Llama available through Models as a Service (MaaS).
Deployment types determine how your model is hosted and billed. Standard deployments use shared compute with pay-per-token billing and are subject to Tokens-Per-Minute (TPM) and Requests-Per-Minute (RPM) quotas. Global Standard deployments route traffic globally for higher availability and throughput. Provisioned (PTU) deployments reserve dedicated compute capacity, providing guaranteed throughput for production workloads with predictable costs.
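The trade-offs between the three deployment types can be sketched as a small lookup plus a rule of thumb. This is an illustrative sketch, not an official decision procedure: the `suggest_deployment_type` helper and its two inputs are hypothetical, and the `ProvisionedManaged` SKU name and the data-residency distinction are assumptions to verify against the current deployment-types documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentType:
    sku_name: str   # value passed as the SKU name when creating a deployment
    compute: str    # how the model is hosted
    billing: str    # how usage is charged


# Summary of the three deployment types described above.
DEPLOYMENT_TYPES = {
    "Standard": DeploymentType("Standard", "shared, regional", "pay per token"),
    "Global Standard": DeploymentType("GlobalStandard", "shared, routed globally", "pay per token"),
    # Assumption: "ProvisionedManaged" is the SKU name for provisioned (PTU) deployments.
    "Provisioned": DeploymentType("ProvisionedManaged", "dedicated (PTU)", "reserved capacity"),
}


def suggest_deployment_type(needs_guaranteed_throughput: bool,
                            data_must_stay_in_region: bool) -> str:
    """Hypothetical rule of thumb for picking a deployment type."""
    if needs_guaranteed_throughput:
        return "Provisioned"
    if data_must_stay_in_region:
        return "Standard"
    return "Global Standard"


if __name__ == "__main__":
    choice = suggest_deployment_type(needs_guaranteed_throughput=False,
                                     data_must_stay_in_region=False)
    print(choice, "->", DEPLOYMENT_TYPES[choice])
```

Real decisions also depend on model availability per region and on cost at your expected volume, so treat the helper as a starting point only.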
Understanding quotas is essential. Each subscription has a TPM limit per model per region, and every deployment you create is allocated a portion of that quota. Rate limiting (HTTP 429) occurs when requests exceed the deployment's allocated TPM or RPM. Monitoring quota usage and planning capacity across deployments is a key operational skill.
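The quota arithmetic can be worked through with a short planning sketch. It assumes one capacity unit equals 1K TPM and that Standard deployments get roughly 6 RPM per 1K TPM (the ratio used later in this challenge). The `plan_quota` function, the 450-unit regional limit, and the `gpt-4o-batch` deployment are hypothetical values chosen for illustration.

```python
TPM_PER_UNIT = 1_000  # one SKU capacity unit = 1K tokens per minute
RPM_PER_UNIT = 6      # assumption: about 6 requests per minute per 1K TPM


def plan_quota(regional_limit_units: int, allocations: dict[str, int]) -> dict:
    """Check that per-deployment allocations fit in one model's regional quota.

    allocations maps a deployment name to its capacity units.
    """
    used = sum(allocations.values())
    if used > regional_limit_units:
        raise ValueError(
            f"Allocations need {used} units but only {regional_limit_units} are available"
        )
    return {
        "used_units": used,
        "remaining_units": regional_limit_units - used,
        "deployments": {
            name: {"tpm": units * TPM_PER_UNIT, "rpm": units * RPM_PER_UNIT}
            for name, units in allocations.items()
        },
    }


if __name__ == "__main__":
    # Hypothetical: 450K TPM of GPT-4o quota in one region, split across two deployments.
    plan = plan_quota(450, {"gpt-4o-standard": 30, "gpt-4o-batch": 100})
    print(plan["remaining_units"], "units left")
    print(plan["deployments"]["gpt-4o-standard"])
```

Because quota is tracked per model and per region, run one plan per model rather than pooling GPT-4o and GPT-4o-mini allocations together.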
Architecture
The deployment architecture connects your application to Azure OpenAI endpoints through model deployments configured with specific SKUs and quota allocations.
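To make the connection concrete, here is a minimal sketch of how an application addresses a specific deployment on the data plane. The URL shape mirrors the REST calls used in Task 4 of this challenge; the `build_chat_completions_url` helper itself is hypothetical.

```python
from urllib.parse import quote, urlencode


def build_chat_completions_url(endpoint: str, deployment_name: str, api_version: str) -> str:
    """Return the chat completions URL for one model deployment."""
    base = endpoint.rstrip("/")
    # The deployment name (not the model name) identifies the target in the path.
    path = f"/openai/deployments/{quote(deployment_name, safe='')}/chat/completions"
    return f"{base}{path}?{urlencode({'api-version': api_version})}"


if __name__ == "__main__":
    print(build_chat_completions_url(
        "https://aoai-ai102-challenge12.openai.azure.com",
        "gpt-4o-standard",
        "2024-10-21",
    ))
```

The key point is that callers reference the deployment name, so the same application code can be switched to a different model or SKU by pointing it at another deployment.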
Prerequisites
- Azure subscription with Azure OpenAI access approved
- Azure CLI with the cognitiveservices extension
- An existing Azure OpenAI resource (or permissions to create one)
- Sufficient quota in target region for GPT-4o and GPT-4o-mini
Implementation
Task 1: List Available Models and Deploy GPT-4o
- Python SDK
- C# SDK
- REST API
import os
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
credential = DefaultAzureCredential()
subscription_id = "YOUR_SUBSCRIPTION_ID"
resource_group = "rg-ai102-challenge12"
account_name = "aoai-ai102-challenge12"
client = CognitiveServicesManagementClient(credential, subscription_id)
# List available models for the account
models = client.accounts.list_models(
resource_group_name=resource_group,
account_name=account_name
)
print("Available models:")
for model in models:
    # Each item returned by list_models exposes name, version, and format directly
    print(f"  {model.name} ({model.version}) - {model.format}")
# Create GPT-4o deployment (Standard)
from azure.mgmt.cognitiveservices.models import (
    Deployment, DeploymentModel, DeploymentProperties, Sku
)
deployment = Deployment(
    sku=Sku(name="Standard", capacity=30),  # 30 capacity units = 30K TPM
    properties=DeploymentProperties(
        model=DeploymentModel(
            format="OpenAI",
            name="gpt-4o",
            version="2024-08-06"
        )
    )
)
poller = client.deployments.begin_create_or_update(
resource_group_name=resource_group,
account_name=account_name,
deployment_name="gpt-4o-standard",
deployment=deployment
)
result = poller.result()
print(f"\nDeployed: {result.name}")
print(f" Model: {result.properties.model.name} v{result.properties.model.version}")
print(f" SKU: {result.sku.name} ({result.sku.capacity}K TPM)")
using Azure.Identity;
using Azure.ResourceManager;
using Azure.ResourceManager.CognitiveServices;
using Azure.ResourceManager.CognitiveServices.Models;
var credential = new DefaultAzureCredential();
var client = new ArmClient(credential);
string subscriptionId = "YOUR_SUBSCRIPTION_ID";
string resourceGroup = "rg-ai102-challenge12";
string accountName = "aoai-ai102-challenge12";
var accountId = CognitiveServicesAccountResource.CreateResourceIdentifier(
subscriptionId, resourceGroup, accountName);
var account = client.GetCognitiveServicesAccountResource(accountId);
// List available models
var models = account.GetModelsAsync();
await foreach (var model in models)
{
    // Account-level model entries expose Name and Version directly
    Console.WriteLine($"  {model.Name} ({model.Version})");
}
// Create GPT-4o deployment
var deployments = account.GetCognitiveServicesAccountDeployments();
var deploymentData = new CognitiveServicesAccountDeploymentData
{
Sku = new CognitiveServicesSku("Standard") { Capacity = 30 },
Properties = new CognitiveServicesAccountDeploymentProperties
{
Model = new CognitiveServicesAccountDeploymentModel
{
Format = "OpenAI",
Name = "gpt-4o",
Version = "2024-08-06"
}
}
};
var operation = await deployments.CreateOrUpdateAsync(
Azure.WaitUntil.Completed, "gpt-4o-standard", deploymentData);
Console.WriteLine($"Deployed: {operation.Value.Data.Name}");
Console.WriteLine($" Capacity: {operation.Value.Data.Sku.Capacity}K TPM");
SUBSCRIPTION_ID="YOUR_SUBSCRIPTION_ID"
RESOURCE_GROUP="rg-ai102-challenge12"
ACCOUNT_NAME="aoai-ai102-challenge12"
LOCATION="eastus2"
# Create resource group and OpenAI account
az group create --name $RESOURCE_GROUP --location $LOCATION
az cognitiveservices account create \
--name $ACCOUNT_NAME \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--kind OpenAI \
--sku S0
# List available models
az cognitiveservices account list-models \
--name $ACCOUNT_NAME \
--resource-group $RESOURCE_GROUP \
--output table
# Deploy GPT-4o (Standard, 30K TPM)
az cognitiveservices account deployment create \
--name $ACCOUNT_NAME \
--resource-group $RESOURCE_GROUP \
--deployment-name "gpt-4o-standard" \
--model-name "gpt-4o" \
--model-version "2024-08-06" \
--model-format "OpenAI" \
--sku-name "Standard" \
--sku-capacity 30
Task 2: Deploy GPT-4o-mini for Cost Comparison
- Python SDK
- C# SDK
- REST API
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import (
    Deployment, DeploymentModel, DeploymentProperties, Sku
)
credential = DefaultAzureCredential()
subscription_id = "YOUR_SUBSCRIPTION_ID"
resource_group = "rg-ai102-challenge12"
account_name = "aoai-ai102-challenge12"
client = CognitiveServicesManagementClient(credential, subscription_id)
# Deploy GPT-4o-mini (Global Standard for higher throughput)
deployment_mini = Deployment(
    sku=Sku(name="GlobalStandard", capacity=50),  # 50 capacity units = 50K TPM
    properties=DeploymentProperties(
        model=DeploymentModel(
            format="OpenAI",
            name="gpt-4o-mini",
            version="2024-07-18"
        )
    )
)
poller = client.deployments.begin_create_or_update(
resource_group_name=resource_group,
account_name=account_name,
deployment_name="gpt-4o-mini-global",
deployment=deployment_mini
)
result = poller.result()
print(f"Deployed: {result.name}")
print(f" Model: {result.properties.model.name}")
print(f" SKU: {result.sku.name} ({result.sku.capacity}K TPM)")
# Compare deployments
deployments = client.deployments.list(
resource_group_name=resource_group,
account_name=account_name
)
print("\n--- Deployment Comparison ---")
print(f"{'Name':<25} {'Model':<15} {'SKU':<18} {'TPM':<8}")
print("-" * 70)
for d in deployments:
    print(f"{d.name:<25} {d.properties.model.name:<15} {d.sku.name:<18} {d.sku.capacity}K")
using Azure.Identity;
using Azure.ResourceManager;
using Azure.ResourceManager.CognitiveServices;
using Azure.ResourceManager.CognitiveServices.Models;
var credential = new DefaultAzureCredential();
var client = new ArmClient(credential);
string subscriptionId = "YOUR_SUBSCRIPTION_ID";
string resourceGroup = "rg-ai102-challenge12";
string accountName = "aoai-ai102-challenge12";
var accountId = CognitiveServicesAccountResource.CreateResourceIdentifier(
subscriptionId, resourceGroup, accountName);
var account = client.GetCognitiveServicesAccountResource(accountId);
var deployments = account.GetCognitiveServicesAccountDeployments();
// Deploy GPT-4o-mini with Global Standard SKU
var miniDeployment = new CognitiveServicesAccountDeploymentData
{
Sku = new CognitiveServicesSku("GlobalStandard") { Capacity = 50 },
Properties = new CognitiveServicesAccountDeploymentProperties
{
Model = new CognitiveServicesAccountDeploymentModel
{
Format = "OpenAI",
Name = "gpt-4o-mini",
Version = "2024-07-18"
}
}
};
var operation = await deployments.CreateOrUpdateAsync(
Azure.WaitUntil.Completed, "gpt-4o-mini-global", miniDeployment);
Console.WriteLine($"Deployed: {operation.Value.Data.Name}");
// List all deployments for comparison
Console.WriteLine("\n--- Deployment Comparison ---");
await foreach (var d in deployments.GetAllAsync())
{
Console.WriteLine($"{d.Data.Name,-25} {d.Data.Properties.Model.Name,-15} " +
$"{d.Data.Sku.Name,-18} {d.Data.Sku.Capacity}K TPM");
}
# Deploy GPT-4o-mini with Global Standard
az cognitiveservices account deployment create \
--name $ACCOUNT_NAME \
--resource-group $RESOURCE_GROUP \
--deployment-name "gpt-4o-mini-global" \
--model-name "gpt-4o-mini" \
--model-version "2024-07-18" \
--model-format "OpenAI" \
--sku-name "GlobalStandard" \
--sku-capacity 50
# List all deployments
az cognitiveservices account deployment list \
--name $ACCOUNT_NAME \
--resource-group $RESOURCE_GROUP \
--output table
Task 3: Check Quota Usage
- Python SDK
- C# SDK
- REST API
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
credential = DefaultAzureCredential()
subscription_id = "YOUR_SUBSCRIPTION_ID"
resource_group = "rg-ai102-challenge12"
account_name = "aoai-ai102-challenge12"
location = "eastus2"
client = CognitiveServicesManagementClient(credential, subscription_id)
# Check model quota/usage for the subscription in this region
usages = client.usages.list(location=location)
print(f"Quota usage for {location}:")
print(f"{'Model':<30} {'Used':<10} {'Limit':<10} {'Unit':<10}")
print("-" * 60)
for usage in usages:
    # Guard against missing values so the comparison and formatting never see None
    used = usage.current_value or 0
    if used > 0 or "OpenAI" in (usage.name.value or ""):
        print(f"{usage.name.localized_value or '':<30} "
              f"{used:<10} "
              f"{usage.limit or 0:<10} "
              f"{usage.unit or '':<10}")
# Check deployment-level rate limits
deployments = client.deployments.list(
resource_group_name=resource_group,
account_name=account_name
)
print("\n--- Rate Limits per Deployment ---")
for d in deployments:
    tpm = d.sku.capacity  # capacity units; 1 unit = 1K TPM
    # Standard deployments typically allow about 6 RPM per 1K TPM
    estimated_rpm = tpm * 6
    print(f"{d.name}: {tpm}K TPM, ~{estimated_rpm} RPM")
using Azure.Identity;
using Azure.ResourceManager;
using Azure.ResourceManager.CognitiveServices;
var credential = new DefaultAzureCredential();
var client = new ArmClient(credential);
string subscriptionId = "YOUR_SUBSCRIPTION_ID";
string resourceGroup = "rg-ai102-challenge12";
string accountName = "aoai-ai102-challenge12";
var subscription = await client.GetDefaultSubscriptionAsync();
// Check usages for the account
var accountId = CognitiveServicesAccountResource.CreateResourceIdentifier(
subscriptionId, resourceGroup, accountName);
var account = client.GetCognitiveServicesAccountResource(accountId);
var usages = account.GetUsagesAsync();
Console.WriteLine("Account Usage:");
await foreach (var usage in usages)
{
Console.WriteLine($" {usage.Name?.LocalizedValue}: " +
$"{usage.CurrentValue}/{usage.Limit} ({usage.Unit})");
}
// List deployments with capacity info
var deployments = account.GetCognitiveServicesAccountDeployments();
Console.WriteLine("\n--- Rate Limits per Deployment ---");
await foreach (var d in deployments.GetAllAsync())
{
var tpm = d.Data.Sku.Capacity;
Console.WriteLine($" {d.Data.Name}: {tpm}K TPM");
}
# Check quota usage for all models in your region
az cognitiveservices usage list \
--location $LOCATION \
--output table
# Show deployment details including capacity
az cognitiveservices account deployment show \
--name $ACCOUNT_NAME \
--resource-group $RESOURCE_GROUP \
--deployment-name "gpt-4o-standard" \
--query "{name:name, model:properties.model.name, sku:sku.name, capacity:sku.capacity}"
# REST API - check quota
TOKEN=$(az account get-access-token --query accessToken -o tsv)
curl -s \
"https://management.azure.com/subscriptions/${SUBSCRIPTION_ID}/providers/Microsoft.CognitiveServices/locations/${LOCATION}/usages?api-version=2024-04-01-preview" \
-H "Authorization: Bearer $TOKEN" | jq '.value[] | select(.currentValue > 0)'
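The usage data returned by the quota check can be turned into a quick headroom report. The sketch below assumes the response is a JSON object with a `value` array whose items carry `name.value`, `currentValue`, and `limit` (the fields referenced in the examples above). The `quota_headroom` helper and the sample limits are hypothetical.

```python
def quota_headroom(payload: dict) -> list[dict]:
    """Summarize quota usage entries, most heavily used first."""
    rows = []
    for item in payload.get("value", []):
        limit = item.get("limit") or 0
        used = item.get("currentValue") or 0
        if limit <= 0:
            continue  # skip entries without a meaningful limit
        rows.append({
            "name": item["name"]["value"],
            "used": used,
            "limit": limit,
            "percent_used": round(100 * used / limit, 1),
        })
    return sorted(rows, key=lambda row: row["percent_used"], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample shaped like the usages response; the limits are made up.
    sample = {
        "value": [
            {"name": {"value": "OpenAI.GlobalStandard.gpt-4o-mini"}, "currentValue": 50, "limit": 2000},
            {"name": {"value": "OpenAI.Standard.gpt-4o"}, "currentValue": 30, "limit": 450},
        ]
    }
    for row in quota_headroom(sample):
        print(f"{row['name']:<40} {row['used']}/{row['limit']} ({row['percent_used']}%)")
```

Sorting by percentage used puts the quota closest to exhaustion at the top, which is usually the one to watch before adding another deployment.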
Task 4: Test Deployments
- Python SDK
- C# SDK
- REST API
import os
from openai import AzureOpenAI
endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
api_key = os.environ["AZURE_OPENAI_KEY"]
client = AzureOpenAI(
azure_endpoint=endpoint,
api_key=api_key,
api_version="2024-10-21"
)
test_prompt = "Explain the difference between GPT-4o and GPT-4o-mini in 2 sentences."
# Test GPT-4o
response_4o = client.chat.completions.create(
model="gpt-4o-standard",
messages=[{"role": "user", "content": test_prompt}],
max_tokens=150
)
print(f"GPT-4o response:")
print(f" {response_4o.choices[0].message.content}")
print(f" Tokens: {response_4o.usage.total_tokens}")
# Test GPT-4o-mini
response_mini = client.chat.completions.create(
model="gpt-4o-mini-global",
messages=[{"role": "user", "content": test_prompt}],
max_tokens=150
)
print(f"\nGPT-4o-mini response:")
print(f" {response_mini.choices[0].message.content}")
print(f" Tokens: {response_mini.usage.total_tokens}")
# Cost comparison (approximate; prices vary by model version, region, and
# deployment type, so check the current Azure OpenAI pricing page)
print("\n--- Cost Comparison (approximate) ---")
print("GPT-4o:      Input $5.00/1M tokens, Output $15.00/1M tokens")
print("GPT-4o-mini: Input $0.15/1M tokens, Output $0.60/1M tokens")
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat; // chat message types live in the OpenAI.Chat namespace
string endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!;
string apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!;
var client = new AzureOpenAIClient(
    new Uri(endpoint), new AzureKeyCredential(apiKey));
string testPrompt = "Explain the difference between GPT-4o and GPT-4o-mini in 2 sentences.";
// Test GPT-4o
var chatClient4o = client.GetChatClient("gpt-4o-standard");
var response4o = await chatClient4o.CompleteChatAsync(
    new ChatMessage[] { new UserChatMessage(testPrompt) });
Console.WriteLine("GPT-4o response:");
Console.WriteLine($"  {response4o.Value.Content[0].Text}");
Console.WriteLine($"  Tokens: {response4o.Value.Usage.TotalTokenCount}");
// Test GPT-4o-mini
var chatClientMini = client.GetChatClient("gpt-4o-mini-global");
var responseMini = await chatClientMini.CompleteChatAsync(
    new ChatMessage[] { new UserChatMessage(testPrompt) });
Console.WriteLine("\nGPT-4o-mini response:");
Console.WriteLine($"  {responseMini.Value.Content[0].Text}");
Console.WriteLine($"  Tokens: {responseMini.Value.Usage.TotalTokenCount}");
AZURE_OPENAI_ENDPOINT="https://aoai-ai102-challenge12.openai.azure.com"
AZURE_OPENAI_KEY="YOUR_KEY"
# Test GPT-4o deployment
curl -s "${AZURE_OPENAI_ENDPOINT}/openai/deployments/gpt-4o-standard/chat/completions?api-version=2024-10-21" \
-H "Content-Type: application/json" \
-H "api-key: ${AZURE_OPENAI_KEY}" \
-d '{
"messages": [{"role": "user", "content": "Explain GPT-4o vs GPT-4o-mini in 2 sentences."}],
"max_tokens": 150
}' | jq '{content: .choices[0].message.content, tokens: .usage.total_tokens}'
# Test GPT-4o-mini deployment
curl -s "${AZURE_OPENAI_ENDPOINT}/openai/deployments/gpt-4o-mini-global/chat/completions?api-version=2024-10-21" \
-H "Content-Type: application/json" \
-H "api-key: ${AZURE_OPENAI_KEY}" \
-d '{
"messages": [{"role": "user", "content": "Explain GPT-4o vs GPT-4o-mini in 2 sentences."}],
"max_tokens": 150
}' | jq '{content: .choices[0].message.content, tokens: .usage.total_tokens}'
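The token counts returned by each test call can be converted into a rough cost estimate. The sketch below is a hypothetical helper, not part of any SDK. The GPT-4o-mini prices are the approximate figures quoted in this challenge, and the request volume and token counts are made-up inputs. Confirm current prices before relying on the result.

```python
def estimate_cost_usd(prompt_tokens: int,
                      completion_tokens: int,
                      input_price_per_million: float,
                      output_price_per_million: float) -> float:
    """Estimate the pay-per-token cost in USD for a given token volume."""
    return (
        prompt_tokens * input_price_per_million
        + completion_tokens * output_price_per_million
    ) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 1,000,000 requests, each with 500 prompt
    # tokens and 150 completion tokens, served by GPT-4o-mini.
    requests = 1_000_000
    cost = estimate_cost_usd(
        prompt_tokens=500 * requests,
        completion_tokens=150 * requests,
        input_price_per_million=0.15,   # approximate price from this challenge
        output_price_per_million=0.60,  # approximate price from this challenge
    )
    print(f"Estimated cost: ${cost:.2f}")
```

Running the same function with another model's prices gives a like-for-like comparison for the same workload, which is the comparison that matters for high-volume tasks.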
Expected Output
After completing all tasks, you should have:
- Azure OpenAI resource aoai-ai102-challenge12 with two deployments:
  - gpt-4o-standard: Standard SKU, 30K TPM, model version 2024-08-06
  - gpt-4o-mini-global: GlobalStandard SKU, 50K TPM, model version 2024-07-18
- Quota consumed: 30K TPM from GPT-4o quota, 50K TPM from GPT-4o-mini quota
- Successful test responses from both deployments showing different response styles
Break & fix
| Scenario | Symptom | Root Cause | Fix |
|---|---|---|---|
| Deployment fails | QuotaExceeded error | Insufficient TPM quota in region | Reduce capacity or request quota increase via Azure Portal |
| Model not found | ModelNotFound or empty model list | Model not available in selected region | Check regional availability; try eastus2 or swedencentral |
| 429 Too Many Requests | Rate limit errors during testing | Requests exceed allocated TPM/RPM | Implement exponential backoff; increase deployment capacity |
| Wrong model version | InvalidModelVersion | Specified version retired or not yet available | Use az cognitiveservices account list-models to find valid versions |
| Global Standard unavailable | SKU not supported | Not all models support Global Standard | Use Standard SKU or check model-SKU compatibility docs |
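The exponential backoff fix for the 429 scenario can be sketched as follows. This is a minimal, library-independent illustration: `RateLimited` is a hypothetical stand-in for whatever exception your client raises on HTTP 429 (with the `openai` Python package that would be `openai.RateLimitError`), and the delay values are arbitrary defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 error from a client library."""


def call_with_backoff(operation: Callable[[], T],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry operation on RateLimited, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so concurrent clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky() -> str:
        # Fails twice with a simulated 429, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RateLimited()
        return "ok"

    print(call_with_backoff(flaky, base_delay=0.01))
```

Backoff smooths over short bursts. If 429 responses persist under normal load, increasing the deployment's capacity is the more durable fix.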
Knowledge Check
1. What is the primary difference between Standard and Provisioned deployment types?
2. When deploying a model, what does the 'capacity' parameter in the SKU represent?
3. Which model would be most cost-effective for a high-volume classification task that doesn't require advanced reasoning?
4. What happens when a deployment's rate limit (TPM/RPM) is exceeded?
5. What Azure CLI command deploys a GPT-4o model to an Azure OpenAI resource?
Cleanup
az group delete --name rg-ai102-challenge12 --yes --no-wait