
Challenge 12: Deploy Generative AI Models

Estimated Time

45-60 min | Cost: ~$1.00 (estimated) | Domain: Generative AI Solutions (15-20%)

Exam skills covered

  • Deploy appropriate generative AI models for specific use cases
  • Configure model deployment parameters including quotas and rate limits
  • Compare deployment types (Standard, Global Standard, Provisioned)

Overview

Azure OpenAI Service exposes generative AI models through managed deployments. Choosing the right model and deployment type is a critical skill for the AI-102 exam. The model catalog includes GPT-4o (multimodal, high capability) and GPT-4o-mini (cost-efficient for simpler tasks), while open models such as Phi-4, Mistral, and Llama are available through Models as a Service (MaaS).

Deployment types determine how your model is hosted and billed. Standard deployments use shared compute with pay-per-token billing and are subject to Tokens-Per-Minute (TPM) and Requests-Per-Minute (RPM) quotas. Global Standard deployments route traffic globally for higher availability and throughput. Provisioned (PTU) deployments reserve dedicated compute capacity, providing guaranteed throughput for production workloads with predictable costs.
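The trade-offs in the paragraph above can be read as a small decision rule. The sketch below is only an illustration of that reasoning: the function name and its two flags are hypothetical, and the `ProvisionedManaged` SKU string is an assumption that should be checked against the SKUs your resource actually offers.

```python
def choose_sku(needs_reserved_capacity: bool, wants_global_routing: bool) -> str:
    """Pick a deployment SKU name from two coarse requirements (hypothetical helper)."""
    if needs_reserved_capacity:
        # Dedicated PTU capacity with guaranteed throughput.
        # Assumption: the SKU name for Provisioned deployments is "ProvisionedManaged".
        return "ProvisionedManaged"
    if wants_global_routing:
        # Pay-per-token, traffic routed globally for availability and throughput.
        return "GlobalStandard"
    # Pay-per-token on shared compute, subject to TPM/RPM quotas.
    return "Standard"


print(choose_sku(needs_reserved_capacity=False, wants_global_routing=True))
```

A real decision would also weigh cost, data-processing location, and model availability, which this two-flag rule leaves out.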

Understanding quotas is essential: each subscription has TPM limits per model per region. When you deploy a model, you allocate a portion of that available quota to the deployment. Rate limiting (HTTP 429) occurs when requests exceed the deployment's allocated TPM or RPM. Monitoring quota usage and planning capacity across deployments is a key operational skill.
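The allocation arithmetic is simple enough to sketch in a few lines. The example below is a simplified planning aid, not an Azure API call: the `PlannedDeployment` class, both helper functions, and the 150K TPM limit are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PlannedDeployment:
    name: str
    model: str
    capacity_k_tpm: int  # SKU capacity, in thousands of tokens per minute


def remaining_quota(limit_k_tpm: int, deployments: list, model: str) -> int:
    """Return the unallocated quota (in K TPM) for one model in one region."""
    allocated = sum(d.capacity_k_tpm for d in deployments if d.model == model)
    return limit_k_tpm - allocated


def fits(limit_k_tpm: int, deployments: list, model: str, requested_k_tpm: int) -> bool:
    """Check whether a new deployment of the requested size would stay within quota."""
    return requested_k_tpm <= remaining_quota(limit_k_tpm, deployments, model)


# Hypothetical regional limit of 150K TPM for gpt-4o, with one 30K deployment in place
existing = [PlannedDeployment("gpt-4o-standard", "gpt-4o", 30)]
print(remaining_quota(150, existing, "gpt-4o"))  # 120
print(fits(150, existing, "gpt-4o", 130))        # False
```

Running a check like this before deploying avoids a QuotaExceeded failure at deployment time.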

Architecture

The deployment architecture connects your application to Azure OpenAI endpoints through model deployments configured with specific SKUs and quota allocations.

[Figure: Challenge 12 topology]

Prerequisites

  • Azure subscription with Azure OpenAI access approved
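  • Azure CLI (the az cognitiveservices command group ships with the core CLI)
  • Python 3 with the azure-identity, azure-mgmt-cognitiveservices, and openai packages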
  • An existing Azure OpenAI resource (or permissions to create one)
  • Sufficient quota in target region for GPT-4o and GPT-4o-mini

Implementation

Task 1: List Available Models and Deploy GPT-4o

from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

credential = DefaultAzureCredential()
subscription_id = "YOUR_SUBSCRIPTION_ID"
resource_group = "rg-ai102-challenge12"
account_name = "aoai-ai102-challenge12"

client = CognitiveServicesManagementClient(credential, subscription_id)

# List the models this account can deploy
models = client.accounts.list_models(
    resource_group_name=resource_group,
    account_name=account_name,
)
print("Available models:")
for model in models:
    # Each item is an AccountModel, which exposes name, version, and format directly
    print(f"  {model.name} ({model.version}) - {model.format}")

# Create GPT-4o deployment (Standard)
from azure.mgmt.cognitiveservices.models import (
    Deployment,
    DeploymentModel,
    DeploymentProperties,
    Sku,
)

deployment = Deployment(
    sku=Sku(name="Standard", capacity=30),  # capacity is in units of 1K TPM, so 30K TPM
    properties=DeploymentProperties(
        model=DeploymentModel(
            format="OpenAI",
            name="gpt-4o",
            version="2024-08-06",
        )
    ),
)

# begin_create_or_update returns a poller; result() blocks until deployment finishes
poller = client.deployments.begin_create_or_update(
    resource_group_name=resource_group,
    account_name=account_name,
    deployment_name="gpt-4o-standard",
    deployment=deployment,
)
result = poller.result()
print(f"\nDeployed: {result.name}")
print(f"  Model: {result.properties.model.name} v{result.properties.model.version}")
print(f"  SKU: {result.sku.name} ({result.sku.capacity}K TPM)")

Task 2: Deploy GPT-4o-mini for Cost Comparison

from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import (
    Deployment,
    DeploymentModel,
    DeploymentProperties,
    Sku,
)

credential = DefaultAzureCredential()
subscription_id = "YOUR_SUBSCRIPTION_ID"
resource_group = "rg-ai102-challenge12"
account_name = "aoai-ai102-challenge12"

client = CognitiveServicesManagementClient(credential, subscription_id)

# Deploy GPT-4o-mini (Global Standard for higher throughput)
deployment_mini = Deployment(
    sku=Sku(name="GlobalStandard", capacity=50),  # 50K TPM
    properties=DeploymentProperties(
        model=DeploymentModel(
            format="OpenAI",
            name="gpt-4o-mini",
            version="2024-07-18",
        )
    ),
)

poller = client.deployments.begin_create_or_update(
    resource_group_name=resource_group,
    account_name=account_name,
    deployment_name="gpt-4o-mini-global",
    deployment=deployment_mini,
)
result = poller.result()
print(f"Deployed: {result.name}")
print(f"  Model: {result.properties.model.name}")
print(f"  SKU: {result.sku.name} ({result.sku.capacity}K TPM)")

# Compare deployments
deployments = client.deployments.list(
    resource_group_name=resource_group,
    account_name=account_name,
)
print("\n--- Deployment Comparison ---")
print(f"{'Name':<25} {'Model':<15} {'SKU':<18} {'TPM':<8}")
print("-" * 70)
for d in deployments:
    print(f"{d.name:<25} {d.properties.model.name:<15} {d.sku.name:<18} {d.sku.capacity}K")

Task 3: Check Quota Usage

from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

credential = DefaultAzureCredential()
subscription_id = "YOUR_SUBSCRIPTION_ID"
resource_group = "rg-ai102-challenge12"
account_name = "aoai-ai102-challenge12"
location = "eastus2"

client = CognitiveServicesManagementClient(credential, subscription_id)

# Check model quota/usage for the subscription in this region
usages = client.usages.list(location=location)
print(f"Quota usage for {location}:")
print(f"{'Model':<30} {'Used':<10} {'Limit':<10} {'Unit':<10}")
print("-" * 60)
for usage in usages:
    quota_name = usage.name.value or ""
    current = usage.current_value or 0
    # Show quotas that are in use, plus every OpenAI quota even if unused
    if current > 0 or "OpenAI" in quota_name:
        label = usage.name.localized_value or quota_name
        print(f"{label:<30} "
              f"{current:<10} "
              f"{usage.limit or 0:<10} "
              f"{str(usage.unit or ''):<10}")

# Check deployment-level rate limits
deployments = client.deployments.list(
    resource_group_name=resource_group,
    account_name=account_name,
)
print("\n--- Rate Limits per Deployment ---")
for d in deployments:
    tpm = d.sku.capacity  # SKU capacity, in thousands of tokens per minute
    # Rough estimate: about 6 RPM per 1K TPM; the exact ratio depends on the model
    estimated_rpm = tpm * 6
    print(f"{d.name}: {tpm}K TPM, ~{estimated_rpm} RPM")
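Because a deployment is limited by both TPM and RPM, the request rate it can sustain depends on how many tokens a typical request consumes. The sketch below works through that arithmetic under stated assumptions: the helper names are hypothetical, and the 6 RPM per 1K TPM ratio is the same rough estimate used in the task above, not a value read from the service.

```python
def estimated_rpm(capacity_k_tpm: int, rpm_per_k_tpm: int = 6) -> int:
    """Estimate the RPM limit from SKU capacity (assumes a fixed RPM-per-1K-TPM ratio)."""
    return capacity_k_tpm * rpm_per_k_tpm


def sustainable_requests_per_minute(
    capacity_k_tpm: int, avg_tokens_per_request: int, rpm_per_k_tpm: int = 6
) -> int:
    """Return the request rate allowed by whichever limit binds first."""
    tpm_bound = (capacity_k_tpm * 1000) // avg_tokens_per_request
    rpm_bound = estimated_rpm(capacity_k_tpm, rpm_per_k_tpm)
    return min(tpm_bound, rpm_bound)


# 30K TPM deployment: large requests hit the token limit first
print(sustainable_requests_per_minute(30, avg_tokens_per_request=500))  # 60
# Small requests hit the request limit first
print(sustainable_requests_per_minute(30, avg_tokens_per_request=100))  # 180
```

With 500-token requests the token budget is the bottleneck (30,000 / 500 = 60), while 100-token requests are capped by the estimated 180 RPM.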

Task 4: Test Deployments

import os
from openai import AzureOpenAI

endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
api_key = os.environ["AZURE_OPENAI_KEY"]

client = AzureOpenAI(
    azure_endpoint=endpoint,
    api_key=api_key,
    api_version="2024-10-21",
)

test_prompt = "Explain the difference between GPT-4o and GPT-4o-mini in 2 sentences."

# Test GPT-4o
response_4o = client.chat.completions.create(
    model="gpt-4o-standard",  # deployment name, not the underlying model name
    messages=[{"role": "user", "content": test_prompt}],
    max_tokens=150,
)
print("GPT-4o response:")
print(f"  {response_4o.choices[0].message.content}")
print(f"  Tokens: {response_4o.usage.total_tokens}")

# Test GPT-4o-mini
response_mini = client.chat.completions.create(
    model="gpt-4o-mini-global",  # deployment name, not the underlying model name
    messages=[{"role": "user", "content": test_prompt}],
    max_tokens=150,
)
print("\nGPT-4o-mini response:")
print(f"  {response_mini.choices[0].message.content}")
print(f"  Tokens: {response_mini.usage.total_tokens}")

# Cost comparison (approximate prices, hard-coded for illustration;
# actual prices vary by model version, region, and deployment type)
print("\n--- Cost Comparison (approximate) ---")
print("GPT-4o:      Input $5.00/1M tokens, Output $15.00/1M tokens")
print("GPT-4o-mini: Input $0.15/1M tokens, Output $0.60/1M tokens")
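To turn the per-token prices into a workload estimate, multiply token counts by the matching rate. The following is a minimal sketch that reuses the approximate prices printed above; the `PRICES_PER_1M` table, the `estimate_cost` helper, and the workload figures are hypothetical, and real prices should be taken from the current Azure pricing page.

```python
# Approximate USD prices per 1M tokens, copied from the comparison above
PRICES_PER_1M = {
    "gpt-4o": {"input": 5.00, "output": 15.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of a workload from its input and output token counts."""
    price = PRICES_PER_1M[model]
    total = prompt_tokens * price["input"] + completion_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 1M requests, each with 200 prompt and 50 completion tokens
requests = 1_000_000
for model in PRICES_PER_1M:
    cost = estimate_cost(model, 200 * requests, 50 * requests)
    print(f"{model}: ${cost:,.2f}")
```

For a single call, the same helper can take `response.usage.prompt_tokens` and `response.usage.completion_tokens` from the responses in Task 4.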

Expected Output

After completing all tasks, you should have:

  1. Azure OpenAI resource aoai-ai102-challenge12 with two deployments:
    • gpt-4o-standard — Standard SKU, 30K TPM, model version 2024-08-06
    • gpt-4o-mini-global — GlobalStandard SKU, 50K TPM, model version 2024-07-18
  2. Quota consumed: 30K TPM from the GPT-4o Standard quota, 50K TPM from the GPT-4o-mini Global Standard quota
  3. Successful test responses from both deployments showing different response styles

Break & fix

| Scenario | Symptom | Root Cause | Fix |
| --- | --- | --- | --- |
| Deployment fails | QuotaExceeded error | Insufficient TPM quota in region | Reduce capacity or request quota increase via Azure Portal |
| Model not found | ModelNotFound or empty model list | Model not available in selected region | Check regional availability; try eastus2 or swedencentral |
| 429 Too Many Requests | Rate limit errors during testing | Requests exceed allocated TPM/RPM | Implement exponential backoff; increase deployment capacity |
| Wrong model version | InvalidModelVersion | Specified version retired or not yet available | Use az cognitiveservices account list-models to find valid versions |
| Global Standard unavailable | SKU not supported | Not all models support Global Standard | Use Standard SKU or check model-SKU compatibility docs |
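The exponential backoff fix for 429 responses can be sketched without calling the service. In the example below, `RateLimited` is a hypothetical stand-in for the rate-limit error your client raises (for instance `openai.RateLimitError`), and `call_with_backoff` with its parameters is an illustrative helper, not part of any SDK.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 error raised by a client library."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with exponentially growing, jittered delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter keeps many clients from retrying in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))


# Simulated call that is rate limited twice before succeeding
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited("429 Too Many Requests")
    return "ok"


delays = []  # record delays instead of sleeping, so the demo runs instantly
result = call_with_backoff(flaky, base_delay=0.01, sleep=delays.append)
print(result, len(delays))  # ok 2
```

The openai Python client also retries some failures on its own through its `max_retries` setting, so an explicit wrapper like this matters mostly when you need control over the delay schedule.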

Knowledge Check

1. What is the primary difference between Standard and Provisioned deployment types?

2. When deploying a model, what does the 'capacity' parameter in the SKU represent?

3. Which model would be most cost-effective for a high-volume classification task that doesn't require advanced reasoning?

4. What happens when a deployment's rate limit (TPM/RPM) is exceeded?

5. What Azure CLI command deploys a GPT-4o model to an Azure OpenAI resource?

Cleanup

az group delete --name rg-ai102-challenge12 --yes --no-wait
