Challenge 19: Optimize Generative AI Solutions

Estimated Time

45-60 min | Cost: ~$5.00 (estimated, fine-tuning) | Domain: Generative AI Solutions (15-20%)

Exam skills covered

  • Configure parameters to optimize generative AI output
  • Implement monitoring and observability for generative AI solutions
  • Optimize scalability and performance
  • Implement tracing with Application Insights and OpenTelemetry
  • Prepare and submit fine-tuning jobs

Overview

Optimizing generative AI solutions requires attention to latency, cost, quality, and observability. Streaming responses improve perceived latency by delivering tokens incrementally rather than waiting for complete generation. Counting tokens locally with a library such as tiktoken lets you estimate cost before sending a request and measure how much a shortened prompt saves. Streaming reduces perceived response time, while shorter prompts and outputs reduce actual response time and per-request cost.
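To make the cost side concrete, the arithmetic can be sketched as below. This is a minimal illustration, not a pricing tool: the function name `estimate_cost`, the per-1K-token prices, and the 200-token completion size are placeholder assumptions; only the 68 and 21 prompt-token figures come from this challenge's expected output.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimate the cost of one request, in the same currency as the prices."""
    input_cost = (prompt_tokens / 1000) * input_price_per_1k
    output_cost = (completion_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


# Placeholder prices for illustration only (assumption); check the current
# rates for your own deployment before relying on any estimate.
INPUT_PRICE_PER_1K = 0.0025
OUTPUT_PRICE_PER_1K = 0.01

# Same assumed completion size for both prompts, so only the prompt differs.
verbose = estimate_cost(68, 200, INPUT_PRICE_PER_1K, OUTPUT_PRICE_PER_1K)
optimized = estimate_cost(21, 200, INPUT_PRICE_PER_1K, OUTPUT_PRICE_PER_1K)

print(f"Verbose prompt:   {verbose:.6f} per request")
print(f"Optimized prompt: {optimized:.6f} per request")
print(f"Saved per 1M requests: {(verbose - optimized) * 1_000_000:.2f}")
```

Because output tokens are typically priced higher than input tokens, capping `max_tokens` often matters as much as shortening the prompt.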

Observability is critical for production AI systems. Azure OpenAI integrates with Application Insights through OpenTelemetry, providing end-to-end tracing of requests, token usage, latency distributions, and error rates. Custom spans and attributes enable tracking business-specific metrics like prompt template usage and response quality scores.
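A latency distribution is usually summarized with percentiles rather than an average, because a few slow requests can hide behind a healthy mean. In practice you would query these figures from Application Insights; the sketch below only shows what such a summary computes. The function name `latency_summary` and the sample values are made up for illustration.

```python
import statistics


def latency_summary(samples):
    """Return p50, p95, and max for a list of latency samples in seconds."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    # n=100 yields 99 cut points; index 49 is the 50th percentile, 94 the 95th.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "max": max(samples)}


# Made-up latencies: mostly around one second, with a single slow outlier.
samples = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0, 0.9, 1.3, 1.1, 4.5]
summary = latency_summary(samples)
print(summary)
```

The outlier barely moves the median but dominates the upper percentiles, which is why p95 is a common alerting threshold.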

Fine-tuning allows customization of base models with domain-specific data. The workflow involves preparing training data in JSONL format (with system/user/assistant message pairs), uploading files, creating a fine-tuning job, and deploying the resulting custom model. Fine-tuned models can achieve better task-specific performance with shorter prompts, reducing both latency and per-request cost.
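Since a malformed training file is a common reason for a failed job, it can help to check the JSONL locally before uploading. The following is a rough sketch under stated assumptions: the function name `validate_training_lines` is hypothetical, the allowed roles are limited to the three this challenge uses, and the minimum of 10 examples is the figure cited in this challenge. It is not the service's own validator.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # roles used in this challenge
MIN_EXAMPLES = 10  # minimum cited in this challenge


def validate_training_lines(lines, min_examples=MIN_EXAMPLES):
    """Return a list of problems found; an empty list means the data passed."""
    problems = []
    examples = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        examples += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {number}: missing non-empty 'messages' list")
            continue
        roles = set()
        for message in messages:
            role = message.get("role") if isinstance(message, dict) else None
            if role not in ALLOWED_ROLES or not isinstance(message.get("content"), str):
                problems.append(f"line {number}: each message needs a known role and string content")
                break
            roles.add(role)
        else:
            # Runs only when every message in the example was well-formed.
            if not {"user", "assistant"} <= roles:
                problems.append(f"line {number}: needs at least one user and one assistant message")
    if examples < min_examples:
        problems.append(f"only {examples} examples; at least {min_examples} required")
    return problems


good = json.dumps({"messages": [
    {"role": "system", "content": "You are an Azure billing assistant."},
    {"role": "user", "content": "What is a Reserved Instance?"},
    {"role": "assistant", "content": "A billing discount for a 1-year or 3-year commitment."},
]})
print(validate_training_lines([good] * 10))
print(validate_training_lines([good] * 3 + ["{not json"]))
```

To check a file on disk, pass the open file object directly, for example `validate_training_lines(open("training_data.jsonl"))`, since iterating a text file yields its lines.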

Architecture

This challenge implements streaming responses, configures OpenTelemetry tracing, optimizes token usage, and prepares a fine-tuning workflow.

Figure: Challenge 19 topology

Prerequisites

  • Azure OpenAI resource with GPT-4o deployed
  • Application Insights resource (connection string)
  • Python 3.9+ with openai, tiktoken, azure-monitor-opentelemetry packages
  • .NET 8 SDK with Azure.AI.OpenAI, Azure.Monitor.OpenTelemetry.AspNetCore NuGet packages
  • Training data in JSONL format (for fine-tuning task)

Implementation

Task 1: Implement Streaming Responses

import os
import time
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

# Non-streaming: wait for complete response
start = time.time()
response = client.chat.completions.create(
    model="gpt-4o",  # name of your Azure OpenAI deployment
    messages=[{"role": "user", "content": "Explain 5 Azure AI services in detail."}],
    max_tokens=500
)
non_stream_time = time.time() - start
print(f"Non-streaming: {non_stream_time:.2f}s total wait")
print(f"Response length: {len(response.choices[0].message.content)} chars\n")

# Streaming: receive tokens incrementally
start = time.time()
first_token_time = None
full_response = ""

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain 5 Azure AI services in detail."}],
    max_tokens=500,
    stream=True
)

for chunk in stream:
    # Some chunks carry only metadata (empty choices or no content); skip them
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.time() - start
        content = chunk.choices[0].delta.content
        full_response += content
        print(content, end="", flush=True)

total_stream_time = time.time() - start

if first_token_time is None:
    # Guard: formatting None with :.2f would raise a TypeError
    print("\n\nStreaming returned no content")
else:
    print(f"\n\nStreaming: first token in {first_token_time:.2f}s, "
          f"total {total_stream_time:.2f}s")
    print(f"Response length: {len(full_response)} chars")
    print(f"Time to first token improvement: {non_stream_time - first_token_time:.2f}s faster")
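The chunk-handling loop above can be factored into a small helper that is testable without calling the service. This is a sketch, not part of the challenge: `consume_stream` and `fake_chunk` are hypothetical names, and the fake chunks only mimic the attribute shape (`choices[0].delta.content`) that the loop relies on.

```python
import time
from types import SimpleNamespace


def consume_stream(chunks, clock=time.perf_counter):
    """Collect streamed text and measure time to first token.

    Returns (text, seconds_to_first_token, total_seconds). The second value
    is None when no chunk carried any content.
    """
    start = clock()
    first_token = None
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # metadata-only chunk with no choices
        content = chunk.choices[0].delta.content
        if not content:
            continue  # role-only or empty delta
        if first_token is None:
            first_token = clock() - start
        parts.append(content)
    return "".join(parts), first_token, clock() - start


def fake_chunk(content=None, has_choice=True):
    """Build a stand-in with the same attribute shape as a streamed chunk."""
    delta = SimpleNamespace(content=content)
    choices = [SimpleNamespace(delta=delta)] if has_choice else []
    return SimpleNamespace(choices=choices)


demo = [fake_chunk(has_choice=False), fake_chunk(None),
        fake_chunk("Hello"), fake_chunk(", world")]
text, ttft, total = consume_stream(demo)
print(repr(text), ttft is not None)
```

With a real client, the same helper would be called as `consume_stream(stream)` on the object returned by `client.chat.completions.create(..., stream=True)`. Joining a list of parts avoids repeated string concatenation on long responses.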

Task 2: Configure OpenTelemetry Tracing

import os
from openai import AzureOpenAI
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Configure Application Insights export
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"]
)

tracer = trace.get_tracer(__name__)

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

@tracer.start_as_current_span("chat_completion")
def get_completion(user_message: str, template_name: str = "default") -> str:
    """Traced chat completion with custom attributes."""
    span = trace.get_current_span()
    span.set_attribute("ai.prompt_template", template_name)
    span.set_attribute("ai.model", "gpt-4o")
    span.set_attribute("ai.user_message_length", len(user_message))

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message}
        ],
        max_tokens=200
    )

    # Record usage metrics as span attributes
    span.set_attribute("ai.prompt_tokens", response.usage.prompt_tokens)
    span.set_attribute("ai.completion_tokens", response.usage.completion_tokens)
    span.set_attribute("ai.total_tokens", response.usage.total_tokens)
    span.set_attribute("ai.finish_reason", response.choices[0].finish_reason)

    return response.choices[0].message.content

# Execute traced requests
with tracer.start_as_current_span("user_interaction") as parent_span:
    parent_span.set_attribute("user.session_id", "session-12345")

    result1 = get_completion("What is Azure OpenAI?", "knowledge_qa")
    result2 = get_completion("Summarize in one sentence.", "summarization")

    parent_span.set_attribute("interaction.total_requests", 2)

print("Traces exported to Application Insights")
print(f"Result 1: {result1[:100]}...")
print(f"Result 2: {result2[:100]}...")
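The attribute-setting lines in `get_completion` assume that `response.usage` and `finish_reason` are always present. One possible way to make that more defensive is to build the attribute dictionary in a separate function and skip missing values. The helper name `usage_attributes` is hypothetical, and the demo uses stand-in objects rather than a real API response.

```python
from types import SimpleNamespace


def usage_attributes(response, template_name="default"):
    """Build span attributes from a chat completion, omitting missing values."""
    attrs = {"ai.prompt_template": template_name}
    usage = getattr(response, "usage", None)
    if usage is not None:
        attrs["ai.prompt_tokens"] = usage.prompt_tokens
        attrs["ai.completion_tokens"] = usage.completion_tokens
        attrs["ai.total_tokens"] = usage.total_tokens
    choices = getattr(response, "choices", None) or []
    if choices and choices[0].finish_reason is not None:
        attrs["ai.finish_reason"] = choices[0].finish_reason
    return attrs


# Stand-in object with the same attribute shape as a completion response.
fake_response = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=21, completion_tokens=50, total_tokens=71),
    choices=[SimpleNamespace(finish_reason="stop")],
)
attrs = usage_attributes(fake_response, "knowledge_qa")
print(attrs)

# Inside a traced function, the dictionary could be attached in one call:
#     trace.get_current_span().set_attributes(attrs)
```

Keeping the mapping in a plain function also makes it straightforward to unit test the attribute names without an exporter configured.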

Task 3: Count and Optimize Tokens

import os
import tiktoken
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

# Get the tokenizer for GPT-4o (uses o200k_base encoding)
encoding = tiktoken.encoding_for_model("gpt-4o")

def count_message_tokens(messages: list, model: str = "gpt-4o") -> int:
    """Count tokens for a list of chat messages."""
    enc = tiktoken.encoding_for_model(model)
    tokens_per_message = 3  # Every message has <|start|>role/name\n content<|end|>\n
    tokens_per_name = 1

    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(enc.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # Every reply is primed with <|start|>assistant<|message|>
    return num_tokens

# Example: Compare verbose vs. optimized prompts
verbose_messages = [
    {"role": "system", "content": "You are a very helpful and knowledgeable assistant who always provides detailed, comprehensive, and thorough answers to any questions that users might ask about Azure cloud computing services and their various features and capabilities."},
    {"role": "user", "content": "Could you please explain to me what Azure Cognitive Services is and what it does and how it works?"}
]

optimized_messages = [
    {"role": "system", "content": "Azure technical assistant. Be concise."},
    {"role": "user", "content": "What is Azure Cognitive Services?"}
]

verbose_tokens = count_message_tokens(verbose_messages)
optimized_tokens = count_message_tokens(optimized_messages)

print(f"Verbose prompt: {verbose_tokens} tokens")
print(f"Optimized prompt: {optimized_tokens} tokens")
print(f"Token savings: {verbose_tokens - optimized_tokens} tokens "
      f"({(1 - optimized_tokens/verbose_tokens)*100:.0f}% reduction)")

# Verify with actual API call
response = client.chat.completions.create(
    model="gpt-4o",
    messages=optimized_messages,
    max_tokens=100
)
print(f"\nActual prompt tokens (API): {response.usage.prompt_tokens}")
print(f"Local estimate: {optimized_tokens}")

# Truncation strategy for long contexts
def truncate_to_token_limit(text: str, max_tokens: int = 4000) -> str:
    """Truncate text to fit within token limit."""
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    truncated_tokens = tokens[:max_tokens]
    return encoding.decode(truncated_tokens)

long_text = "Azure provides many services. " * 500  # Simulate long input
truncated = truncate_to_token_limit(long_text, max_tokens=100)
print(f"\nOriginal: {len(encoding.encode(long_text))} tokens")
print(f"Truncated: {len(encoding.encode(truncated))} tokens")
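Cutting a single string at a token boundary works for one long input, but multi-turn conversations are usually trimmed by dropping whole messages instead. The sketch below shows one simple policy: keep system messages and discard the oldest turns first. It is an illustration under assumptions: `trim_history` is a hypothetical helper, it ignores the per-message overhead counted above, and the demo uses a whitespace word count in place of a real tokenizer so that it runs without tiktoken.

```python
def trim_history(messages, max_tokens, count_tokens):
    """Drop the oldest non-system messages until the content fits max_tokens.

    count_tokens is any callable mapping a string to a token count.
    System messages are always kept and moved to the front.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    while rest and total(system + rest) > max_tokens:
        rest.pop(0)  # discard the oldest remaining turn
    return system + rest


def word_count(text):
    """Crude stand-in for a tokenizer (assumption for the demo only)."""
    return len(text.split())


history = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "first question about storage"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "second question"},
]
trimmed = trim_history(history, max_tokens=6, count_tokens=word_count)
print([m["role"] for m in trimmed])
```

With tiktoken installed, the counter could be `lambda text: len(encoding.encode(text))`. A production version would likely drop user/assistant turns in pairs so the history never starts with an orphaned reply.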

Task 4: Prepare Fine-Tuning Data and Submit Job

import os
import json
import time
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

# Step 1: Prepare training data in JSONL format
# NOTE: the three examples below only illustrate the format. Extend this list
# to at least 10 examples before running; jobs below that minimum fail.
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are an Azure billing assistant."},
            {"role": "user", "content": "What is a Reserved Instance?"},
            {"role": "assistant", "content": "A Reserved Instance (RI) is a billing discount applied when you commit to a 1-year or 3-year term for specific Azure resources like VMs or SQL databases. RIs can save up to 72% compared to pay-as-you-go pricing."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are an Azure billing assistant."},
            {"role": "user", "content": "How do I view my Azure spending?"},
            {"role": "assistant", "content": "Navigate to Cost Management + Billing in the Azure portal. Use Cost Analysis for detailed breakdowns by resource, service, or time period. Set up budgets and alerts to monitor spending proactively."}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are an Azure billing assistant."},
            {"role": "user", "content": "What's the difference between PAYG and EA?"},
            {"role": "assistant", "content": "Pay-As-You-Go (PAYG) charges per-usage with no commitment. Enterprise Agreement (EA) is a volume licensing contract with upfront monetary commitment, offering lower rates and centralized billing for organizations."}
        ]
    }
]

# Stop early rather than upload a file that is too small to train on
if len(training_data) < 10:
    raise SystemExit(f"Need at least 10 training examples, found {len(training_data)}")

# Write training file, one JSON object per line
with open("training_data.jsonl", "w") as f:
    for entry in training_data:
        f.write(json.dumps(entry) + "\n")

# Step 2: Upload training file
with open("training_data.jsonl", "rb") as f:
    training_file = client.files.create(
        file=f,
        purpose="fine-tune"
    )
print(f"Uploaded file ID: {training_file.id}")
print(f"Status: {training_file.status}")

# Step 3: Create fine-tuning job
fine_tuning_job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # Base model to fine-tune
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 1,
        "learning_rate_multiplier": 1.0
    },
    suffix="azure-billing"  # Custom model name suffix
)

print(f"\nFine-tuning job created: {fine_tuning_job.id}")
print(f"Status: {fine_tuning_job.status}")

# Step 4: Monitor fine-tuning progress
while fine_tuning_job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(30)
    fine_tuning_job = client.fine_tuning.jobs.retrieve(fine_tuning_job.id)
    print(f"Status: {fine_tuning_job.status}")

if fine_tuning_job.status == "succeeded":
    print(f"Fine-tuned model: {fine_tuning_job.fine_tuned_model}")
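The polling loop in Step 4 runs until the job reaches a terminal state, however long that takes. A variant with an upper time limit might look like the sketch below. The helper name `wait_for_job`, its parameters, and the default one-hour timeout are assumptions for illustration; the demo feeds it canned statuses instead of calling the service.

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(fetch_status, poll_seconds=30.0, timeout_seconds=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns a terminal status or time runs out.

    fetch_status is any zero-argument callable returning the status string.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_seconds}s")
        sleep(poll_seconds)


# Demo with canned statuses and a no-op sleep, so it finishes instantly.
canned = iter(["pending", "running", "succeeded"])
final = wait_for_job(lambda: next(canned), sleep=lambda seconds: None)
print(final)

# With a real client, the callable could wrap the retrieve call, for example:
#     wait_for_job(lambda: client.fine_tuning.jobs.retrieve(job_id).status)
```

Using a monotonic clock for the deadline keeps the timeout correct even if the system clock is adjusted while the job is running.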

Expected Output

Non-streaming: 3.45s total wait
Response length: 892 chars

Streaming: first token in 0.42s, total 3.51s
Response length: 892 chars
Time to first token improvement: 3.03s faster

Traces exported to Application Insights

Verbose prompt: 68 tokens
Optimized prompt: 21 tokens
Token savings: 47 tokens (69% reduction)

Actual prompt tokens (API): 21
Local estimate: 21

Uploaded file ID: file-abc123def456
Fine-tuning job created: ftjob-xyz789
Status: running
Status: succeeded
Fine-tuned model: ft:gpt-4o-mini-2024-07-18:azure-billing:abc123

Break & fix

| Scenario | Symptom | Root Cause | Fix |
|---|---|---|---|
| Streaming returns empty chunks | No content in delta | Normal: some chunks contain only role/metadata | Skip chunks whose choices list is empty or whose delta.content is None |
| Token count mismatch | Local count differs from API | Tokenizer version mismatch or message overhead | Use tiktoken with the correct model; account for the 3-token per-message overhead |
| Fine-tuning job fails | Status: failed | Training data format invalid or fewer than 10 examples | Validate JSONL format; ensure a minimum of 10 training examples |
| Traces not appearing | No data in Application Insights | Connection string misconfigured or ingestion delay | Verify connection string; wait 2-5 minutes for ingestion |
| Fine-tuned model high latency | Slower than base model | Custom model not optimized for deployment | Increase SKU capacity; consider whether fine-tuning is necessary vs. few-shot prompting |

Knowledge Check

1. What is the primary benefit of streaming responses in Azure OpenAI?

2. What is the minimum number of training examples required for fine-tuning in Azure OpenAI?

3. Which library is used to count tokens locally for GPT-4o before making API calls?

4. When configuring OpenTelemetry tracing for Azure OpenAI, which metric is most important for cost monitoring?

5. What format must training data use for Azure OpenAI fine-tuning?

Cleanup

az group delete --name rg-ai102-challenge19 --yes --no-wait

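# Optional: delete uploaded fine-tuning files individually (only needed if you keep the resource group)
# AZURE_OPENAI_ENDPOINT already includes the https:// scheme; ${VAR%/} strips a trailing slash
# curl -X DELETE "${AZURE_OPENAI_ENDPOINT%/}/openai/files/{file-id}?api-version=2024-10-21" \
#   -H "api-key: ${AZURE_OPENAI_KEY}"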

Learn More