Challenge 18: DALL-E and Multimodal Models

Estimated Time

45-60 min | Cost: ~$2.00 (estimated, image generation) | Domain: Generative AI Solutions (15-20%)

Exam skills covered

  • Use DALL-E to generate images
  • Use large multimodal models (GPT-4o vision capabilities)

Overview

Azure OpenAI provides access to multimodal AI capabilities through two primary features: DALL-E 3 for image generation and GPT-4o for vision understanding. DALL-E 3 generates images from text descriptions, supporting sizes of 1024×1024, 1024×1792, and 1792×1024 with configurable quality settings (standard or HD). Each generation request produces a unique image with a temporary URL valid for 24 hours.
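
The DALL-E 3 options used in this challenge can be checked locally before a request is sent. The following is a minimal sketch, not part of the SDK: `validate_dalle3_request` is a hypothetical helper, the size and quality values come from the paragraph above, and the style values and the n=1 limit come from Task 1.

```python
# Allowed values as listed in this challenge (assumed to match the deployed model).
SUPPORTED_SIZES = {"1024x1024", "1024x1792", "1792x1024"}
SUPPORTED_QUALITIES = {"standard", "hd"}
SUPPORTED_STYLES = {"vivid", "natural"}


def validate_dalle3_request(size: str = "1024x1024",
                            quality: str = "standard",
                            style: str = "vivid",
                            n: int = 1) -> None:
    """Raise ValueError if any option falls outside the values listed above."""
    if size not in SUPPORTED_SIZES:
        raise ValueError(f"Unsupported size {size!r}; expected one of {sorted(SUPPORTED_SIZES)}")
    if quality not in SUPPORTED_QUALITIES:
        raise ValueError(f"Unsupported quality {quality!r}; expected 'standard' or 'hd'")
    if style not in SUPPORTED_STYLES:
        raise ValueError(f"Unsupported style {style!r}; expected 'vivid' or 'natural'")
    if n != 1:
        raise ValueError("DALL-E 3 generates one image per request; set n=1")


# Example: a portrait HD request passes, a 512x512 request does not.
validate_dalle3_request(size="1024x1792", quality="hd", style="natural")
```

Failing fast on the client side avoids spending a round trip on a request the service would reject anyway.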

GPT-4o's vision capabilities allow the model to analyze images provided either as URLs or base64-encoded data. The model can describe image content, extract text (OCR), interpret charts and diagrams, compare multiple images, and answer questions about visual content. Images are processed as special content parts within the chat completions API, maintaining the familiar message structure.
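
The content-part structure described above can be sketched with two small helpers. `build_image_part` and `build_user_message` are hypothetical names introduced here for illustration; they use only the standard library and assume that a local file's MIME type can be guessed from its extension.

```python
import base64
import mimetypes
from pathlib import Path


def build_image_part(source: str, detail: str = "auto") -> dict:
    """Return an image content part for a chat completions user message.

    `source` is either an http(s) URL or a path to a local image file.
    """
    if source.startswith(("http://", "https://")):
        url = source  # remote image: pass the URL through unchanged
    else:
        # Local file: embed the bytes as a base64 data URL.
        mime_type, _ = mimetypes.guess_type(source)
        if mime_type is None or not mime_type.startswith("image/"):
            raise ValueError(f"Cannot determine an image MIME type for {source!r}")
        encoded = base64.b64encode(Path(source).read_bytes()).decode("ascii")
        url = f"data:{mime_type};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


def build_user_message(question: str, *sources: str, detail: str = "auto") -> dict:
    """Combine one text part with any number of image parts."""
    parts = [{"type": "text", "text": question}]
    parts.extend(build_image_part(source, detail) for source in sources)
    return {"role": "user", "content": parts}
```

The returned dictionary can be placed directly in the `messages` list of a chat completions call, which keeps URL and base64 inputs on the same code path.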

When working with multimodal inputs, token cost matters because the cost of image analysis varies with resolution. The detail parameter controls processing: low uses a fixed 85 tokens regardless of image size, while high resizes the image and then charges in proportion to the number of 512×512 tiles needed to cover it. A third value, auto, lets the service choose between the two.
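
The tile arithmetic can be estimated before sending a request. This sketch assumes the tiling rule published for GPT-4o: the image is first scaled to fit within 2048×2048, then scaled so its shortest side is at most 768 pixels, and the cost is 85 base tokens plus 170 tokens per 512×512 tile. Those constants, and the choice not to upscale small images, are assumptions that may not hold for every model version; `estimate_image_tokens` is a hypothetical helper.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image input tokens under the assumed GPT-4o tiling rule."""
    if detail == "low":
        return 85  # fixed cost, independent of image size

    # Step 1: shrink (never enlarge) so the image fits within 2048x2048.
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 2: shrink (never enlarge) so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 3: count the 512x512 tiles needed to cover the resized image.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


for size in [(1024, 1024), (2048, 4096)]:
    print(size, estimate_image_tokens(*size, detail="high"))
```

Under these assumptions a 1024×1024 image resizes to 768×768 and needs four tiles (765 tokens), while a 2048×4096 image resizes to 768×1536 and needs six tiles (1105 tokens).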

Architecture

This challenge generates images with DALL-E 3, analyzes images with GPT-4o vision, and explores OCR and chart understanding capabilities.

[Figure: Challenge 18 topology]

Prerequisites

  • Azure OpenAI resource with DALL-E 3 model deployed (deployment name: dall-e-3)
  • Azure OpenAI resource with GPT-4o model deployed (deployment name: gpt-4o)
  • Python 3.9+ with openai package installed
  • .NET 8 SDK with Azure.AI.OpenAI NuGet package
  • Sample images for vision analysis (URLs or local files)

Implementation

Task 1: Generate Images with DALL-E 3

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

# Generate an image with DALL-E 3
result = client.images.generate(
    model="dall-e-3",      # deployment name
    prompt="A modern cloud data center with glowing blue network connections, "
           "isometric 3D illustration style, clean white background",
    size="1024x1024",      # Options: 1024x1024, 1024x1792, 1792x1024
    quality="standard",    # Options: standard, hd
    style="vivid",         # Options: vivid, natural
    n=1                    # DALL-E 3 only supports n=1
)

image_url = result.data[0].url
revised_prompt = result.data[0].revised_prompt

print(f"Image URL: {image_url}")
print(f"Revised prompt: {revised_prompt}")

# Generate HD quality portrait image
result_hd = client.images.generate(
    model="dall-e-3",
    prompt="Professional headshot photo of a friendly AI robot assistant, "
           "soft studio lighting, shallow depth of field",
    size="1024x1792",      # Portrait orientation
    quality="hd",          # Higher detail
    style="natural",       # More photorealistic
    n=1
)

print(f"\nHD Image URL: {result_hd.data[0].url}")
print(f"Revised prompt: {result_hd.data[0].revised_prompt}")
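
Because the generated image URL is temporary, it may be worth saving a copy while the link is still valid. The following is a minimal standard-library sketch; `save_image` and the destination path are hypothetical and not part of the openai package.

```python
import shutil
import urllib.request
from pathlib import Path


def save_image(url: str, destination: str) -> Path:
    """Download the image at `url` and write it to `destination`."""
    target = Path(destination)
    target.parent.mkdir(parents=True, exist_ok=True)  # create folders as needed
    # Stream the response body to disk instead of loading it all into memory.
    with urllib.request.urlopen(url, timeout=30) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Usage with the Task 1 result (hypothetical output path):
# save_image(result.data[0].url, "images/datacenter.png")
```

Saving the revised prompt alongside the file makes it easier to trace later which prompt produced which image.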

Task 2: Analyze Images with GPT-4o Vision (URL Input)

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

# Analyze an image using a URL
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are an image analysis assistant. Describe images accurately and concisely."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in detail. What architecture components do you see?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        # Note: vision input expects raster formats (PNG, JPEG, GIF, WEBP);
                        # an SVG URL like this one may be rejected, so substitute a PNG if needed.
                        "url": "https://learn.microsoft.com/en-us/azure/architecture/guide/images/architecture-styles/web-queue-worker-logical.svg",
                        "detail": "high"  # Options: low, high, auto
                    }
                }
            ]
        }
    ],
    max_tokens=500
)

print(f"Analysis: {response.choices[0].message.content}")
print(f"Tokens used: {response.usage.total_tokens}")

# Compare multiple images
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two architecture diagrams. What are the key differences?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/architecture-v1.png",
                        "detail": "high"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/architecture-v2.png",
                        "detail": "high"
                    }
                }
            ]
        }
    ],
    max_tokens=500
)

print(f"\nComparison: {response.choices[0].message.content}")

Task 3: Extract Text from Images (OCR with Vision)

import os
import base64
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

def encode_image_to_base64(image_path: str) -> str:
    """Encode a local image file to base64."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Option 1: Base64-encoded local image
image_base64 = encode_image_to_base64("sample-document.png")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are an OCR assistant. Extract all visible text from the image exactly as written. Preserve formatting where possible."
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text from this document image."},
                {
                    "type": "image_url",
                    "image_url": {
                        # The MIME type in the data URL must match the file (PNG here)
                        "url": f"data:image/png;base64,{image_base64}",
                        "detail": "high"
                    }
                }
            ]
        }
    ],
    max_tokens=1000
)

print("Extracted text:")
print(response.choices[0].message.content)

# Option 2: Chart/diagram understanding
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Analyze this chart. What trends do you see? Provide the data points if visible."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/sales-chart-q4.png",
                        "detail": "high"
                    }
                }
            ]
        }
    ],
    max_tokens=500
)

print(f"\nChart analysis: {response.choices[0].message.content}")

Task 4: Compare Image Analysis Approaches

import os
import time
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

image_url = "https://example.com/sample-architecture.png"

# Low detail: Fixed cost (85 tokens), faster, less accurate
start = time.time()
response_low = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": image_url, "detail": "low"}}
        ]
    }],
    max_tokens=300
)
time_low = time.time() - start

# High detail: Variable cost (based on image size), slower, more accurate
start = time.time()
response_high = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": image_url, "detail": "high"}}
        ]
    }],
    max_tokens=300
)
time_high = time.time() - start

print("=== Low Detail ===")
print(f"Time: {time_low:.2f}s | Tokens: {response_low.usage.total_tokens}")
print(f"Response: {response_low.choices[0].message.content[:200]}...")

print("\n=== High Detail ===")
print(f"Time: {time_high:.2f}s | Tokens: {response_high.usage.total_tokens}")
print(f"Response: {response_high.choices[0].message.content[:200]}...")

print("\n=== Cost Comparison ===")
print("Low detail always uses 85 image tokens (fixed)")
print(f"High detail uses {response_high.usage.prompt_tokens - response_low.usage.prompt_tokens} additional image tokens")

Expected Output

Image URL: https://dalleproduse.blob.core.windows.net/...
Revised prompt: A highly detailed modern cloud data center rendered in isometric 3D...

HD Image URL: https://dalleproduse.blob.core.windows.net/...
Revised prompt: A professional studio headshot photograph of a friendly humanoid robot...

Analysis: The image shows a web-queue-worker architecture pattern with a web frontend
connected to a message queue, which feeds into a background worker process...

Extracted text:
INVOICE #12345
Date: 2024-03-15
Customer: Contoso Ltd.
...

=== Low Detail ===
Time: 1.20s | Tokens: 198

=== High Detail ===
Time: 2.80s | Tokens: 1542

Break & fix

| Scenario | Symptom | Root Cause | Fix |
| --- | --- | --- | --- |
| DALL-E returns content filter error | ContentFilterError | Prompt triggered safety filters | Rephrase prompt; avoid potentially sensitive content |
| Image URL returns 400 | Invalid image URL | URL is not publicly accessible | Use base64 encoding for private images |
| Vision returns blurry analysis | Inaccurate descriptions | Using detail: "low" on complex images | Switch to detail: "high" for detailed analysis |
| Token count unexpectedly high | Large bill for vision requests | High-res images with detail: "high" | Use detail: "low" when full resolution isn't needed |
| DALL-E n>1 fails | InvalidRequestError | DALL-E 3 only supports n=1 | Set n=1; make multiple requests for multiple images |
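
The fix for the n>1 scenario, issuing one request per image, can be sketched as a small loop. `generate_images` is a hypothetical helper that takes the single-image call as a parameter, so the loop can be exercised without an Azure resource; the commented usage assumes the `client` object from Task 1.

```python
from typing import Callable, List


def generate_images(generate_one: Callable[[str], str], prompt: str, count: int) -> List[str]:
    """Collect `count` image URLs by calling `generate_one` once per image."""
    if count < 1:
        raise ValueError("count must be at least 1")
    # Each call is a separate request with n=1, so each image is billed separately.
    return [generate_one(prompt) for _ in range(count)]


# Usage with the Task 1 client (assumed to be in scope):
# def generate_one(prompt: str) -> str:
#     result = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
#     return result.data[0].url
#
# urls = generate_images(generate_one, "A minimalist cloud icon", count=3)
```

Because DALL-E 3 may revise the prompt on every request, the images returned this way are independent variations and not a single batch.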

Knowledge Check

1. What image sizes does DALL-E 3 support in Azure OpenAI?

2. How are images provided to GPT-4o for vision analysis?

3. What does the 'detail' parameter control when sending images to GPT-4o?

4. What is the 'revised_prompt' field in a DALL-E 3 response?

5. How long are DALL-E 3 generated image URLs valid before they expire?

Cleanup

az group delete --name rg-ai102-challenge18 --yes --no-wait

Learn More