Challenge 18: DALL-E and Multimodal Models

Estimated Time

45-60 min | Cost: ~$2.00 (estimated, image generation) | Domain: Generative AI Solutions (15-20%)

Exam skills covered

Use DALL-E to generate images
Use large multimodal models (GPT-4o vision capabilities)

Overview

Azure OpenAI provides access to multimodal AI capabilities through two primary features: DALL-E 3 for image generation and GPT-4o for vision understanding. DALL-E 3 generates images from text descriptions, supporting sizes of 1024×1024, 1024×1792, and 1792×1024 with configurable quality settings (standard or HD). Each generation request produces a unique image with a temporary URL valid for 24 hours.
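The constraints above can be checked locally before a request is sent, which avoids a round trip for a predictable rejection. The following is only a sketch: `check_dalle3_options` is a hypothetical helper, not part of the `openai` package, and the allowed values are simply the ones listed in this challenge.

```python
from typing import List

# Values taken from this challenge's description of DALL-E 3 (assumed complete).
ALLOWED_SIZES = ("1024x1024", "1024x1792", "1792x1024")
ALLOWED_QUALITIES = ("standard", "hd")
ALLOWED_STYLES = ("vivid", "natural")


def check_dalle3_options(size: str, quality: str = "standard",
                         style: str = "vivid", n: int = 1) -> List[str]:
    """Return a list of problems; an empty list means the options look valid."""
    problems = []
    if size not in ALLOWED_SIZES:
        problems.append(f"size must be one of {ALLOWED_SIZES}, got {size!r}")
    if quality not in ALLOWED_QUALITIES:
        problems.append(f"quality must be one of {ALLOWED_QUALITIES}, got {quality!r}")
    if style not in ALLOWED_STYLES:
        problems.append(f"style must be one of {ALLOWED_STYLES}, got {style!r}")
    if n != 1:
        problems.append(f"DALL-E 3 only supports n=1, got {n}")
    return problems


if __name__ == "__main__":
    print(check_dalle3_options("1024x1024"))                    # prints []
    print(check_dalle3_options("512x512", quality="hd", n=2))   # two problems reported
```

Returning a list rather than raising on the first problem lets a caller report every invalid option at once.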

GPT-4o's vision capabilities allow the model to analyze images provided either as URLs or base64-encoded data. The model can describe image content, extract text (OCR), interpret charts and diagrams, compare multiple images, and answer questions about visual content. Images are processed as special content parts within the chat completions API, maintaining the familiar message structure.

Image inputs consume tokens, and the cost depends on the detail parameter. With low, every image costs a fixed 85 tokens regardless of its size. With high, the image is scaled down to fit the model's size limits and then divided into 512×512 tiles, so the cost grows with the number of tiles needed to cover it.
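To make the tile arithmetic concrete, here is a rough estimator. It is a sketch under stated assumptions, not an official calculator: only the 85-token figure and the 512×512 tile size come from this challenge, while the 170 tokens per tile, the 2048-pixel bounding box, and the 768-pixel short side follow the scheme publicly documented for GPT-4o. Other models may use different constants, and `estimate_image_tokens` is a hypothetical name.

```python
import math

BASE_TOKENS = 85        # flat cost; also the full cost of a detail="low" image
TOKENS_PER_TILE = 170   # assumption: per-tile cost for GPT-4o
TILE_SIZE = 512
MAX_SIDE = 2048         # assumption: image is first fit inside 2048x2048
SHORT_SIDE = 768        # assumption: shortest side is then reduced to 768 px


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate the prompt tokens consumed by one image."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if detail == "low":
        return BASE_TOKENS
    if detail != "high":
        raise ValueError("detail must be 'low' or 'high' ('auto' is decided by the service)")

    # Step 1: shrink (never enlarge) to fit within MAX_SIDE x MAX_SIDE.
    scale = min(1.0, MAX_SIDE / max(width, height))
    w = max(1, round(width * scale))
    h = max(1, round(height * scale))

    # Step 2: shrink (never enlarge) so the shortest side is at most SHORT_SIDE.
    scale = min(1.0, SHORT_SIDE / min(w, h))
    w = max(1, round(w * scale))
    h = max(1, round(h * scale))

    # Step 3: count the 512x512 tiles needed to cover the scaled image.
    tiles = math.ceil(w / TILE_SIZE) * math.ceil(h / TILE_SIZE)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


if __name__ == "__main__":
    print(estimate_image_tokens(1024, 1024))          # 4 tiles -> 765
    print(estimate_image_tokens(2048, 4096))          # 6 tiles -> 1105
    print(estimate_image_tokens(4000, 3000, "low"))   # fixed -> 85
```

Comparing the estimate with `usage.prompt_tokens` on a real response is a quick way to check whether these assumed constants hold for your deployment.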

Architecture

This challenge generates images with DALL-E 3, analyzes images with GPT-4o vision, and explores OCR and chart understanding capabilities.

[Figure: Challenge 18 topology]

Prerequisites

Azure OpenAI resource with DALL-E 3 model deployed (deployment name: dall-e-3)
Azure OpenAI resource with GPT-4o model deployed (deployment name: gpt-4o)
Python 3.9+ with openai package installed
.NET 8 SDK with Azure.AI.OpenAI NuGet package
Sample images for vision analysis (URLs or local files)

Implementation

Task 1: Generate Images with DALL-E 3

Python SDK (C# SDK and REST API equivalents follow)

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

# Generate an image with DALL-E 3
result = client.images.generate(
    model="dall-e-3",  # deployment name
    prompt="A modern cloud data center with glowing blue network connections, "
           "isometric 3D illustration style, clean white background",
    size="1024x1024",       # Options: 1024x1024, 1024x1792, 1792x1024
    quality="standard",     # Options: standard, hd
    style="vivid",          # Options: vivid, natural
    n=1                     # DALL-E 3 only supports n=1
)

image_url = result.data[0].url
revised_prompt = result.data[0].revised_prompt

print(f"Image URL: {image_url}")
print(f"Revised prompt: {revised_prompt}")

# Generate HD quality portrait image
result_hd = client.images.generate(
    model="dall-e-3",
    prompt="Professional headshot photo of a friendly AI robot assistant, "
           "soft studio lighting, shallow depth of field",
    size="1024x1792",   # Portrait orientation
    quality="hd",       # Higher detail
    style="natural",    # More photorealistic
    n=1
)

print(f"\nHD Image URL: {result_hd.data[0].url}")
print(f"Revised prompt: {result_hd.data[0].revised_prompt}")

C# SDK

using Azure;
using Azure.AI.OpenAI;
using OpenAI.Images;

string endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!;
string apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!;

AzureOpenAIClient azureClient = new(
    new Uri(endpoint),
    new AzureKeyCredential(apiKey));

ImageClient imageClient = azureClient.GetImageClient("dall-e-3");

// Generate an image with DALL-E 3
GeneratedImage image = await imageClient.GenerateImageAsync(
    "A modern cloud data center with glowing blue network connections, "
    + "isometric 3D illustration style, clean white background",
    new ImageGenerationOptions
    {
        Size = GeneratedImageSize.W1024xH1024,
        Quality = GeneratedImageQuality.Standard,
        Style = GeneratedImageStyle.Vivid
    });

Console.WriteLine($"Image URL: {image.ImageUri}");
Console.WriteLine($"Revised prompt: {image.RevisedPrompt}");

// Generate HD quality portrait image
GeneratedImage imageHd = await imageClient.GenerateImageAsync(
    "Professional headshot photo of a friendly AI robot assistant, "
    + "soft studio lighting, shallow depth of field",
    new ImageGenerationOptions
    {
        Size = GeneratedImageSize.W1024xH1792,
        Quality = GeneratedImageQuality.High,
        Style = GeneratedImageStyle.Natural
    });

Console.WriteLine($"\nHD Image URL: {imageHd.ImageUri}");
Console.WriteLine($"Revised prompt: {imageHd.RevisedPrompt}");

REST API

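# Generate image with DALL-E 3
# AZURE_OPENAI_ENDPOINT already includes https://, as the SDK examples require; %/ strips a trailing slash
curl -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/dall-e-3/images/generations?api-version=2024-10-21" \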
  -H "Content-Type: application/json" \
  -H "api-key: ${AZURE_OPENAI_KEY}" \
  -d '{
    "prompt": "A modern cloud data center with glowing blue network connections, isometric 3D illustration style, clean white background",
    "size": "1024x1024",
    "quality": "standard",
    "style": "vivid",
    "n": 1
  }'

# Generate HD portrait image
curl -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/dall-e-3/images/generations?api-version=2024-10-21" \
  -H "Content-Type: application/json" \
  -H "api-key: ${AZURE_OPENAI_KEY}" \
  -d '{
    "prompt": "Professional headshot photo of a friendly AI robot assistant, soft studio lighting, shallow depth of field",
    "size": "1024x1792",
    "quality": "hd",
    "style": "natural",
    "n": 1
  }'

# Response includes:
# - data[0].url: Temporary URL (valid 24 hours)
# - data[0].revised_prompt: DALL-E 3's enhanced version of your prompt
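Because the returned URL is temporary, an image you want to keep has to be copied somewhere durable before it expires. The sketch below uses only the standard library. `save_image` and the destination path are hypothetical names chosen for illustration, and production code would likely add retries and write to Blob Storage rather than local disk.

```python
import shutil
import urllib.request
from pathlib import Path


def save_image(url: str, dest: str, timeout: float = 30.0) -> Path:
    """Stream the content at `url` into the file `dest` and return its path."""
    dest_path = Path(dest)
    # Create the target folder if it does not exist yet.
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(dest_path, "wb") as out_file:
        # Copy in chunks so large HD images are not held fully in memory.
        shutil.copyfileobj(response, out_file)
    return dest_path


# Example wiring with the Python result from this task (not executed here):
#   saved = save_image(result.data[0].url, "output/datacenter.png")
#   print(f"Saved to {saved}")
```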

Task 2: Analyze Images with GPT-4o Vision (URL Input)

Python SDK (C# SDK and REST API equivalents follow)

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

# Analyze an image using a URL
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are an image analysis assistant. Describe images accurately and concisely."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in detail. What architecture components do you see?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://learn.microsoft.com/en-us/azure/architecture/guide/images/architecture-styles/web-queue-worker-logical.svg",
                        "detail": "high"  # Options: low, high, auto
                    }
                }
            ]
        }
    ],
    max_tokens=500
)

print(f"Analysis: {response.choices[0].message.content}")
print(f"Tokens used: {response.usage.total_tokens}")

# Compare multiple images
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two architecture diagrams. What are the key differences?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/architecture-v1.png",
                        "detail": "high"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/architecture-v2.png",
                        "detail": "high"
                    }
                }
            ]
        }
    ],
    max_tokens=500
)

print(f"\nComparison: {response.choices[0].message.content}")

C# SDK

using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;

string endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!;
string apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!;

AzureOpenAIClient azureClient = new(
    new Uri(endpoint),
    new AzureKeyCredential(apiKey));

ChatClient chatClient = azureClient.GetChatClient("gpt-4o");

// Analyze an image using a URL
var messages = new ChatMessage[]
{
    new SystemChatMessage("You are an image analysis assistant. Describe images accurately and concisely."),
    new UserChatMessage(
        ChatMessageContentPart.CreateTextPart("Describe this image in detail. What architecture components do you see?"),
        ChatMessageContentPart.CreateImagePart(
            new Uri("https://learn.microsoft.com/en-us/azure/architecture/guide/images/architecture-styles/web-queue-worker-logical.svg"),
            ChatImageDetailLevel.High))
};

ChatCompletion result = await chatClient.CompleteChatAsync(
    messages,
    new ChatCompletionOptions { MaxOutputTokenCount = 500 });

Console.WriteLine($"Analysis: {result.Content[0].Text}");
Console.WriteLine($"Tokens used: {result.Usage.TotalTokenCount}");

// Compare multiple images
var compareMessages = new ChatMessage[]
{
    new UserChatMessage(
        ChatMessageContentPart.CreateTextPart("Compare these two architecture diagrams. What are the key differences?"),
        ChatMessageContentPart.CreateImagePart(
            new Uri("https://example.com/architecture-v1.png"),
            ChatImageDetailLevel.High),
        ChatMessageContentPart.CreateImagePart(
            new Uri("https://example.com/architecture-v2.png"),
            ChatImageDetailLevel.High))
};

ChatCompletion compareResult = await chatClient.CompleteChatAsync(
    compareMessages,
    new ChatCompletionOptions { MaxOutputTokenCount = 500 });

Console.WriteLine($"\nComparison: {compareResult.Content[0].Text}");

REST API

# Analyze image with GPT-4o vision (URL input)
curl -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
  -H "Content-Type: application/json" \
  -H "api-key: ${AZURE_OPENAI_KEY}" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are an image analysis assistant."
      },
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this architecture diagram in detail."},
          {
            "type": "image_url",
            "image_url": {
              "url": "https://learn.microsoft.com/en-us/azure/architecture/guide/images/architecture-styles/web-queue-worker-logical.svg",
              "detail": "high"
            }
          }
        ]
      }
    ],
    "max_tokens": 500
  }'

# Compare multiple images
curl -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
  -H "Content-Type: application/json" \
  -H "api-key: ${AZURE_OPENAI_KEY}" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Compare these two images."},
          {"type": "image_url", "image_url": {"url": "https://example.com/img1.png", "detail": "high"}},
          {"type": "image_url", "image_url": {"url": "https://example.com/img2.png", "detail": "high"}}
        ]
      }
    ],
    "max_tokens": 500
  }'

Task 3: Extract Text from Images (OCR with Vision)

Python SDK (C# SDK and REST API equivalents follow)

import os
import base64
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

def encode_image_to_base64(image_path: str) -> str:
    """Encode a local image file to base64."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Option 1: Base64-encoded local image
image_base64 = encode_image_to_base64("sample-document.png")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are an OCR assistant. Extract all visible text from the image exactly as written. Preserve formatting where possible."
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text from this document image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_base64}",
                        "detail": "high"
                    }
                }
            ]
        }
    ],
    max_tokens=1000
)

print("Extracted text:")
print(response.choices[0].message.content)

# Option 2: Chart/diagram understanding
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Analyze this chart. What trends do you see? Provide the data points if visible."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/sales-chart-q4.png",
                        "detail": "high"
                    }
                }
            ]
        }
    ],
    max_tokens=500
)

print(f"\nChart analysis: {response.choices[0].message.content}")

C# SDK

using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;

string endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!;
string apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!;

AzureOpenAIClient azureClient = new(
    new Uri(endpoint),
    new AzureKeyCredential(apiKey));

ChatClient chatClient = azureClient.GetChatClient("gpt-4o");

// Base64-encoded local image for OCR
byte[] imageBytes = await File.ReadAllBytesAsync("sample-document.png");
BinaryData imageData = BinaryData.FromBytes(imageBytes);

var ocrMessages = new ChatMessage[]
{
    new SystemChatMessage("You are an OCR assistant. Extract all visible text exactly as written."),
    new UserChatMessage(
        ChatMessageContentPart.CreateTextPart("Extract all text from this document image."),
        ChatMessageContentPart.CreateImagePart(imageData, "image/png",
            ChatImageDetailLevel.High))
};

ChatCompletion ocrResult = await chatClient.CompleteChatAsync(
    ocrMessages,
    new ChatCompletionOptions { MaxOutputTokenCount = 1000 });

Console.WriteLine("Extracted text:");
Console.WriteLine(ocrResult.Content[0].Text);

// Chart understanding
var chartMessages = new ChatMessage[]
{
    new UserChatMessage(
        ChatMessageContentPart.CreateTextPart("Analyze this chart. What trends do you see?"),
        ChatMessageContentPart.CreateImagePart(
            new Uri("https://example.com/sales-chart-q4.png"),
            ChatImageDetailLevel.High))
};

ChatCompletion chartResult = await chatClient.CompleteChatAsync(
    chartMessages,
    new ChatCompletionOptions { MaxOutputTokenCount = 500 });

Console.WriteLine($"\nChart analysis: {chartResult.Content[0].Text}");

REST API

# Base64 image input for OCR
# First, encode image to base64 on a single line
# (-w 0 is GNU coreutils syntax; on macOS use: base64 -i sample-document.png)
IMAGE_BASE64=$(base64 -w 0 sample-document.png)

curl -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
  -H "Content-Type: application/json" \
  -H "api-key: ${AZURE_OPENAI_KEY}" \
  -d "{
    \"messages\": [
      {\"role\": \"system\", \"content\": \"You are an OCR assistant. Extract all visible text.\"},
      {
        \"role\": \"user\",
        \"content\": [
          {\"type\": \"text\", \"text\": \"Extract all text from this document.\"},
          {
            \"type\": \"image_url\",
            \"image_url\": {
              \"url\": \"data:image/png;base64,${IMAGE_BASE64}\",
              \"detail\": \"high\"
            }
          }
        ]
      }
    ],
    \"max_tokens\": 1000
  }"

# Chart analysis with URL
curl -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
  -H "Content-Type: application/json" \
  -H "api-key: ${AZURE_OPENAI_KEY}" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Analyze this chart and describe the trends."},
          {"type": "image_url", "image_url": {"url": "https://example.com/chart.png", "detail": "high"}}
        ]
      }
    ],
    "max_tokens": 500
  }'
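The examples in this task pass images in two ways: a public URL, or a base64 `data:` URL for a local file. A small helper can hide that distinction from calling code. This is a minimal sketch, assuming the media type can be guessed from the file extension. `image_part` and `user_message` are hypothetical names, not SDK functions.

```python
import base64
import mimetypes
from pathlib import Path


def image_part(source: str, detail: str = "auto") -> dict:
    """Build an image_url content part from a public URL or a local file path."""
    if source.startswith(("http://", "https://")):
        url = source  # remote image: the service fetches it directly
    else:
        path = Path(source)
        mime, _ = mimetypes.guess_type(path.name)
        if mime is None or not mime.startswith("image/"):
            raise ValueError(f"cannot determine an image media type for {source!r}")
        encoded = base64.b64encode(path.read_bytes()).decode("ascii")
        url = f"data:{mime};base64,{encoded}"  # local image: embed the bytes
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


def user_message(text: str, *sources: str, detail: str = "auto") -> dict:
    """Combine a text prompt and any number of images into one user message."""
    content = [{"type": "text", "text": text}]
    content.extend(image_part(source, detail) for source in sources)
    return {"role": "user", "content": content}


# Example wiring with the client from this task (not executed here):
#   messages = [user_message("Extract all text from this document image.",
#                            "sample-document.png", detail="high")]
#   response = client.chat.completions.create(
#       model="gpt-4o", messages=messages, max_tokens=1000)
```

Guessing the media type avoids hardcoding `image/png` for every file, at the cost of trusting the file extension.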

Task 4: Compare Image Analysis Approaches

Python SDK (C# SDK and REST API equivalents follow)

import os
import time
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21"
)

image_url = "https://example.com/sample-architecture.png"

# Low detail: Fixed cost (85 tokens), faster, less accurate
start = time.time()
response_low = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": image_url, "detail": "low"}}
        ]
    }],
    max_tokens=300
)
time_low = time.time() - start

# High detail: Variable cost (based on image size), slower, more accurate
start = time.time()
response_high = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": image_url, "detail": "high"}}
        ]
    }],
    max_tokens=300
)
time_high = time.time() - start

print("=== Low Detail ===")
print(f"Time: {time_low:.2f}s | Tokens: {response_low.usage.total_tokens}")
print(f"Response: {response_low.choices[0].message.content[:200]}...")

print("\n=== High Detail ===")
print(f"Time: {time_high:.2f}s | Tokens: {response_high.usage.total_tokens}")
print(f"Response: {response_high.choices[0].message.content[:200]}...")

print("\n=== Cost Comparison ===")
print("Low detail always uses 85 image tokens (fixed)")
print(f"High detail uses {response_high.usage.prompt_tokens - response_low.usage.prompt_tokens} additional image tokens")

C# SDK

using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;
using System.Diagnostics;

string endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!;
string apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!;

AzureOpenAIClient azureClient = new(
    new Uri(endpoint),
    new AzureKeyCredential(apiKey));

ChatClient chatClient = azureClient.GetChatClient("gpt-4o");

var imageUri = new Uri("https://example.com/sample-architecture.png");

// Low detail comparison
var sw = Stopwatch.StartNew();
var lowResult = await chatClient.CompleteChatAsync(
    new ChatMessage[] {
        new UserChatMessage(
            ChatMessageContentPart.CreateTextPart("Describe this image."),
            ChatMessageContentPart.CreateImagePart(imageUri, ChatImageDetailLevel.Low))
    },
    new ChatCompletionOptions { MaxOutputTokenCount = 300 });
sw.Stop();
var timeLow = sw.Elapsed;

// High detail comparison
sw.Restart();
var highResult = await chatClient.CompleteChatAsync(
    new ChatMessage[] {
        new UserChatMessage(
            ChatMessageContentPart.CreateTextPart("Describe this image."),
            ChatMessageContentPart.CreateImagePart(imageUri, ChatImageDetailLevel.High))
    },
    new ChatCompletionOptions { MaxOutputTokenCount = 300 });
sw.Stop();
var timeHigh = sw.Elapsed;

Console.WriteLine($"Low Detail: {timeLow.TotalSeconds:F2}s | Tokens: {lowResult.Value.Usage.TotalTokenCount}");
Console.WriteLine($"High Detail: {timeHigh.TotalSeconds:F2}s | Tokens: {highResult.Value.Usage.TotalTokenCount}");
Console.WriteLine($"\nLow detail always uses 85 image tokens (fixed cost)");

REST API

# Low detail (85 tokens, fast)
echo "=== Low Detail ==="
time curl -s -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
  -H "Content-Type: application/json" \
  -H "api-key: ${AZURE_OPENAI_KEY}" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/architecture.png", "detail": "low"}}
      ]
    }],
    "max_tokens": 300
  }' | jq '.usage'

# High detail (variable tokens, more accurate)
echo "=== High Detail ==="
time curl -s -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
  -H "Content-Type: application/json" \
  -H "api-key: ${AZURE_OPENAI_KEY}" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/architecture.png", "detail": "high"}}
      ]
    }],
    "max_tokens": 300
  }' | jq '.usage'

Expected Output

Image URL: https://dalleproduse.blob.core.windows.net/...
Revised prompt: A highly detailed modern cloud data center rendered in isometric 3D...

HD Image URL: https://dalleproduse.blob.core.windows.net/...
Revised prompt: A professional studio headshot photograph of a friendly humanoid robot...

Analysis: The image shows a web-queue-worker architecture pattern with a web frontend
connected to a message queue, which feeds into a background worker process...

Extracted text:
INVOICE #12345
Date: 2024-03-15
Customer: Contoso Ltd.
...

=== Low Detail ===
Time: 1.2s | Tokens: 198
=== High Detail ===
Time: 2.8s | Tokens: 1542

Break & fix

Scenario	Symptom	Root Cause	Fix
DALL-E returns content filter error	HTTP 400 (`BadRequestError` in the Python SDK) reporting a content filter violation	Prompt triggered safety filters	Rephrase prompt; avoid potentially sensitive content
Image URL returns 400	`Invalid image URL`	URL is not publicly accessible	Use base64 encoding for private images
Vision returns vague analysis	Inaccurate or generic descriptions	Using `detail: "low"` on complex images	Switch to `detail: "high"` for detailed analysis
Token count unexpectedly high	Large bill for vision requests	High-res images with `detail: "high"`	Use `detail: "low"` when full resolution isn't needed
DALL-E n>1 fails	HTTP 400 (`BadRequestError` in the Python SDK)	DALL-E 3 only supports n=1	Set n=1; make multiple requests for multiple images
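The fix in the last row, issuing one request per image, can be sketched as a small loop. This is an illustration under assumptions rather than a recommended pattern: `generate_images` is a hypothetical helper, and `generate_one` stands for any callable that sends a single n=1 request and returns the image URL.

```python
from typing import Callable, List, Tuple


def generate_images(generate_one: Callable[[str], str], prompt: str,
                    count: int) -> Tuple[List[str], List[Exception]]:
    """Call `generate_one` `count` times; return (image URLs, errors)."""
    if count < 1:
        raise ValueError("count must be at least 1")
    urls: List[str] = []
    errors: List[Exception] = []
    for _ in range(count):
        try:
            urls.append(generate_one(prompt))
        except Exception as exc:
            # Keep going so one rejected request does not discard the others.
            errors.append(exc)
    return urls, errors


# Example wiring with the client from Task 1 (not executed here):
#   urls, errors = generate_images(
#       lambda p: client.images.generate(
#           model="dall-e-3", prompt=p, size="1024x1024", n=1).data[0].url,
#       "A modern cloud data center, isometric 3D illustration style",
#       count=3)
```

Each call is billed as a separate generation, and because DALL-E 3 rewrites prompts, the resulting images will generally differ from one another.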

Knowledge Check

1. What image sizes does DALL-E 3 support in Azure OpenAI?

2. How are images provided to GPT-4o for vision analysis?

3. What does the 'detail' parameter control when sending images to GPT-4o?

4. What is the 'revised_prompt' field in a DALL-E 3 response?

5. How long are DALL-E 3 generated image URLs valid before they expire?

Cleanup

az group delete --name rg-ai102-challenge18 --yes --no-wait
