Challenge 18: DALL-E and Multimodal Models
45-60 min | Cost: ~$2.00 (estimated, image generation) | Domain: Generative AI Solutions (15-20%)
Exam skills covered
- Use DALL-E to generate images
- Use large multimodal models (GPT-4o vision capabilities)
Overview
Azure OpenAI provides multimodal AI capabilities through two main features: DALL-E 3 for image generation and GPT-4o for visual understanding. DALL-E 3 generates images from text descriptions, supports sizes of 1024×1024, 1024×1792, and 1792×1024, and offers two quality settings (standard or HD). Each generation request produces a single image, returned as a temporary URL that is valid for 24 hours.
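As a small illustration of the size and quality constraints just described, the sketch below validates options before a request is sent. The helper name `dalle3_options` is hypothetical and not part of the `openai` package; the allowed values are the ones listed in this challenge.

```python
# Allowed values as listed in this challenge for DALL-E 3 on Azure OpenAI.
DALLE3_SIZES = ("1024x1024", "1024x1792", "1792x1024")
DALLE3_QUALITIES = ("standard", "hd")


def dalle3_options(size: str = "1024x1024", quality: str = "standard") -> dict:
    """Return keyword arguments for an image request, or raise on unsupported values.

    Hypothetical helper for illustration only.
    """
    if size not in DALLE3_SIZES:
        raise ValueError(f"unsupported size {size!r}; expected one of {DALLE3_SIZES}")
    if quality not in DALLE3_QUALITIES:
        raise ValueError(f"unsupported quality {quality!r}; expected one of {DALLE3_QUALITIES}")
    # Each request produces a single image, so n is pinned to 1.
    return {"size": size, "quality": quality, "n": 1}
```

The returned dictionary could then be unpacked into a generation call, for example `client.images.generate(model="dall-e-3", prompt=..., **dalle3_options(quality="hd"))`, so that a typo in the size fails locally instead of costing a round trip.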
GPT-4o vision capabilities let the model analyze images supplied as URLs or as base64-encoded data. The model can describe image content, extract text (OCR), interpret charts and diagrams, compare multiple images, and answer questions about visual content. Images are passed as special content parts within the chat completions API, so the familiar message structure is preserved.
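To make the idea of image content parts concrete, here is a minimal sketch that builds a user message from a prompt plus any number of images, given either as URLs or as local file paths. The helper names `image_part` and `user_message` are hypothetical (they are not SDK functions); the dictionary layout follows the `image_url` parts used in the tasks of this challenge.

```python
import base64
import mimetypes
from pathlib import Path


def image_part(source: str, detail: str = "auto") -> dict:
    """Build one image_url content part from a URL or a local file path."""
    if detail not in {"low", "high", "auto"}:
        raise ValueError(f"unsupported detail level: {detail!r}")
    if source.startswith(("http://", "https://", "data:")):
        url = source  # already a URL or data URL, pass through unchanged
    else:
        # Local file: embed the bytes as a base64 data URL.
        path = Path(source)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        encoded = base64.b64encode(path.read_bytes()).decode("ascii")
        url = f"data:{mime};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


def user_message(text: str, *images: str, detail: str = "auto") -> dict:
    """Combine a text prompt and zero or more images into one user message."""
    parts = [{"type": "text", "text": text}]
    parts.extend(image_part(src, detail) for src in images)
    return {"role": "user", "content": parts}
```

A message built this way can be placed directly in the `messages` list of a chat completions request, for example `messages=[user_message("Describe this image.", "diagram.png", detail="high")]`.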
When working with multimodal inputs, token costs matter because the cost of image analysis varies with resolution. The detail parameter controls processing: low uses a fixed 85 tokens regardless of image size, while high processes the image at higher resolution, with cost proportional to the number of 512×512 tiles needed to cover the image.
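The tile-based cost can be approximated before sending a request. The sketch below is an estimate under stated assumptions, not a billing formula: the 85-token base comes from the text above, while the 170 tokens per tile and the two resize steps (fit within 2048×2048, then shrink the shortest side to 768 px) are taken from OpenAI's published guidance for GPT-4o and may differ by model or API version. The function name `estimate_image_tokens` is hypothetical.

```python
import math

BASE_TOKENS = 85       # fixed cost; also the full cost when detail="low"
TOKENS_PER_TILE = 170  # assumption: per-tile cost for detail="high"


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Roughly estimate the prompt tokens consumed by one image."""
    if detail == "low":
        return BASE_TOKENS
    w, h = width, height
    # Assumed step 1: shrink to fit inside a 2048x2048 square, keeping aspect ratio.
    if max(w, h) > 2048:
        scale = 2048 / max(w, h)
        w, h = round(w * scale), round(h * scale)
    # Assumed step 2: shrink so the shortest side is at most 768 px.
    if min(w, h) > 768:
        scale = 768 / min(w, h)
        w, h = round(w * scale), round(h * scale)
    # Count the 512x512 tiles needed to cover the resized image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


print(estimate_image_tokens(1024, 1024, "low"))   # 85
print(estimate_image_tokens(1024, 1024, "high"))  # 765
print(estimate_image_tokens(2048, 4096, "high"))  # 1105
```

The authoritative number is always the `usage.prompt_tokens` value returned with the response; treat this estimate as a planning aid for choosing between low and high detail.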
Architecture
This challenge generates images with DALL-E 3, analyzes images with GPT-4o vision, and explores OCR and chart-understanding capabilities.
Prerequisites
- Azure OpenAI resource with a DALL-E 3 model deployed (deployment name: dall-e-3)
- Azure OpenAI resource with a GPT-4o model deployed (deployment name: gpt-4o)
- Python 3.9+ with the openai package installed
- .NET 8 SDK with the Azure.AI.OpenAI NuGet package
- Sample images for vision analysis (URLs or local files)
Implementation
Task 1: Generate Images with DALL-E 3
- Python SDK
- C# SDK
- REST API
import os
from openai import AzureOpenAI
client = AzureOpenAI(
azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
api_key=os.environ["AZURE_OPENAI_KEY"],
api_version="2024-10-21"
)
# Generate an image with DALL-E 3
result = client.images.generate(
model="dall-e-3", # deployment name
prompt="A modern cloud data center with glowing blue network connections, "
"isometric 3D illustration style, clean white background",
size="1024x1024", # Options: 1024x1024, 1024x1792, 1792x1024
quality="standard", # Options: standard, hd
style="vivid", # Options: vivid, natural
n=1 # DALL-E 3 only supports n=1
)
image_url = result.data[0].url
revised_prompt = result.data[0].revised_prompt
print(f"Image URL: {image_url}")
print(f"Revised prompt: {revised_prompt}")
# Generate HD quality portrait image
result_hd = client.images.generate(
model="dall-e-3",
prompt="Professional headshot photo of a friendly AI robot assistant, "
"soft studio lighting, shallow depth of field",
size="1024x1792", # Portrait orientation
quality="hd", # Higher detail
style="natural", # More photorealistic
n=1
)
print(f"\nHD Image URL: {result_hd.data[0].url}")
print(f"Revised prompt: {result_hd.data[0].revised_prompt}")
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Images;
string endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!;
string apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!;
AzureOpenAIClient azureClient = new(
new Uri(endpoint),
new AzureKeyCredential(apiKey));
ImageClient imageClient = azureClient.GetImageClient("dall-e-3");
// Generate an image with DALL-E 3
GeneratedImage image = await imageClient.GenerateImageAsync(
"A modern cloud data center with glowing blue network connections, "
+ "isometric 3D illustration style, clean white background",
new ImageGenerationOptions
{
Size = GeneratedImageSize.W1024xH1024,
Quality = GeneratedImageQuality.Standard,
Style = GeneratedImageStyle.Vivid
});
Console.WriteLine($"Image URL: {image.ImageUri}");
Console.WriteLine($"Revised prompt: {image.RevisedPrompt}");
// Generate HD quality portrait image
GeneratedImage imageHd = await imageClient.GenerateImageAsync(
"Professional headshot photo of a friendly AI robot assistant, "
+ "soft studio lighting, shallow depth of field",
new ImageGenerationOptions
{
Size = GeneratedImageSize.W1024xH1792,
Quality = GeneratedImageQuality.High,
Style = GeneratedImageStyle.Natural
});
Console.WriteLine($"\nHD Image URL: {imageHd.ImageUri}");
Console.WriteLine($"Revised prompt: {imageHd.RevisedPrompt}");
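# Generate image with DALL-E 3
# AZURE_OPENAI_ENDPOINT already includes the https:// scheme (the SDK examples pass it as-is);
# ${AZURE_OPENAI_ENDPOINT%/} drops a trailing slash if one is present.
curl -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/dall-e-3/images/generations?api-version=2024-10-21" \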
-H "Content-Type: application/json" \
-H "api-key: ${AZURE_OPENAI_KEY}" \
-d '{
"prompt": "A modern cloud data center with glowing blue network connections, isometric 3D illustration style, clean white background",
"size": "1024x1024",
"quality": "standard",
"style": "vivid",
"n": 1
}'
# Generate HD portrait image
curl -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/dall-e-3/images/generations?api-version=2024-10-21" \
-H "Content-Type: application/json" \
-H "api-key: ${AZURE_OPENAI_KEY}" \
-d '{
"prompt": "Professional headshot photo of a friendly AI robot assistant, soft studio lighting, shallow depth of field",
"size": "1024x1792",
"quality": "hd",
"style": "natural",
"n": 1
}'
# Response includes:
# - data[0].url: Temporary URL (valid 24 hours)
# - data[0].revised_prompt: DALL-E 3's enhanced version of your prompt
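Because the returned URL is temporary, it is usually worth saving the image locally soon after generation. The following is a minimal sketch using only the Python standard library; the function name `save_generated_image` and the destination path are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path


def save_generated_image(image_url: str, destination: str) -> Path:
    """Stream the image at image_url into a local file and return its path."""
    target = Path(destination)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so large images are not held fully in memory.
    with urllib.request.urlopen(image_url, timeout=60) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

With the Python result from this task, a call might look like `save_generated_image(result.data[0].url, "output/datacenter.png")`.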
Task 2: Analyze Images with GPT-4o Vision (URL Input)
- Python SDK
- C# SDK
- REST API
import os
from openai import AzureOpenAI
client = AzureOpenAI(
azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
api_key=os.environ["AZURE_OPENAI_KEY"],
api_version="2024-10-21"
)
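# Analyze an image using a URL
# Note: vision input is documented for raster formats (PNG, JPEG, GIF, WEBP);
# if the SVG URL below is rejected, substitute a PNG or JPEG image.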
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": "You are an image analysis assistant. Describe images accurately and concisely."
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in detail. What architecture components do you see?"
},
{
"type": "image_url",
"image_url": {
"url": "https://learn.microsoft.com/en-us/azure/architecture/guide/images/architecture-styles/web-queue-worker-logical.svg",
"detail": "high" # Options: low, high, auto
}
}
]
}
],
max_tokens=500
)
print(f"Analysis: {response.choices[0].message.content}")
print(f"Tokens used: {response.usage.total_tokens}")
# Compare multiple images
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Compare these two architecture diagrams. What are the key differences?"},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/architecture-v1.png",
"detail": "high"
}
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/architecture-v2.png",
"detail": "high"
}
}
]
}
],
max_tokens=500
)
print(f"\nComparison: {response.choices[0].message.content}")
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;
string endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!;
string apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!;
AzureOpenAIClient azureClient = new(
new Uri(endpoint),
new AzureKeyCredential(apiKey));
ChatClient chatClient = azureClient.GetChatClient("gpt-4o");
// Analyze an image using a URL
var messages = new ChatMessage[]
{
new SystemChatMessage("You are an image analysis assistant. Describe images accurately and concisely."),
new UserChatMessage(
ChatMessageContentPart.CreateTextPart("Describe this image in detail. What architecture components do you see?"),
ChatMessageContentPart.CreateImagePart(
new Uri("https://learn.microsoft.com/en-us/azure/architecture/guide/images/architecture-styles/web-queue-worker-logical.svg"),
ImageChatMessageContentPartDetail.High))
};
ChatCompletion result = await chatClient.CompleteChatAsync(
messages,
new ChatCompletionOptions { MaxOutputTokenCount = 500 });
Console.WriteLine($"Analysis: {result.Content[0].Text}");
Console.WriteLine($"Tokens used: {result.Usage.TotalTokenCount}");
// Compare multiple images
var compareMessages = new ChatMessage[]
{
new UserChatMessage(
ChatMessageContentPart.CreateTextPart("Compare these two architecture diagrams. What are the key differences?"),
ChatMessageContentPart.CreateImagePart(
new Uri("https://example.com/architecture-v1.png"),
ImageChatMessageContentPartDetail.High),
ChatMessageContentPart.CreateImagePart(
new Uri("https://example.com/architecture-v2.png"),
ImageChatMessageContentPartDetail.High))
};
ChatCompletion compareResult = await chatClient.CompleteChatAsync(
compareMessages,
new ChatCompletionOptions { MaxOutputTokenCount = 500 });
Console.WriteLine($"\nComparison: {compareResult.Content[0].Text}");
# Analyze image with GPT-4o vision (URL input)
curl -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
-H "Content-Type: application/json" \
-H "api-key: ${AZURE_OPENAI_KEY}" \
-d '{
"messages": [
{
"role": "system",
"content": "You are an image analysis assistant."
},
{
"role": "user",
"content": [
{"type": "text", "text": "Describe this architecture diagram in detail."},
{
"type": "image_url",
"image_url": {
"url": "https://learn.microsoft.com/en-us/azure/architecture/guide/images/architecture-styles/web-queue-worker-logical.svg",
"detail": "high"
}
}
]
}
],
"max_tokens": 500
}'
# Compare multiple images
curl -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
-H "Content-Type: application/json" \
-H "api-key: ${AZURE_OPENAI_KEY}" \
-d '{
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Compare these two images."},
{"type": "image_url", "image_url": {"url": "https://example.com/img1.png", "detail": "high"}},
{"type": "image_url", "image_url": {"url": "https://example.com/img2.png", "detail": "high"}}
]
}
],
"max_tokens": 500
}'
Task 3: Extract Text from Images (OCR with Vision)
- Python SDK
- C# SDK
- REST API
import os
import base64
from openai import AzureOpenAI
client = AzureOpenAI(
azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
api_key=os.environ["AZURE_OPENAI_KEY"],
api_version="2024-10-21"
)
def encode_image_to_base64(image_path: str) -> str:
    """Encode a local image file to base64."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
# Option 1: Base64-encoded local image
image_base64 = encode_image_to_base64("sample-document.png")
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": "You are an OCR assistant. Extract all visible text from the image exactly as written. Preserve formatting where possible."
},
{
"role": "user",
"content": [
{"type": "text", "text": "Extract all text from this document image."},
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{image_base64}",
"detail": "high"
}
}
]
}
],
max_tokens=1000
)
print("Extracted text:")
print(response.choices[0].message.content)
# Option 2: Chart/diagram understanding
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Analyze this chart. What trends do you see? Provide the data points if visible."
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/sales-chart-q4.png",
"detail": "high"
}
}
]
}
],
max_tokens=500
)
print(f"\nChart analysis: {response.choices[0].message.content}")
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;
string endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!;
string apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!;
AzureOpenAIClient azureClient = new(
new Uri(endpoint),
new AzureKeyCredential(apiKey));
ChatClient chatClient = azureClient.GetChatClient("gpt-4o");
// Base64-encoded local image for OCR
byte[] imageBytes = await File.ReadAllBytesAsync("sample-document.png");
BinaryData imageData = BinaryData.FromBytes(imageBytes);
var ocrMessages = new ChatMessage[]
{
new SystemChatMessage("You are an OCR assistant. Extract all visible text exactly as written."),
new UserChatMessage(
ChatMessageContentPart.CreateTextPart("Extract all text from this document image."),
ChatMessageContentPart.CreateImagePart(imageData, "image/png",
ImageChatMessageContentPartDetail.High))
};
ChatCompletion ocrResult = await chatClient.CompleteChatAsync(
ocrMessages,
new ChatCompletionOptions { MaxOutputTokenCount = 1000 });
Console.WriteLine("Extracted text:");
Console.WriteLine(ocrResult.Content[0].Text);
// Chart understanding
var chartMessages = new ChatMessage[]
{
new UserChatMessage(
ChatMessageContentPart.CreateTextPart("Analyze this chart. What trends do you see?"),
ChatMessageContentPart.CreateImagePart(
new Uri("https://example.com/sales-chart-q4.png"),
ImageChatMessageContentPartDetail.High))
};
ChatCompletion chartResult = await chatClient.CompleteChatAsync(
chartMessages,
new ChatCompletionOptions { MaxOutputTokenCount = 500 });
Console.WriteLine($"\nChart analysis: {chartResult.Content[0].Text}");
# Base64 image input for OCR
# First, encode image to base64 (-w 0 disables line wrapping in GNU base64;
# on macOS use: base64 -i sample-document.png)
IMAGE_BASE64=$(base64 -w 0 sample-document.png)
curl -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
-H "Content-Type: application/json" \
-H "api-key: ${AZURE_OPENAI_KEY}" \
-d "{
\"messages\": [
{\"role\": \"system\", \"content\": \"You are an OCR assistant. Extract all visible text.\"},
{
\"role\": \"user\",
\"content\": [
{\"type\": \"text\", \"text\": \"Extract all text from this document.\"},
{
\"type\": \"image_url\",
\"image_url\": {
\"url\": \"data:image/png;base64,${IMAGE_BASE64}\",
\"detail\": \"high\"
}
}
]
}
],
\"max_tokens\": 1000
}"
# Chart analysis with URL
curl -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
-H "Content-Type: application/json" \
-H "api-key: ${AZURE_OPENAI_KEY}" \
-d '{
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Analyze this chart and describe the trends."},
{"type": "image_url", "image_url": {"url": "https://example.com/chart.png", "detail": "high"}}
]
}
],
"max_tokens": 500
}'
Task 4: Compare Image Analysis Approaches
- Python SDK
- C# SDK
- REST API
import os
import time
from openai import AzureOpenAI
client = AzureOpenAI(
azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
api_key=os.environ["AZURE_OPENAI_KEY"],
api_version="2024-10-21"
)
image_url = "https://example.com/sample-architecture.png"
# Low detail: Fixed cost (85 tokens), faster, less accurate
start = time.time()
response_low = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image."},
{"type": "image_url", "image_url": {"url": image_url, "detail": "low"}}
]
}],
max_tokens=300
)
time_low = time.time() - start
# High detail: Variable cost (based on image size), slower, more accurate
start = time.time()
response_high = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image."},
{"type": "image_url", "image_url": {"url": image_url, "detail": "high"}}
]
}],
max_tokens=300
)
time_high = time.time() - start
print("=== Low Detail ===")
print(f"Time: {time_low:.2f}s | Tokens: {response_low.usage.total_tokens}")
print(f"Response: {response_low.choices[0].message.content[:200]}...")
print("\n=== High Detail ===")
print(f"Time: {time_high:.2f}s | Tokens: {response_high.usage.total_tokens}")
print(f"Response: {response_high.choices[0].message.content[:200]}...")
print("\n=== Cost Comparison ===")
print("Low detail always uses 85 image tokens (fixed)")
print(f"High detail uses {response_high.usage.prompt_tokens - response_low.usage.prompt_tokens} additional image tokens")
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;
using System.Diagnostics;
string endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!;
string apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!;
AzureOpenAIClient azureClient = new(
new Uri(endpoint),
new AzureKeyCredential(apiKey));
ChatClient chatClient = azureClient.GetChatClient("gpt-4o");
var imageUri = new Uri("https://example.com/sample-architecture.png");
// Low detail comparison
var sw = Stopwatch.StartNew();
var lowResult = await chatClient.CompleteChatAsync(
new ChatMessage[] {
new UserChatMessage(
ChatMessageContentPart.CreateTextPart("Describe this image."),
ChatMessageContentPart.CreateImagePart(imageUri, ImageChatMessageContentPartDetail.Low))
},
new ChatCompletionOptions { MaxOutputTokenCount = 300 });
sw.Stop();
var timeLow = sw.Elapsed;
// High detail comparison
sw.Restart();
var highResult = await chatClient.CompleteChatAsync(
new ChatMessage[] {
new UserChatMessage(
ChatMessageContentPart.CreateTextPart("Describe this image."),
ChatMessageContentPart.CreateImagePart(imageUri, ImageChatMessageContentPartDetail.High))
},
new ChatCompletionOptions { MaxOutputTokenCount = 300 });
sw.Stop();
var timeHigh = sw.Elapsed;
Console.WriteLine($"Low Detail: {timeLow.TotalSeconds:F2}s | Tokens: {lowResult.Value.Usage.TotalTokenCount}");
Console.WriteLine($"High Detail: {timeHigh.TotalSeconds:F2}s | Tokens: {highResult.Value.Usage.TotalTokenCount}");
Console.WriteLine($"\nLow detail always uses 85 image tokens (fixed cost)");
# Low detail (85 tokens, fast)
echo "=== Low Detail ==="
time curl -s -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
-H "Content-Type: application/json" \
-H "api-key: ${AZURE_OPENAI_KEY}" \
-d '{
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image."},
{"type": "image_url", "image_url": {"url": "https://example.com/architecture.png", "detail": "low"}}
]
}],
"max_tokens": 300
}' | jq '.usage'
# High detail (variable tokens, more accurate)
echo "=== High Detail ==="
time curl -s -X POST "${AZURE_OPENAI_ENDPOINT%/}/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
-H "Content-Type: application/json" \
-H "api-key: ${AZURE_OPENAI_KEY}" \
-d '{
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image."},
{"type": "image_url", "image_url": {"url": "https://example.com/architecture.png", "detail": "high"}}
]
}],
"max_tokens": 300
}' | jq '.usage'
Expected Output
Image URL: https://dalleproduse.blob.core.windows.net/...
Revised prompt: A highly detailed modern cloud data center rendered in isometric 3D...
HD Image URL: https://dalleproduse.blob.core.windows.net/...
Revised prompt: A professional studio headshot photograph of a friendly humanoid robot...
Analysis: The image shows a web-queue-worker architecture pattern with a web frontend
connected to a message queue, which feeds into a background worker process...
Extracted text:
INVOICE #12345
Date: 2024-03-15
Customer: Contoso Ltd.
...
=== Low Detail ===
Time: 1.2s | Tokens: 198
=== High Detail ===
Time: 2.8s | Tokens: 1542
Break & Fix
| Scenario | Symptom | Root Cause | Fix |
|---|---|---|---|
| DALL-E returns a content filter error | ContentFilterError | Prompt triggered the safety filters | Rephrase the prompt; avoid potentially sensitive content |
| Image URL returns 400 | Invalid image URL | URL is not publicly accessible | Use base64 encoding for private images |
| Vision returns inaccurate analysis | Incorrect descriptions | Using detail: "low" on complex images | Switch to detail: "high" for detailed analysis |
| Unexpectedly high token count | High bill for vision requests | High-resolution images with detail: "high" | Use detail: "low" when full resolution is not needed |
| DALL-E n>1 fails | InvalidRequestError | DALL-E 3 supports only n=1 | Set n=1; make multiple requests to get multiple images |
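For the last row of the table, one way to obtain several images is to loop over single-image requests. This is a sketch under assumptions: `generate_variations` is a hypothetical helper, `client` is expected to be an `AzureOpenAI` client configured as in Task 1, and sequential calls are used for simplicity even though each call is billed and rate-limited separately.

```python
from typing import Any, List


def generate_variations(client: Any, prompt: str, count: int,
                        deployment: str = "dall-e-3") -> List[str]:
    """Request `count` images one at a time and return their temporary URLs."""
    if count < 1:
        raise ValueError("count must be at least 1")
    urls: List[str] = []
    for _ in range(count):
        # DALL-E 3 accepts only n=1, so issue one request per image.
        result = client.images.generate(model=deployment, prompt=prompt, n=1)
        urls.append(result.data[0].url)
    return urls
```

Because DALL-E 3 rewrites prompts before generating, repeated calls with the same prompt will typically return different images, which is often what a request for n>1 was after.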
Knowledge Check
1. Which image sizes does DALL-E 3 support in Azure OpenAI?
2. How are images supplied to GPT-4o for vision analysis?
3. What does the 'detail' parameter control when sending images to GPT-4o?
4. What is the 'revised_prompt' field in a DALL-E 3 response?
5. How long are image URLs generated by DALL-E 3 valid before they expire?
Cleanup
az group delete --name rg-ai102-challenge18 --yes --no-wait