Challenge 48: Multi-Format Processing Pipeline

Estimated Time

90-120 min | Cost: ~$3.00 (Search Basic + AI Services + Storage) | Domain: Knowledge Mining & Extraction (15-20%)

Domain 6 Capstone

This challenge integrates every concept from Domain 6 (AI Search indexing, skillsets, Document Intelligence, Content Understanding, and knowledge store) into a complete end-to-end document processing pipeline.

Exam skills covered

Skill	Weight
Design end-to-end document ingestion pipelines	High
Process multiple document formats (PDF, images, audio)	High
Combine AI Search with Document Intelligence	High
Build enrichment chains with multiple skills	High
Store and query processed results	Medium

Overview

Enterprise document processing has to handle several content types through one unified pipeline:

┌─────────────────┐
│  Source Content  │
│  - PDFs         │──┐
│  - Images       │  │    ┌───────────────┐     ┌──────────────┐     ┌─────────────┐
│  - Audio files  │  ├───▶│  Processing   │────▶│  Enrichment  │────▶│   Output    │
│  - Office docs  │  │    │  (Doc Intel)  │     │  (AI Search) │     │  (Index +   │
└─────────────────┘  │    └───────────────┘     └──────────────┘     │  Knowledge  │
                     │                                                │   Store)    │
                     │    ┌───────────────┐                          └─────────────┘
                     └───▶│ Content Under │─────────────────────────────────┘
                          │  standing     │
                          └───────────────┘

Pipeline components:

Ingestion: Upload multi-format documents to Blob Storage
Extraction: Document Intelligence extracts structure from PDFs and forms
Enrichment: An AI Search skillset adds NLP enrichment (entities, key phrases, language)
Custom processing: Content Understanding handles images and classification
Storage: Results go to the search index (queries) and the knowledge store (analytics)
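
The routing implied by the component list can be sketched as a small dispatcher that decides which branch of the pipeline handles a file. This is a minimal sketch and not part of the original challenge: the extension map and the `route_document` helper are hypothetical, and the returned values mirror the `source_type` values (pdf, image, audio) used by the index later on.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the pipeline branch that handles it.
EXTENSION_TO_SOURCE_TYPE = {
    ".pdf": "pdf",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".wav": "audio",
    ".mp3": "audio",
}


def route_document(blob_name: str) -> str:
    """Return the source_type for a blob name, or 'unsupported' if unknown."""
    suffix = PurePosixPath(blob_name).suffix.lower()
    return EXTENSION_TO_SOURCE_TYPE.get(suffix, "unsupported")


if __name__ == "__main__":
    for name in ["pdfs/Invoice_1.pdf", "images/receipt-scan.JPG", "notes.txt"]:
        print(name, "->", route_document(name))
```

Returning an explicit "unsupported" value, rather than raising, lets a caller log and skip unknown files without stopping the whole batch.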

Prerequisites

Challenges 40-47 completed (or equivalent knowledge)
Azure AI Search (Basic tier)
Azure AI Services (multi-service, S0)
Azure Document Intelligence (S0)
Azure Storage Account
Python 3.9+ with:
- azure-search-documents>=11.4.0
- azure-ai-documentintelligence>=1.0.0
- azure-storage-blob>=12.0.0
- openai>=1.0.0
- requests (used for the REST calls in Task 4)

Implementation

Task 1: Set up the infrastructure

RG="rg-ai102-pipeline"
LOCATION="eastus"
SEARCH_SERVICE="search-pipeline-$(openssl rand -hex 4)"
STORAGE_ACCOUNT="stpipeline$(openssl rand -hex 4)"
AI_SERVICE="ai-pipeline-$(openssl rand -hex 4)"
DOC_INTEL="docintell-pipeline-$(openssl rand -hex 4)"

az group create --name $RG --location $LOCATION

# Azure AI Search (Basic tier for vector + semantic)
az search service create \
  --name $SEARCH_SERVICE \
  --resource-group $RG \
  --location $LOCATION \
  --sku basic

# Storage Account
az storage account create \
  --name $STORAGE_ACCOUNT \
  --resource-group $RG \
  --location $LOCATION \
  --sku Standard_LRS

# Create containers for different content types
az storage container create --name "pdfs" --account-name $STORAGE_ACCOUNT
az storage container create --name "images" --account-name $STORAGE_ACCOUNT
az storage container create --name "processed" --account-name $STORAGE_ACCOUNT

# AI Services (multi-service)
az cognitiveservices account create \
  --name $AI_SERVICE \
  --resource-group $RG \
  --location $LOCATION \
  --kind AIServices \
  --sku S0 --yes

# Document Intelligence
az cognitiveservices account create \
  --name $DOC_INTEL \
  --resource-group $RG \
  --location $LOCATION \
  --kind FormRecognizer \
  --sku S0 --yes

# Get all keys
SEARCH_KEY=$(az search admin-key show --resource-group $RG --service-name $SEARCH_SERVICE --query "primaryKey" -o tsv)
STORAGE_CONN=$(az storage account show-connection-string --name $STORAGE_ACCOUNT --resource-group $RG --query "connectionString" -o tsv)
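AI_ENDPOINT=$(az cognitiveservices account show --name $AI_SERVICE --resource-group $RG --query "properties.endpoint" -o tsv)
AI_KEY=$(az cognitiveservices account keys list --name $AI_SERVICE --resource-group $RG --query "key1" -o tsv)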
DOC_ENDPOINT=$(az cognitiveservices account show --name $DOC_INTEL --resource-group $RG --query "properties.endpoint" -o tsv)
DOC_KEY=$(az cognitiveservices account keys list --name $DOC_INTEL --resource-group $RG --query "key1" -o tsv)

Task 2: Create a unified search index

Python SDK
REST API

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticSearch,
    SemanticPrioritizedFields,
    SemanticField,
)

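import os

# Assumes the Task 1 shell variables were exported (for example: export SEARCH_SERVICE SEARCH_KEY)
SEARCH_SERVICE = os.environ["SEARCH_SERVICE"]
SEARCH_KEY = os.environ["SEARCH_KEY"]

endpoint = f"https://{SEARCH_SERVICE}.search.windows.net"
credential = AzureKeyCredential(SEARCH_KEY)
index_client = SearchIndexClient(endpoint=endpoint, credential=credential)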

# Unified index for all content types
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True, filterable=True),
    SearchableField(name="title", type=SearchFieldDataType.String, filterable=True, sortable=True),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SimpleField(name="source_type", type=SearchFieldDataType.String, filterable=True, facetable=True),  # pdf, image, audio
    SimpleField(name="source_path", type=SearchFieldDataType.String, filterable=True),
    SimpleField(name="processed_date", type=SearchFieldDataType.DateTimeOffset, filterable=True, sortable=True),
    SearchableField(name="keyphrases", type=SearchFieldDataType.Collection(SearchFieldDataType.String), filterable=True, facetable=True),
    SearchableField(name="entities", type=SearchFieldDataType.Collection(SearchFieldDataType.String), filterable=True, facetable=True),
    SimpleField(name="language", type=SearchFieldDataType.String, filterable=True, facetable=True),
    SimpleField(name="confidence_score", type=SearchFieldDataType.Double, filterable=True, sortable=True),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1536,
        vector_search_profile_name="vector-profile"
    ),
    # Document Intelligence specific fields
    SimpleField(name="doc_type", type=SearchFieldDataType.String, filterable=True, facetable=True),
    SimpleField(name="page_count", type=SearchFieldDataType.Int32, filterable=True),
    SearchableField(name="tables_content", type=SearchFieldDataType.String),
]

vector_search = VectorSearch(
    algorithms=[HnswAlgorithmConfiguration(name="hnsw-config")],
    profiles=[VectorSearchProfile(name="vector-profile", algorithm_configuration_name="hnsw-config")]
)

semantic_search = SemanticSearch(
    configurations=[
        SemanticConfiguration(
            name="semantic-config",
            prioritized_fields=SemanticPrioritizedFields(
                title_field=SemanticField(field_name="title"),
                content_fields=[SemanticField(field_name="content")]
            )
        )
    ]
)

index = SearchIndex(
    name="unified-content-index",
    fields=fields,
    vector_search=vector_search,
    semantic_search=semantic_search
)

index_client.create_or_update_index(index)
print("Unified content index created")

curl -X PUT "https://${SEARCH_SERVICE}.search.windows.net/indexes/unified-content-index?api-version=2024-07-01" \
  -H "Content-Type: application/json" \
  -H "api-key: ${SEARCH_KEY}" \
  -d '{
    "name": "unified-content-index",
    "fields": [
      {"name": "id", "type": "Edm.String", "key": true, "filterable": true},
      {"name": "title", "type": "Edm.String", "searchable": true, "filterable": true},
      {"name": "content", "type": "Edm.String", "searchable": true},
      {"name": "source_type", "type": "Edm.String", "filterable": true, "facetable": true},
      {"name": "keyphrases", "type": "Collection(Edm.String)", "searchable": true, "filterable": true, "facetable": true},
      {"name": "entities", "type": "Collection(Edm.String)", "searchable": true, "filterable": true},
      {"name": "language", "type": "Edm.String", "filterable": true, "facetable": true},
      {"name": "content_vector", "type": "Collection(Edm.Single)", "searchable": true, "dimensions": 1536, "vectorSearchProfile": "vector-profile"}
    ],
    "vectorSearch": {
      "algorithms": [{"name": "hnsw-config", "kind": "hnsw"}],
      "profiles": [{"name": "vector-profile", "algorithm": "hnsw-config"}]
    },
    "semantic": {
      "configurations": [{"name": "semantic-config", "prioritizedFields": {"titleField": {"fieldName": "title"}, "contentFields": [{"fieldName": "content"}]}}]
    }
  }'
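
Before uploading to this index, it can help to check each document against the two constraints most likely to cause a rejected upload: the key format and the vector length. The following is a minimal sketch under stated assumptions: `validate_search_doc` is a hypothetical helper, the key rule reflects the documented restriction that document keys contain only letters, digits, underscore, dash, or equal sign, and 1536 mirrors the dimension declared for `content_vector` above.

```python
import re

VECTOR_DIMENSIONS = 1536  # mirrors vector_search_dimensions in the index definition
_KEY_PATTERN = re.compile(r"^[A-Za-z0-9_\-=]+$")


def validate_search_doc(doc: dict) -> list[str]:
    """Return a list of problems found in a document destined for the index."""
    problems = []

    # The key field must be a non-empty string made of allowed characters.
    key = doc.get("id")
    if not isinstance(key, str) or not _KEY_PATTERN.match(key):
        problems.append("id must be a non-empty string of letters, digits, '_', '-' or '='")

    # The vector is optional, but when present its length must match the index.
    vector = doc.get("content_vector")
    if vector is not None and len(vector) != VECTOR_DIMENSIONS:
        problems.append(
            f"content_vector has {len(vector)} dimensions, expected {VECTOR_DIMENSIONS}"
        )
    return problems


if __name__ == "__main__":
    print(validate_search_doc({"id": "bad id!", "content_vector": [0.0, 0.1]}))
```

An MD5 hex digest, as used for `id` in the later tasks, passes the key check because it contains only hexadecimal characters.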

Task 3: Process PDFs with Document Intelligence + AI Search

Python SDK
C# SDK

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
from azure.search.documents import SearchClient
from openai import AzureOpenAI
from datetime import datetime, timezone
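import hashlib
import os

# Initialize clients.
# Assumes DOC_ENDPOINT and DOC_KEY from Task 1 were exported as environment variables.
# AOAI_ENDPOINT and AOAI_KEY are not created in Task 1: they are assumed to point to an
# Azure OpenAI resource with an embedding deployment named "text-embedding-3-small".
DOC_ENDPOINT = os.environ["DOC_ENDPOINT"]
DOC_KEY = os.environ["DOC_KEY"]
AOAI_ENDPOINT = os.environ["AOAI_ENDPOINT"]
AOAI_KEY = os.environ["AOAI_KEY"]

doc_client = DocumentIntelligenceClient(endpoint=DOC_ENDPOINT, credential=AzureKeyCredential(DOC_KEY))
search_client = SearchClient(endpoint=endpoint, index_name="unified-content-index", credential=credential)
aoai_client = AzureOpenAI(api_key=AOAI_KEY, api_version="2024-06-01", azure_endpoint=AOAI_ENDPOINT)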

def get_embedding(text: str) -> list[float]:
    response = aoai_client.embeddings.create(input=text[:8000], model="text-embedding-3-small")
    return response.data[0].embedding

def process_pdf(pdf_url: str, file_name: str):
    """Process a PDF through Document Intelligence and index results."""

    # Step 1: Extract content with Document Intelligence (layout model)
    poller = doc_client.begin_analyze_document(
        "prebuilt-layout",
        AnalyzeDocumentRequest(url_source=pdf_url)
    )
    result = poller.result()

    # Extract full text content
    content_parts = []
    for page in result.pages:
        for line in page.lines:
            content_parts.append(line.content)
    full_content = " ".join(content_parts)

    # Extract tables
    tables_text = ""
    if result.tables:
        for table in result.tables:
            table_rows = {}
            for cell in table.cells:
                row = cell.row_index
                if row not in table_rows:
                    table_rows[row] = []
                table_rows[row].append(cell.content)
            for row_cells in table_rows.values():
                tables_text += " | ".join(row_cells) + "\n"

    # Step 2: Generate embedding
    content_vector = get_embedding(full_content[:8000])

    # Step 3: Create search document
    doc_id = hashlib.md5(pdf_url.encode()).hexdigest()
    search_doc = {
        "id": doc_id,
        "title": file_name,
        "content": full_content,
        "source_type": "pdf",
        "source_path": pdf_url,
        "processed_date": datetime.now(timezone.utc).isoformat(),
        "page_count": len(result.pages),
        "tables_content": tables_text,
        "content_vector": content_vector,
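        "language": "en",          # placeholder: hardcoded, not detected from the document
        "confidence_score": 0.95,  # placeholder: hardcoded, not returned by the analysis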
    }

    # Step 4: Upload to index
    search_client.upload_documents([search_doc])
    print(f"Indexed PDF: {file_name} ({len(result.pages)} pages, {len(full_content)} chars)")
    return doc_id

# Process sample PDFs
process_pdf(
    "https://raw.githubusercontent.com/Azure/azure-sdk-for-python/main/sdk/documentintelligence/azure-ai-documentintelligence/samples/sample_forms/forms/Invoice_1.pdf",
    "Invoice_1.pdf"
)
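
The table loop in `process_pdf` groups cells by `row_index` in the order they arrive. A slightly more defensive variant also orders cells by `column_index`, so the flattened text stays stable if cells come back out of order. This is a minimal sketch: the `Cell` dataclass is an illustrative stand-in for the SDK's table cell objects (it only assumes the `row_index`, `column_index`, and `content` attributes), and `table_to_text` is a hypothetical helper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Cell:
    """Stand-in for a Document Intelligence table cell (illustrative only)."""
    row_index: int
    column_index: int
    content: str


def table_to_text(cells) -> str:
    """Flatten table cells into 'a | b' lines, ordered by row then column."""
    rows = defaultdict(dict)
    for cell in cells:
        rows[cell.row_index][cell.column_index] = cell.content

    lines = []
    for row_index in sorted(rows):
        columns = rows[row_index]
        lines.append(" | ".join(columns[c] for c in sorted(columns)))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        Cell(1, 1, "100"),
        Cell(0, 0, "Item"),
        Cell(1, 0, "Consulting"),
        Cell(0, 1, "Amount"),
    ]
    print(table_to_text(sample))
```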

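using Azure;                          // AzureKeyCredential, WaitUntil
using Azure.AI.DocumentIntelligence;
using Azure.AI.OpenAI;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;  // SearchDocument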

async Task<string> ProcessPdfAsync(string pdfUrl, string fileName)
{
    // Extract with Document Intelligence
    var docClient = new DocumentIntelligenceClient(new Uri(docEndpoint), new AzureKeyCredential(docKey));
    var operation = await docClient.AnalyzeDocumentAsync(
        WaitUntil.Completed, "prebuilt-layout",
        new AnalyzeDocumentContent() { UrlSource = new Uri(pdfUrl) });

    var result = operation.Value;
    var content = string.Join(" ", result.Pages.SelectMany(p => p.Lines).Select(l => l.Content));

    // Generate embedding
    var embeddingClient = new AzureOpenAIClient(new Uri(aoaiEndpoint), new AzureKeyCredential(aoaiKey))
        .GetEmbeddingClient("text-embedding-3-small");
    var embedding = await embeddingClient.GenerateEmbeddingAsync(content[..Math.Min(content.Length, 8000)]);

    // Index document
    var searchClient = new SearchClient(new Uri(searchEndpoint), "unified-content-index", new AzureKeyCredential(searchKey));
    var docId = Convert.ToHexString(System.Security.Cryptography.MD5.HashData(System.Text.Encoding.UTF8.GetBytes(pdfUrl))).ToLower();

    var searchDoc = new SearchDocument(new Dictionary<string, object>
    {
        ["id"] = docId,
        ["title"] = fileName,
        ["content"] = content,
        ["source_type"] = "pdf",
        ["page_count"] = result.Pages.Count,
        ["content_vector"] = embedding.Value.ToFloats().ToArray()
    });

    await searchClient.UploadDocumentsAsync(new[] { searchDoc });
    Console.WriteLine($"Indexed: {fileName}");
    return docId;
}
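
Both versions above embed only the first 8000 characters of a document, so the tail of a long PDF never reaches the vector field. One common alternative is to split the text into overlapping chunks and embed each chunk as its own search document. The sketch below is an assumption-laden illustration, not the challenge's method: `chunk_text` is a hypothetical helper, it splits by characters rather than tokens, and the default sizes are arbitrary.

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into character chunks, each sharing `overlap` chars with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap must be in [0, chunk_size)")

    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a trailing fragment
        # that only repeats the overlap.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for i, chunk in enumerate(chunk_text("abcdefghij", chunk_size=4, overlap=1)):
        print(i, chunk)
```

Each chunk would then need its own `id` (for example, the document hash plus a chunk number) and a field pointing back to the parent document, which the index in Task 2 does not define yet.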

Task 4: Process images with enrichment

Python SDK
REST API

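import os
import time

import requests

# Assumes AI_ENDPOINT (the AI Services endpoint) and AI_KEY are available as environment variables
AI_ENDPOINT = os.environ["AI_ENDPOINT"]
AI_KEY = os.environ["AI_KEY"]

def process_image(image_url: str, file_name: str):
    """Process an image using Content Understanding and index results."""

    # Use Content Understanding to analyze image.
    # Assumes a custom analyzer named "image-analyzer" already exists and defines
    # the "Description" and "TextContent" fields read below.
    api_version = "2024-12-01-preview"
    analyze_url = f"{AI_ENDPOINT.rstrip('/')}/contentunderstanding/analyzers/image-analyzer:analyze?api-version={api_version}"

    response = requests.post(
        analyze_url,
        headers={
            "Ocp-Apim-Subscription-Key": AI_KEY,
            "Content-Type": "application/json"
        },
        json={"url": image_url}
    )

    if response.status_code == 202:
        operation_url = response.headers["Operation-Location"]
        while True:
            time.sleep(3)
            poll = requests.get(operation_url, headers={"Ocp-Apim-Subscription-Key": AI_KEY})
            data = poll.json()
            # Compare case-insensitively: the service may report "Succeeded" / "Failed"
            status = str(data.get("status", "")).lower()
            if status == "succeeded":
                result = data.get("result", {})
                break
            elif status == "failed":
                print(f"Image analysis failed: {data}")
                return None
    else:
        print(f"Error: {response.status_code}")
        return None

    # Extract content from analysis results
    contents = result.get("contents", [{}])
    fields = contents[0].get("fields", {}) if contents else {}
    # String fields may come back under "valueString" rather than "value"; accept either
    description_field = fields.get("Description", {})
    text_field = fields.get("TextContent", {})
    description = description_field.get("valueString") or description_field.get("value") or file_name
    text_content = text_field.get("valueString") or text_field.get("value") or ""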

    # Combine description and text for full content
    full_content = f"{description}. {text_content}" if text_content else description

    # Generate embedding
    content_vector = get_embedding(full_content)

    # Index
    doc_id = hashlib.md5(image_url.encode()).hexdigest()
    search_doc = {
        "id": doc_id,
        "title": file_name,
        "content": full_content,
        "source_type": "image",
        "source_path": image_url,
        "processed_date": datetime.now(timezone.utc).isoformat(),
        "content_vector": content_vector,
        "confidence_score": fields.get("Description", {}).get("confidence", 0.0),
    }

    search_client.upload_documents([search_doc])
    print(f"Indexed image: {file_name}")
    return doc_id

# Process image through Content Understanding
curl -s -i -X POST \
  "${AI_ENDPOINT%/}/contentunderstanding/analyzers/image-analyzer:analyze?api-version=2024-12-01-preview" \
  -H "Ocp-Apim-Subscription-Key: ${AI_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/sample-image.jpg"}'

# After getting results, upload to search index
curl -X POST "https://${SEARCH_SERVICE}.search.windows.net/indexes/unified-content-index/docs/index?api-version=2024-07-01" \
  -H "Content-Type: application/json" \
  -H "api-key: ${SEARCH_KEY}" \
  -d '{
    "value": [
      {
        "@search.action": "upload",
        "id": "img-001",
        "title": "product-photo.jpg",
        "content": "Product packaging showing the new AI-powered widget with blue branding",
        "source_type": "image"
      }
    ]
  }'
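
The polling loop in `process_image` waits indefinitely if the operation never leaves a running state. A bounded version can be sketched as a small helper that takes the status call as a parameter. This is a minimal sketch under stated assumptions: `poll_until_done` is a hypothetical helper, the default interval and timeout are arbitrary, and the response shape (`status` and `result` keys) is taken from the code above.

```python
import time


def poll_until_done(fetch_status, interval_s=3.0, timeout_s=120.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it reports success or failure, or the timeout passes.

    fetch_status must return a dict with a "status" key and, on success, a "result" key.
    """
    deadline = clock() + timeout_s
    while True:
        data = fetch_status()
        status = str(data.get("status", "")).lower()
        if status == "succeeded":
            return data.get("result", {})
        if status == "failed":
            raise RuntimeError(f"Analysis failed: {data}")
        if clock() >= deadline:
            raise TimeoutError(f"Operation still '{status}' after {timeout_s} seconds")
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated service responses; a real caller would wrap requests.get(...).json().
    responses = iter([{"status": "Running"}, {"status": "Succeeded", "result": {"contents": []}}])
    print(poll_until_done(lambda: next(responses), interval_s=0.0))
```

In the pipeline, `fetch_status` could be `lambda: requests.get(operation_url, headers=...).json()`; passing it in keeps the helper testable without network access.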

Task 5: Query the unified index

Python SDK
REST API

from azure.search.documents.models import VectorizedQuery

# Hybrid query across all content types
query_text = "What invoices mention consulting services?"
query_vector = get_embedding(query_text)

results = search_client.search(
    search_text=query_text,
    vector_queries=[
        VectorizedQuery(vector=query_vector, k_nearest_neighbors=5, fields="content_vector")
    ],
    query_type="semantic",
    semantic_configuration_name="semantic-config",
    filter="source_type eq 'pdf'",  # restricts results to PDFs; remove to search every content type
    facets=["source_type", "language"],
    include_total_count=True,
    select=["id", "title", "content", "source_type", "confidence_score"],
    top=10
)

print(f"=== Pipeline Query Results ===")
print(f"Total: {results.get_count()}")
print(f"\nFacets:")
for facet_name, facet_values in results.get_facets().items():
    print(f"  {facet_name}:")
    for fv in facet_values:
        print(f"    {fv['value']}: {fv['count']}")

print(f"\nResults:")
for r in results:
    print(f"  [{r['source_type']}] {r['title']} (score: {r['@search.score']:.4f})")
    print(f"    {r['content'][:100]}...")

curl -s -X POST "https://${SEARCH_SERVICE}.search.windows.net/indexes/unified-content-index/docs/search?api-version=2024-07-01" \
  -H "Content-Type: application/json" \
  -H "api-key: ${SEARCH_KEY}" \
  -d '{
    "search": "consulting services invoice",
    "queryType": "semantic",
    "semanticConfiguration": "semantic-config",
    "filter": "source_type eq '\''pdf'\''",
    "facets": ["source_type", "language"],
    "select": "id,title,content,source_type",
    "top": 10,
    "count": true
  }'

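Expected Output

The sample below reflects the query run without the `source_type` filter; with the filter in place, only the PDF entries are returned.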

=== Pipeline Query Results ===
Total: 3

Facets:
  source_type:
    pdf: 2
    image: 1
  language:
    en: 3

Results:
  [pdf] Invoice_1.pdf (score: 0.0341)
    CONTOSO LTD. Invoice #INV-001 consulting services...
  [pdf] Invoice_2.pdf (score: 0.0289)
    Fabrikam Inc. Professional consulting engagement...
  [image] receipt-scan.jpg (score: 0.0142)
    Scanned receipt showing consulting fee payment...
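
The small `@search.score` values in the sample are typical of hybrid queries, where the text and vector rankings are merged with Reciprocal Rank Fusion (RRF). The sketch below illustrates the RRF formula only: it assumes the commonly cited constant k = 60, uses hypothetical document IDs, and does not claim to reproduce the service's internal implementation.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse several ranked lists: each document earns 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


if __name__ == "__main__":
    text_ranking = ["inv1", "inv2", "receipt"]    # hypothetical keyword (BM25) order
    vector_ranking = ["inv1", "receipt", "inv2"]  # hypothetical vector-similarity order
    fused = rrf_scores([text_ranking, vector_ranking])
    for doc_id, score in sorted(fused.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{doc_id}: {score:.4f}")
```

Because every contribution is at most 1 / (k + 1), fused scores stay in the low hundredths regardless of how strong the underlying text or vector match was, which is why they should be compared only within a single result set.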

Break & Fix

#	Scenario	Symptom	Root Cause	Fix
1	PDF processing fails for scanned documents	Document Intelligence returns empty content	PDF contains only images, with no selectable text	Use `prebuilt-read` with OCR, or set `imageAction` in the indexer configuration
2	Vector dimension mismatch	Upload fails: "vector dimensions don't match"	Embedding model changed between indexing runs to one with a different output size (e.g., 3-small at 1536 vs 3-large at 3072 dimensions)	Make sure every document uses the same embedding model; rebuild the index if the model changes
3	Cross-format search returns skewed results	PDFs always rank higher than images	PDF content is longer, producing higher BM25 scores	Use semantic ranking to normalize; consider separate relevance tuning per source type
4	Knowledge store is missing data	Table projections are empty for image content	Images do not produce structured table data	Design projections per content type; use conditional projections or separate skillsets
5	Pipeline throughput bottleneck	Processing 1000 documents takes hours	Sequential processing; no parallelism	Use batch processing and asynchronous operations, and tune the indexer's `batchSize` (`maxFailedItems` controls failure tolerance, not throughput)
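
For scenario 5, the simplest form of parallelism is a thread pool, since each document spends most of its time waiting on service calls. The following is a minimal sketch, not a tuned implementation: `process_all` is a hypothetical helper, `max_workers=4` is an arbitrary default, and `process_one` stands for any per-document function such as a wrapper around `process_pdf`. Service rate limits still apply and are not handled here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(items, process_one, max_workers=4):
    """Run process_one over items concurrently; return (results, failures) keyed by item."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:  # keep going; record the failure for a later retry
                failures[item] = exc
    return results, failures


if __name__ == "__main__":
    def fake_process(url: str) -> str:
        # Stand-in for a real per-document call.
        if url.endswith("broken.pdf"):
            raise ValueError("could not analyze")
        return url.upper()

    done, failed = process_all(["a.pdf", "b.pdf", "broken.pdf"], fake_process)
    print(sorted(done), sorted(failed))
```

Collecting failures instead of raising on the first one means a single bad document does not discard the work already done for the rest of the batch.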

Knowledge Check

1. You are building a pipeline that processes PDFs, images, and audio files into a single search index. What is the BEST approach for handling these different formats?

2. Your pipeline generates embeddings for documents before indexing. A new embedding model is released with better performance. What should you do?

3. A PDF processed by Document Intelligence returns 50 pages of content. You need to index it for vector search. Which preprocessing step is recommended?

4. You want to query your unified index for 'all Contoso invoices with an amount above $1000'. Which combination of search features is most appropriate?

5. Your pipeline processes 10,000 documents per day. The Document Intelligence extraction step is the bottleneck. How do you scale it?

Cleanup

az group delete --name rg-ai102-pipeline --yes --no-wait
