Challenge 40: Azure AI Search Index and Skillset

Estimated Time

60-75 min | Cost: ~$0.50 (free-tier Search + Storage) | Domain: Knowledge Mining & Extraction (15-20%)

Exam skills covered

Skill	Weight
Provision an Azure AI Search resource	High
Create a data source	High
Create an index	High
Create and run an indexer	High
Create a skillset with built-in skills	High
Map enriched fields to an index	Medium

Overview

Azure AI Search is a cloud search service that provides indexing and querying over heterogeneous content. The enrichment pipeline follows this architecture:

Data Source → Indexer → Skillset (enriquecimento com IA) → Index (armazenamento pesquisável)

Key concepts:

Data source: Connection to the content (Blob Storage, SQL Database, Cosmos DB, Table Storage)
Index: Schema that defines searchable fields with types and attributes (searchable, filterable, sortable, facetable)
Skillset: Collection of AI skills that enrich content during indexing (entity recognition, key phrase extraction, language detection, OCR, image analysis)
Indexer: Orchestrator that pulls data from the source, runs the skillset, and populates the index
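
To make the relationships between these four resources concrete, here is a minimal Python sketch (standard library only, not the Azure SDK). It models the resources as plain dictionaries and checks that the indexer's name references resolve. The dictionary layout and the `check_wiring` helper are illustrative assumptions; the resource names match the ones used later in this lab.

```python
# Illustrative model only: plain dicts, not Azure SDK objects.
pipeline = {
    "datasources": {"blob-datasource": {"type": "azureblob", "container": "documents"}},
    "indexes": {"documents-index": {"fields": ["id", "content", "keyphrases"]}},
    "skillsets": {"document-skillset": {"skills": ["keyphrases-skill"]}},
    "indexers": {
        "document-indexer": {
            "dataSourceName": "blob-datasource",
            "targetIndexName": "documents-index",
            "skillsetName": "document-skillset",
        }
    },
}


def check_wiring(pipeline, indexer_name):
    """Return a list of problems in one indexer's references (hypothetical helper)."""
    indexer = pipeline["indexers"][indexer_name]
    problems = []
    for key, collection in (
        ("dataSourceName", "datasources"),
        ("targetIndexName", "indexes"),
        ("skillsetName", "skillsets"),
    ):
        target = indexer.get(key)
        # A skillset is optional; the data source and target index are required.
        if target is None and key == "skillsetName":
            continue
        if target not in pipeline[collection]:
            problems.append(f"{key} -> {target!r} not found in {collection}")
    return problems


print(check_wiring(pipeline, "document-indexer"))
```

The indexer is the only resource that names the other three, which is why it is created last in the tasks below.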

Architecture

┌─────────────┐     ┌──────────┐     ┌────────────┐     ┌─────────┐
│ Blob Storage│────▶│ Indexer  │────▶│  Skillset  │────▶│  Index  │
│ (PDFs, imgs)│     │          │     │ (AI Skills)│     │(search) │
└─────────────┘     └──────────┘     └────────────┘     └─────────┘
                                           │
                                     ┌─────┴─────┐
                                     │ AI Services│
                                     │ (multi)    │
                                     └───────────┘

Prerequisites

Azure subscription with the Contributor role
Azure CLI 2.60+
Python 3.9+ with azure-search-documents>=11.4.0 and azure-identity
.NET 8 SDK with the Azure.Search.Documents NuGet package
A storage account with sample PDF/text documents uploaded to a container

Implementation

Task 1: Provision Azure AI Search and upload sample data

# Variables
RG="rg-ai102-search"
LOCATION="eastus"
SEARCH_SERVICE="search-ai102-$(openssl rand -hex 4)"
STORAGE_ACCOUNT="stai102search$(openssl rand -hex 4)"
CONTAINER="documents"
AI_SERVICE="ai-services-ai102"

# Create resource group
az group create --name $RG --location $LOCATION

# Create Azure AI Search (Free tier for lab)
az search service create \
  --name $SEARCH_SERVICE \
  --resource-group $RG \
  --location $LOCATION \
  --sku free

# Create storage account and container
az storage account create \
  --name $STORAGE_ACCOUNT \
  --resource-group $RG \
  --location $LOCATION \
  --sku Standard_LRS

az storage container create \
  --name $CONTAINER \
  --account-name $STORAGE_ACCOUNT

# Upload sample documents (create a sample text file)
echo "Azure AI services provide cloud-based AI capabilities. Microsoft Azure offers cognitive services for vision, speech, language, and decision." > sample-doc.txt
az storage blob upload \
  --account-name $STORAGE_ACCOUNT \
  --container-name $CONTAINER \
  --name "sample-doc.txt" \
  --file "sample-doc.txt"

# Create Azure AI Services (multi-service) for skillset
az cognitiveservices account create \
  --name $AI_SERVICE \
  --resource-group $RG \
  --location $LOCATION \
  --kind AIServices \
  --sku S0 \
  --yes

# Get keys
SEARCH_KEY=$(az search admin-key show \
  --resource-group $RG \
  --service-name $SEARCH_SERVICE \
  --query "primaryKey" -o tsv)

STORAGE_CONN=$(az storage account show-connection-string \
  --name $STORAGE_ACCOUNT \
  --resource-group $RG \
  --query "connectionString" -o tsv)

AI_KEY=$(az cognitiveservices account keys list \
  --name $AI_SERVICE \
  --resource-group $RG \
  --query "key1" -o tsv)
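
The script above appends `openssl rand -hex 4` to resource names because search service and storage account names must be globally unique. As a rough Python equivalent, the sketch below generates the same kind of suffix and checks the result against the storage account naming rule (3 to 24 characters, lowercase letters and digits only). The helper names `unique_name` and `is_valid_storage_account_name` are hypothetical.

```python
import re
import secrets


def unique_name(prefix: str) -> str:
    """Append 8 random hex characters, like `openssl rand -hex 4` in the script above."""
    return f"{prefix}{secrets.token_hex(4)}"


def is_valid_storage_account_name(name: str) -> bool:
    """Storage account names: 3-24 characters, lowercase letters and digits only."""
    return re.fullmatch(r"[a-z0-9]{3,24}", name) is not None


storage_name = unique_name("stai102search")
print(storage_name, is_valid_storage_account_name(storage_name))
```

With the `stai102search` prefix the generated name is 21 characters long, which leaves little room for a longer prefix before hitting the 24-character limit.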

Task 2: Create the search index

Python SDK
C# SDK
REST API

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
)

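import os

# Configuration: SEARCH_SERVICE and SEARCH_KEY come from Task 1 (export them in your shell first)
SEARCH_SERVICE = os.environ["SEARCH_SERVICE"]
SEARCH_KEY = os.environ["SEARCH_KEY"]
endpoint = f"https://{SEARCH_SERVICE}.search.windows.net"
credential = AzureKeyCredential(SEARCH_KEY)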

index_client = SearchIndexClient(endpoint=endpoint, credential=credential)

# Define the index schema
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True, filterable=True),
    SearchableField(name="content", type=SearchFieldDataType.String, analyzer_name="en.microsoft"),
    SearchableField(name="metadata_storage_name", type=SearchFieldDataType.String, filterable=True, sortable=True),
    SimpleField(name="metadata_storage_path", type=SearchFieldDataType.String, filterable=True),
    # SearchableField is always a string type; collection=True makes it Collection(Edm.String)
    SearchableField(name="keyphrases", collection=True, filterable=True, facetable=True),
    SearchableField(name="organizations", collection=True, filterable=True, facetable=True),
    SimpleField(name="language", type=SearchFieldDataType.String, filterable=True, facetable=True),
]

index = SearchIndex(name="documents-index", fields=fields)
result = index_client.create_or_update_index(index)
print(f"Index '{result.name}' created successfully")

using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

var endpoint = new Uri($"https://{searchService}.search.windows.net");
var credential = new AzureKeyCredential(searchKey);
var indexClient = new SearchIndexClient(endpoint, credential);

var fields = new List<SearchField>
{
    new SimpleField("id", SearchFieldDataType.String) { IsKey = true, IsFilterable = true },
    new SearchableField("content") { AnalyzerName = LexicalAnalyzerName.EnMicrosoft },
    new SearchableField("metadata_storage_name") { IsFilterable = true, IsSortable = true },
    new SimpleField("metadata_storage_path", SearchFieldDataType.String) { IsFilterable = true },
    new SearchableField("keyphrases", collection: true) { IsFilterable = true, IsFacetable = true },
    new SearchableField("organizations", collection: true) { IsFilterable = true, IsFacetable = true },
    new SimpleField("language", SearchFieldDataType.String) { IsFilterable = true, IsFacetable = true },
};

var index = new SearchIndex("documents-index", fields);
var result = await indexClient.CreateOrUpdateIndexAsync(index);
Console.WriteLine($"Index '{result.Value.Name}' created successfully");

curl -X PUT "https://${SEARCH_SERVICE}.search.windows.net/indexes/documents-index?api-version=2024-07-01" \
  -H "Content-Type: application/json" \
  -H "api-key: ${SEARCH_KEY}" \
  -d '{
    "name": "documents-index",
    "fields": [
      {"name": "id", "type": "Edm.String", "key": true, "filterable": true},
      {"name": "content", "type": "Edm.String", "searchable": true, "analyzer": "en.microsoft"},
      {"name": "metadata_storage_name", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true},
      {"name": "metadata_storage_path", "type": "Edm.String", "filterable": true},
      {"name": "keyphrases", "type": "Collection(Edm.String)", "searchable": true, "filterable": true, "facetable": true},
      {"name": "organizations", "type": "Collection(Edm.String)", "searchable": true, "filterable": true, "facetable": true},
      {"name": "language", "type": "Edm.String", "filterable": true, "facetable": true}
    ]
  }'
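
Before sending an index definition, it can help to check it locally against two schema rules: an index needs exactly one key field of type `Edm.String`, and only string fields (`Edm.String` or `Collection(Edm.String)`) can be searchable. The following sketch uses a trimmed copy of the schema above; `validate_index` is a hypothetical helper, and it inspects only attributes that are set explicitly.

```python
index_definition = {
    "name": "documents-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {"name": "keyphrases", "type": "Collection(Edm.String)", "searchable": True, "facetable": True},
        {"name": "language", "type": "Edm.String", "filterable": True, "facetable": True},
    ],
}

STRING_TYPES = {"Edm.String", "Collection(Edm.String)"}


def validate_index(definition):
    """Return a list of schema problems (hypothetical helper, not a full validator)."""
    problems = []
    key_fields = [f for f in definition["fields"] if f.get("key")]
    if len(key_fields) != 1:
        problems.append(f"expected exactly one key field, found {len(key_fields)}")
    for field in key_fields:
        if field["type"] != "Edm.String":
            problems.append(f"key field {field['name']!r} must be Edm.String")
    for field in definition["fields"]:
        # Full-text search only applies to string content.
        if field.get("searchable") and field["type"] not in STRING_TYPES:
            problems.append(f"field {field['name']!r} of type {field['type']} cannot be searchable")
    return problems


print(validate_index(index_definition))
```

The service performs its own validation on creation, so this is only a quick local sanity check.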

Task 3: Create the data source connection

Python SDK
C# SDK
REST API

import os

from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import SearchIndexerDataSourceConnection, SearchIndexerDataContainer

# STORAGE_CONN comes from Task 1 (export it in your shell first)
STORAGE_CONN = os.environ["STORAGE_CONN"]

indexer_client = SearchIndexerClient(endpoint=endpoint, credential=credential)

data_source = SearchIndexerDataSourceConnection(
    name="blob-datasource",
    type="azureblob",
    connection_string=STORAGE_CONN,
    container=SearchIndexerDataContainer(name="documents")
)

result = indexer_client.create_or_update_data_source_connection(data_source)
print(f"Data source '{result.name}' created")

using Azure.Search.Documents.Indexes.Models;

var indexerClient = new SearchIndexerClient(endpoint, credential);

var dataSource = new SearchIndexerDataSourceConnection(
    name: "blob-datasource",
    type: SearchIndexerDataSourceType.AzureBlob,
    connectionString: storageConnectionString,
    container: new SearchIndexerDataContainer("documents"));

await indexerClient.CreateOrUpdateDataSourceConnectionAsync(dataSource);
Console.WriteLine("Data source 'blob-datasource' created");

curl -X PUT "https://${SEARCH_SERVICE}.search.windows.net/datasources/blob-datasource?api-version=2024-07-01" \
  -H "Content-Type: application/json" \
  -H "api-key: ${SEARCH_KEY}" \
  -d '{
    "name": "blob-datasource",
    "type": "azureblob",
    "credentials": { "connectionString": "'"${STORAGE_CONN}"'" },
    "container": { "name": "documents" }
  }'
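
A malformed connection string is a common cause of data source failures, so a quick local check can save an indexer run. The sketch below parses the `Key=Value;Key=Value` format returned by `az storage account show-connection-string` and reports missing settings. The helper names and the sample value (a placeholder, not a real credential) are assumptions for illustration; connection strings that use a SAS token instead of an account key would need a different check.

```python
def parse_connection_string(conn: str) -> dict:
    """Split 'Key=Value;Key=Value' pairs; values such as AccountKey may contain '='."""
    parts = {}
    for segment in conn.split(";"):
        if not segment:
            continue
        # partition splits on the first '=' only, preserving Base64 padding in the value.
        key, _, value = segment.partition("=")
        parts[key] = value
    return parts


def missing_settings(conn: str) -> list:
    """Return the required settings that are absent or empty."""
    required = ("AccountName", "AccountKey")
    parsed = parse_connection_string(conn)
    return [name for name in required if not parsed.get(name)]


# Placeholder values, not real credentials.
sample = (
    "DefaultEndpointsProtocol=https;AccountName=stai102searchdemo;"
    "AccountKey=ZmFrZS1rZXk=;EndpointSuffix=core.windows.net"
)
print(parse_connection_string(sample)["AccountName"], missing_settings(sample))
```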

Task 4: Create a skillset with built-in skills

Python SDK
C# SDK
REST API

import os

from azure.search.documents.indexes.models import (
    SearchIndexerSkillset,
    EntityRecognitionSkill,
    EntityRecognitionSkillVersion,
    KeyPhraseExtractionSkill,
    LanguageDetectionSkill,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    CognitiveServicesAccountKey,
)

# AI_KEY comes from Task 1 (export it in your shell first)
AI_KEY = os.environ["AI_KEY"]

# Define built-in skills
key_phrase_skill = KeyPhraseExtractionSkill(
    name="keyphrases-skill",
    description="Extract key phrases from content",
    context="/document",
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="keyPhrases", target_name="keyphrases")]
)

entity_skill = EntityRecognitionSkill(
    name="entity-skill",
    description="Recognize organizations",
    context="/document",
    # Request the V3 skill so the SDK matches the V3 type used in the REST example
    skill_version=EntityRecognitionSkillVersion.V3,
    categories=["Organization"],
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="organizations", target_name="organizations")]
)

language_skill = LanguageDetectionSkill(
    name="language-skill",
    description="Detect document language",
    context="/document",
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="languageCode", target_name="language")]
)

# Create skillset
skillset = SearchIndexerSkillset(
    name="document-skillset",
    description="Enrichment pipeline with key phrases, entities, and language",
    skills=[key_phrase_skill, entity_skill, language_skill],
    cognitive_services_account=CognitiveServicesAccountKey(key=AI_KEY)
)

result = indexer_client.create_or_update_skillset(skillset)
print(f"Skillset '{result.name}' created with {len(result.skills)} skills")

using Azure.Search.Documents.Indexes.Models;

var skills = new List<SearchIndexerSkill>
{
    new KeyPhraseExtractionSkill(
        inputs: new[] { new InputFieldMappingEntry("text") { Source = "/document/content" } },
        outputs: new[] { new OutputFieldMappingEntry("keyPhrases") { TargetName = "keyphrases" } })
    {
        Name = "keyphrases-skill",
        Context = "/document"
    },
    // Request the V3 skill so the SDK matches the V3 type used in the REST example
    new EntityRecognitionSkill(
        inputs: new[] { new InputFieldMappingEntry("text") { Source = "/document/content" } },
        outputs: new[] { new OutputFieldMappingEntry("organizations") { TargetName = "organizations" } },
        EntityRecognitionSkill.SkillVersion.V3)
    {
        Name = "entity-skill",
        Context = "/document",
        Categories = { EntityCategory.Organization }
    },
    new LanguageDetectionSkill(
        inputs: new[] { new InputFieldMappingEntry("text") { Source = "/document/content" } },
        outputs: new[] { new OutputFieldMappingEntry("languageCode") { TargetName = "language" } })
    {
        Name = "language-skill",
        Context = "/document"
    }
};

var skillset = new SearchIndexerSkillset("document-skillset", skills)
{
    Description = "Enrichment pipeline with key phrases, entities, and language",
    CognitiveServicesAccount = new CognitiveServicesAccountKey(aiKey)
};

await indexerClient.CreateOrUpdateSkillsetAsync(skillset);
Console.WriteLine("Skillset 'document-skillset' created");

curl -X PUT "https://${SEARCH_SERVICE}.search.windows.net/skillsets/document-skillset?api-version=2024-07-01" \
  -H "Content-Type: application/json" \
  -H "api-key: ${SEARCH_KEY}" \
  -d '{
    "name": "document-skillset",
    "description": "Enrichment pipeline with key phrases, entities, and language",
    "skills": [
      {
        "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
        "name": "keyphrases-skill",
        "context": "/document",
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "keyPhrases", "targetName": "keyphrases"}]
      },
      {
        "@odata.type": "#Microsoft.Skills.Text.V3.EntityRecognitionSkill",
        "name": "entity-skill",
        "context": "/document",
        "categories": ["Organization"],
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "organizations", "targetName": "organizations"}]
      },
      {
        "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
        "name": "language-skill",
        "context": "/document",
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "languageCode", "targetName": "language"}]
      }
    ],
    "cognitiveServices": {
      "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
      "key": "'"${AI_KEY}"'"
    }
  }'
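
Each skill output becomes a node in the enrichment tree at `<context>/<targetName>`, and those paths are what the indexer's output field mappings must reference. The sketch below derives the paths from a skillset shaped like the REST body above and flags mappings that point at a path no skill produces. The helper names are hypothetical, and the check is an exact string match, so it does not cover nested contexts such as `/document/pages/*`.

```python
skills = [
    {"name": "keyphrases-skill", "context": "/document",
     "outputs": [{"name": "keyPhrases", "targetName": "keyphrases"}]},
    {"name": "entity-skill", "context": "/document",
     "outputs": [{"name": "organizations", "targetName": "organizations"}]},
    {"name": "language-skill", "context": "/document",
     "outputs": [{"name": "languageCode", "targetName": "language"}]},
]


def enrichment_paths(skills):
    """Return the set of <context>/<targetName> paths the skills write to."""
    paths = set()
    for skill in skills:
        context = skill.get("context", "/document").rstrip("/")
        for output in skill["outputs"]:
            # When targetName is omitted, the output name itself is used.
            paths.add(f"{context}/{output.get('targetName', output['name'])}")
    return paths


def unresolved_mappings(skills, output_field_mappings):
    """Return source paths in the mappings that no skill output produces."""
    known = enrichment_paths(skills)
    return [m["sourceFieldName"] for m in output_field_mappings
            if m["sourceFieldName"] not in known]


mappings = [
    {"sourceFieldName": "/document/keyphrases", "targetFieldName": "keyphrases"},
    {"sourceFieldName": "organizations", "targetFieldName": "organizations"},  # prefix missing
]
print(sorted(enrichment_paths(skills)))
print(unresolved_mappings(skills, mappings))
```

The second mapping is reported because it lacks the `/document/` prefix, the same mistake that leaves enriched fields empty in the index.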

Task 5: Create and run the indexer

Python SDK
C# SDK
REST API

from azure.search.documents.indexes.models import (
    SearchIndexer,
    FieldMapping,
    FieldMappingFunction,
)

indexer = SearchIndexer(
    name="document-indexer",
    data_source_name="blob-datasource",
    target_index_name="documents-index",
    skillset_name="document-skillset",
    field_mappings=[
        # Base64-encode the blob URL so it only contains characters allowed in a document key
        FieldMapping(
            source_field_name="metadata_storage_path",
            target_field_name="id",
            mapping_function=FieldMappingFunction(name="base64Encode"),
        ),
        FieldMapping(source_field_name="metadata_storage_name", target_field_name="metadata_storage_name"),
    ],
    output_field_mappings=[
        FieldMapping(source_field_name="/document/keyphrases", target_field_name="keyphrases"),
        FieldMapping(source_field_name="/document/organizations", target_field_name="organizations"),
        FieldMapping(source_field_name="/document/language", target_field_name="language"),
    ]
)

result = indexer_client.create_or_update_indexer(indexer)
print(f"Indexer '{result.name}' created")

# Run the indexer
indexer_client.run_indexer(indexer.name)
print("Indexer running...")

# Check status
import time
time.sleep(10)
status = indexer_client.get_indexer_status(indexer.name)
print(f"Status: {status.last_result.status if status.last_result else 'running'}")

var indexer = new SearchIndexer("document-indexer", "blob-datasource", "documents-index")
{
    SkillsetName = "document-skillset",
    FieldMappings =
    {
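        // Base64-encode the blob URL so it only contains characters allowed in a document key
        new FieldMapping("metadata_storage_path") { TargetFieldName = "id", MappingFunction = new FieldMappingFunction("base64Encode") },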
        new FieldMapping("metadata_storage_name") { TargetFieldName = "metadata_storage_name" },
    },
    OutputFieldMappings =
    {
        new FieldMapping("/document/keyphrases") { TargetFieldName = "keyphrases" },
        new FieldMapping("/document/organizations") { TargetFieldName = "organizations" },
        new FieldMapping("/document/language") { TargetFieldName = "language" },
    }
};

await indexerClient.CreateOrUpdateIndexerAsync(indexer);
Console.WriteLine("Indexer 'document-indexer' created");

// Run the indexer
await indexerClient.RunIndexerAsync("document-indexer");
Console.WriteLine("Indexer running...");

// Check status
await Task.Delay(10000);
var status = await indexerClient.GetIndexerStatusAsync("document-indexer");
Console.WriteLine($"Status: {status.Value.LastResult?.Status}");

# Create indexer
curl -X PUT "https://${SEARCH_SERVICE}.search.windows.net/indexers/document-indexer?api-version=2024-07-01" \
  -H "Content-Type: application/json" \
  -H "api-key: ${SEARCH_KEY}" \
  -d '{
    "name": "document-indexer",
    "dataSourceName": "blob-datasource",
    "targetIndexName": "documents-index",
    "skillsetName": "document-skillset",
    "fieldMappings": [
      {"sourceFieldName": "metadata_storage_path", "targetFieldName": "id", "mappingFunction": {"name": "base64Encode"}},
      {"sourceFieldName": "metadata_storage_name", "targetFieldName": "metadata_storage_name"}
    ],
    "outputFieldMappings": [
      {"sourceFieldName": "/document/keyphrases", "targetFieldName": "keyphrases"},
      {"sourceFieldName": "/document/organizations", "targetFieldName": "organizations"},
      {"sourceFieldName": "/document/language", "targetFieldName": "language"}
    ]
  }'

# Run the indexer
curl -X POST "https://${SEARCH_SERVICE}.search.windows.net/indexers/document-indexer/run?api-version=2024-07-01" \
  -H "api-key: ${SEARCH_KEY}"

# Check indexer status
curl -s "https://${SEARCH_SERVICE}.search.windows.net/indexers/document-indexer/status?api-version=2024-07-01" \
  -H "api-key: ${SEARCH_KEY}" | python -m json.tool
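
The examples above wait a fixed 10 seconds before checking status, which may be too short for larger containers. One alternative is to poll until the last run leaves the `inProgress` state. The sketch below shows the loop in isolation: `wait_for_indexer` is a hypothetical helper that takes any callable returning the current status string, and a canned sequence stands in for real calls to the status endpoint.

```python
import time


def wait_for_indexer(get_status, timeout_s=300.0, interval_s=5.0, sleep=time.sleep):
    """Poll get_status() until it returns something other than None or 'inProgress'."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status not in (None, "inProgress"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("indexer did not finish before the timeout")
        sleep(interval_s)


# Stand-in for a real status call, e.g. reading lastResult.status from the status endpoint.
fake_statuses = iter([None, "inProgress", "inProgress", "success"])
final_status = wait_for_indexer(lambda: next(fake_statuses), sleep=lambda _: None)
print(final_status)
```

In a real script, `get_status` could wrap `indexer_client.get_indexer_status(...)` and return `last_result.status` when a result exists. A final status of `transientFailure` or `persistentFailure` means the run ended with errors that are worth inspecting.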

Expected Output

After the indexer completes, querying the index should return enriched documents:

{
  "value": [
    {
      "id": "aHR0cHM6Ly9...",
      "content": "Azure AI services provide cloud-based AI capabilities...",
      "metadata_storage_name": "sample-doc.txt",
      "keyphrases": ["cloud-based AI capabilities", "cognitive services", "Azure AI services"],
      "organizations": ["Microsoft"],
      "language": "en"
    }
  ]
}
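
One way to issue such a query is the REST search endpoint (`POST /indexes/{index}/docs/search`). The sketch below only builds the request with the standard library and does not send it. The `build_search_request` helper, the service name, and the key placeholder are assumptions for illustration.

```python
import json
import urllib.request


def build_search_request(service, index, api_key, search_text, api_version="2024-07-01"):
    """Build (but do not send) a POST request against the index's search endpoint."""
    url = (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/search?api-version={api_version}"
    )
    body = {
        "search": search_text,
        "select": "metadata_storage_name,keyphrases,organizations,language",
        "facets": ["organizations"],
        "count": True,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


# Hypothetical service name and placeholder key.
request = build_search_request("search-ai102-example", "documents-index", "<search-key>", "Azure")
print(request.get_method(), request.full_url)

# To send it against a real service:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["value"])
```

Faceting on `organizations` works here because that field is marked facetable in the index schema.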

Break & fix

#	Scenario	Symptom	Root Cause	Fix
1	Indexer fails with "Could not execute skill"	Indexer status shows `transientFailure`	The AI Services key is invalid or the resource is in a different region from the search service	Make sure AI Services is in the same region; update the key in the skillset
2	Enriched fields are null in the index	Documents are indexed but `keyphrases` and `organizations` are empty	The output field mappings use incorrect source paths (e.g., missing `/document/` prefix)	Fix the source paths in `outputFieldMappings` so each one is `/document/` followed by the skill output's `targetName`
3	Indexer cannot connect to Blob Storage	`StorageException: Access denied`	The storage connection string is invalid or the container does not exist	Verify the connection string and container name in the data source definition
4	Index creation fails with "analyzer not found"	HTTP 400 on index creation	Analyzer name mistyped (e.g., `en.Microsoft` instead of `en.microsoft`)	Use the exact analyzer name; analyzer names are case-sensitive
5	Duplicate documents in the index after a re-run	Document count doubles with each run	Document key mapping is missing or incorrect; `metadata_storage_path` needs Base64 encoding	Use `metadata_storage_path` with the `base64Encode` mapping function as the key
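
Scenario 5 depends on the document key being a stable, key-safe encoding of the blob URL. The sketch below illustrates the idea with URL-safe Base64 from the Python standard library. It is an approximation: the service's `base64Encode` function may represent padding differently, so the output is not guaranteed to match real keys character for character. The storage account name is hypothetical.

```python
import base64


def encode_key(blob_url: str) -> str:
    """URL-safe Base64 without '=' padding, so the result uses only key-safe characters."""
    return base64.urlsafe_b64encode(blob_url.encode("utf-8")).decode("ascii").rstrip("=")


def decode_key(key: str) -> str:
    """Restore the padding that encode_key stripped, then decode back to the URL."""
    padding = "=" * (-len(key) % 4)
    return base64.urlsafe_b64decode(key + padding).decode("utf-8")


# Hypothetical storage account name.
url = "https://stai102searchdemo.blob.core.windows.net/documents/sample-doc.txt"
key = encode_key(url)
print(key[:11])         # aHR0cHM6Ly9
print(decode_key(key))  # the original URL
```

Because the same blob URL always encodes to the same key, re-running the indexer updates the existing document instead of adding a new one. The `aHR0cHM6Ly9` prefix is the Base64 form of `https://` followed by a lowercase letter.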

Knowledge Check

1. You need to enrich documents with key phrases and entity recognition during indexing. Which Azure AI Search component orchestrates this enrichment?

2. You define a KeyPhraseExtractionSkill in your skillset. The skill output is 'keyPhrases' with targetName 'keyphrases'. Which path do you use in outputFieldMappings to map it to the index?

3. Which built-in skill would you use to extract text from scanned PDF documents that contain images?

4. You create an index field with the attributes searchable=true, filterable=true, facetable=true. For which field type is this configuration INVALID?

5. Your indexer needs an Azure AI Services resource to run built-in cognitive skills. What happens if you don't attach one?

Cleanup

az group delete --name rg-ai102-search --yes --no-wait
