Challenge 40: Azure AI Search - Index and Skillset
60-75 min | Cost: ~$0.50 (Search free tier + Storage) | Domain: Knowledge Mining & Extraction (15-20%)
Exam skills covered
| Skill | Weight |
|---|---|
| Provision an Azure AI Search resource | High |
| Create a data source | High |
| Create an index | High |
| Create and run an indexer | High |
| Create a skillset with built-in skills | High |
| Map enriched fields to an index | Medium |
Overview
Azure AI Search is a cloud search service that provides indexing and query capabilities over heterogeneous content. The enrichment pipeline follows this architecture:
Data Source → Indexer → Skillset (AI enrichment) → Index (searchable storage)
Key concepts:
- Data source: Connection to the content (Blob Storage, SQL Database, Cosmos DB, Table Storage)
- Index: Schema that defines the searchable fields, with their types and attributes (searchable, filterable, sortable, facetable)
- Skillset: Collection of AI skills that enrich the content during indexing (entity recognition, key phrase extraction, language detection, OCR, image analysis)
- Indexer: Orchestrator that pulls data from the source, runs the skillset, and populates the index
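To make these roles concrete, here is a minimal in-memory sketch of the same flow in plain Python. Nothing in it calls Azure: `detect_language` and `extract_key_phrases` are hypothetical toy stand-ins for built-in skills, and `run_indexer` is an illustrative name, not an SDK call.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Skill = Callable[[Document], Dict[str, object]]


def detect_language(doc: Document) -> Dict[str, object]:
    """Toy stand-in for a language detection skill (always answers 'en')."""
    return {"language": "en"}


def extract_key_phrases(doc: Document) -> Dict[str, object]:
    """Toy stand-in for key phrase extraction: keeps words longer than 8 letters."""
    words = str(doc["content"]).replace(".", " ").replace(",", " ").split()
    return {"keyphrases": sorted({w for w in words if len(w) > 8})}


def run_indexer(data_source: List[Document], skillset: List[Skill]) -> List[Document]:
    """Pull each document from the source, enrich it, and add it to the index."""
    index: List[Document] = []
    for doc in data_source:
        enriched = dict(doc)                  # start from the source fields
        for skill in skillset:
            enriched.update(skill(enriched))  # each skill adds new fields
        index.append(enriched)
    return index


data_source = [{"id": "1", "content": "Azure offers cognitive services for vision."}]
index = run_indexer(data_source, [detect_language, extract_key_phrases])
print(index[0]["language"], index[0]["keyphrases"])
```

The point of the sketch is the division of labor: the index only stores what it is handed, the skills only compute new fields, and the indexer is the piece that connects them.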
Architecture
┌─────────────┐     ┌──────────┐     ┌────────────┐     ┌─────────┐
│ Blob Storage│────▶│ Indexer  │────▶│  Skillset  │────▶│  Index  │
│ (PDFs, imgs)│     │          │     │ (AI Skills)│     │(search) │
└─────────────┘     └──────────┘     └─────┬──────┘     └─────────┘
                                           │
                                     ┌─────┴──────┐
                                     │ AI Services│
                                     │  (multi)   │
                                     └────────────┘
Prerequisites
- Azure subscription with the Contributor role
- Azure CLI 2.60+
- Python 3.9+ with azure-search-documents>=11.4.0 and azure-identity
- .NET 8 SDK with the Azure.Search.Documents NuGet package
- A storage account with sample PDF/text documents uploaded to a container
Implementation
Task 1: Provision Azure AI Search and upload sample data
# Variables
RG="rg-ai102-search"
LOCATION="eastus"
SEARCH_SERVICE="search-ai102-$(openssl rand -hex 4)"
STORAGE_ACCOUNT="stai102search$(openssl rand -hex 4)"
CONTAINER="documents"
AI_SERVICE="ai-services-ai102"
# Create resource group
az group create --name $RG --location $LOCATION
# Create Azure AI Search (Free tier for lab)
az search service create \
--name $SEARCH_SERVICE \
--resource-group $RG \
--location $LOCATION \
--sku free
# Create storage account and container
az storage account create \
--name $STORAGE_ACCOUNT \
--resource-group $RG \
--location $LOCATION \
--sku Standard_LRS
az storage container create \
--name $CONTAINER \
--account-name $STORAGE_ACCOUNT
# Upload sample documents (create a sample text file)
echo "Azure AI services provide cloud-based AI capabilities. Microsoft Azure offers cognitive services for vision, speech, language, and decision." > sample-doc.txt
az storage blob upload \
--account-name $STORAGE_ACCOUNT \
--container-name $CONTAINER \
--name "sample-doc.txt" \
--file "sample-doc.txt"
# Create Azure AI Services (multi-service) for skillset
az cognitiveservices account create \
--name $AI_SERVICE \
--resource-group $RG \
--location $LOCATION \
--kind AIServices \
--sku S0 \
--yes
# Get keys
SEARCH_KEY=$(az search admin-key show \
--resource-group $RG \
--service-name $SEARCH_SERVICE \
--query "primaryKey" -o tsv)
STORAGE_CONN=$(az storage account show-connection-string \
--name $STORAGE_ACCOUNT \
--resource-group $RG \
--query "connectionString" -o tsv)
AI_KEY=$(az cognitiveservices account keys list \
--name $AI_SERVICE \
--resource-group $RG \
--query "key1" -o tsv)
Task 2: Create the search index
- Python SDK
- C# SDK
- REST API
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
)

# Configuration: assumes the Task 1 shell variables were exported, e.g.
#   export SEARCH_SERVICE SEARCH_KEY STORAGE_CONN AI_KEY
SEARCH_SERVICE = os.environ["SEARCH_SERVICE"]
SEARCH_KEY = os.environ["SEARCH_KEY"]
STORAGE_CONN = os.environ["STORAGE_CONN"]  # used in Task 3
AI_KEY = os.environ["AI_KEY"]              # used in Task 4

endpoint = f"https://{SEARCH_SERVICE}.search.windows.net"
credential = AzureKeyCredential(SEARCH_KEY)
index_client = SearchIndexClient(endpoint=endpoint, credential=credential)

# Define the index schema
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True, filterable=True),
    # SearchableField is always a string field; pass collection=True for Collection(Edm.String)
    SearchableField(name="content", analyzer_name="en.microsoft"),
    SearchableField(name="metadata_storage_name", filterable=True, sortable=True),
    SimpleField(name="metadata_storage_path", type=SearchFieldDataType.String, filterable=True),
    SearchableField(name="keyphrases", collection=True, filterable=True, facetable=True),
    SearchableField(name="organizations", collection=True, filterable=True, facetable=True),
    SimpleField(name="language", type=SearchFieldDataType.String, filterable=True, facetable=True),
]

index = SearchIndex(name="documents-index", fields=fields)
result = index_client.create_or_update_index(index)
print(f"Index '{result.name}' created successfully")
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;
var endpoint = new Uri($"https://{searchService}.search.windows.net");
var credential = new AzureKeyCredential(searchKey);
var indexClient = new SearchIndexClient(endpoint, credential);
var fields = new List<SearchField>
{
new SimpleField("id", SearchFieldDataType.String) { IsKey = true, IsFilterable = true },
new SearchableField("content") { AnalyzerName = LexicalAnalyzerName.EnMicrosoft },
new SearchableField("metadata_storage_name") { IsFilterable = true, IsSortable = true },
new SimpleField("metadata_storage_path", SearchFieldDataType.String) { IsFilterable = true },
new SearchableField("keyphrases", collection: true) { IsFilterable = true, IsFacetable = true },
new SearchableField("organizations", collection: true) { IsFilterable = true, IsFacetable = true },
new SimpleField("language", SearchFieldDataType.String) { IsFilterable = true, IsFacetable = true },
};
var index = new SearchIndex("documents-index", fields);
var result = await indexClient.CreateOrUpdateIndexAsync(index);
Console.WriteLine($"Index '{result.Value.Name}' created successfully");
curl -X PUT "https://${SEARCH_SERVICE}.search.windows.net/indexes/documents-index?api-version=2024-07-01" \
-H "Content-Type: application/json" \
-H "api-key: ${SEARCH_KEY}" \
-d '{
"name": "documents-index",
"fields": [
{"name": "id", "type": "Edm.String", "key": true, "filterable": true},
{"name": "content", "type": "Edm.String", "searchable": true, "analyzer": "en.microsoft"},
{"name": "metadata_storage_name", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true},
{"name": "metadata_storage_path", "type": "Edm.String", "filterable": true},
{"name": "keyphrases", "type": "Collection(Edm.String)", "searchable": true, "filterable": true, "facetable": true},
{"name": "organizations", "type": "Collection(Edm.String)", "searchable": true, "filterable": true, "facetable": true},
{"name": "language", "type": "Edm.String", "filterable": true, "facetable": true}
]
}'
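Before sending a schema like the one above, a few structural rules can be checked locally. The sketch below is a hypothetical helper (`validate_index_fields` is not part of any SDK) that works on the REST-style `fields` array and checks only three rules: exactly one key field, the key must be `Edm.String`, and collection fields cannot be sortable. The service enforces more rules than these.

```python
from typing import Dict, List


def validate_index_fields(fields: List[Dict[str, object]]) -> List[str]:
    """Return a list of problems found in an index 'fields' array (empty if none)."""
    problems: List[str] = []
    keys = [f for f in fields if f.get("key")]
    if len(keys) != 1:
        problems.append(f"expected exactly one key field, found {len(keys)}")
    for f in keys:
        if f["type"] != "Edm.String":
            problems.append(f"key field '{f['name']}' must be Edm.String")
    for f in fields:
        is_collection = str(f["type"]).startswith("Collection(")
        if is_collection and f.get("sortable"):
            problems.append(f"collection field '{f['name']}' cannot be sortable")
    return problems


fields = [
    {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
    {"name": "content", "type": "Edm.String", "searchable": True},
    {"name": "keyphrases", "type": "Collection(Edm.String)", "searchable": True,
     "filterable": True, "facetable": True},
]
print(validate_index_fields(fields))  # no problems for this schema

# A hypothetical extra field that breaks the collection rule
bad = fields + [{"name": "tags", "type": "Collection(Edm.String)", "sortable": True}]
print(validate_index_fields(bad))
```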
Task 3: Create the data source connection
- Python SDK
- C# SDK
- REST API
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import SearchIndexerDataSourceConnection, SearchIndexerDataContainer
indexer_client = SearchIndexerClient(endpoint=endpoint, credential=credential)
data_source = SearchIndexerDataSourceConnection(
name="blob-datasource",
type="azureblob",
connection_string=STORAGE_CONN,
container=SearchIndexerDataContainer(name="documents")
)
result = indexer_client.create_or_update_data_source_connection(data_source)
print(f"Data source '{result.name}' created")
using Azure.Search.Documents.Indexes.Models;
var indexerClient = new SearchIndexerClient(endpoint, credential);
var dataSource = new SearchIndexerDataSourceConnection(
name: "blob-datasource",
type: SearchIndexerDataSourceType.AzureBlob,
connectionString: storageConnectionString,
container: new SearchIndexerDataContainer("documents"));
await indexerClient.CreateOrUpdateDataSourceConnectionAsync(dataSource);
Console.WriteLine("Data source 'blob-datasource' created");
curl -X PUT "https://${SEARCH_SERVICE}.search.windows.net/datasources/blob-datasource?api-version=2024-07-01" \
-H "Content-Type: application/json" \
-H "api-key: ${SEARCH_KEY}" \
-d '{
"name": "blob-datasource",
"type": "azureblob",
"credentials": { "connectionString": "'"${STORAGE_CONN}"'" },
"container": { "name": "documents" }
}'
Task 4: Create a skillset with built-in skills
- Python SDK
- C# SDK
- REST API
from azure.search.documents.indexes.models import (
    SearchIndexerSkillset,
    EntityRecognitionSkill,
    EntityRecognitionSkillVersion,
    KeyPhraseExtractionSkill,
    LanguageDetectionSkill,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    CognitiveServicesAccountKey,
)

# Define built-in skills
key_phrase_skill = KeyPhraseExtractionSkill(
    name="keyphrases-skill",
    description="Extract key phrases from content",
    context="/document",
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="keyPhrases", target_name="keyphrases")],
)

entity_skill = EntityRecognitionSkill(
    name="entity-skill",
    description="Recognize organizations",
    context="/document",
    categories=["Organization"],
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="organizations", target_name="organizations")],
    # Request the V3 skill explicitly, matching the REST definition in this task
    skill_version=EntityRecognitionSkillVersion.V3,
)

language_skill = LanguageDetectionSkill(
    name="language-skill",
    description="Detect document language",
    context="/document",
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="languageCode", target_name="language")],
)

# Create skillset
skillset = SearchIndexerSkillset(
    name="document-skillset",
    description="Enrichment pipeline with key phrases, entities, and language",
    skills=[key_phrase_skill, entity_skill, language_skill],
    cognitive_services_account=CognitiveServicesAccountKey(key=AI_KEY),
)

result = indexer_client.create_or_update_skillset(skillset)
print(f"Skillset '{result.name}' created with {len(result.skills)} skills")
using Azure.Search.Documents.Indexes.Models;
var skills = new List<SearchIndexerSkill>
{
new KeyPhraseExtractionSkill(
inputs: new[] { new InputFieldMappingEntry("text") { Source = "/document/content" } },
outputs: new[] { new OutputFieldMappingEntry("keyPhrases") { TargetName = "keyphrases" } })
{
Name = "keyphrases-skill",
Context = "/document"
},
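// Request the V3 skill explicitly, matching the REST definition in this task
new EntityRecognitionSkill(
inputs: new[] { new InputFieldMappingEntry("text") { Source = "/document/content" } },
outputs: new[] { new OutputFieldMappingEntry("organizations") { TargetName = "organizations" } },
skillVersion: EntityRecognitionSkill.SkillVersion.V3)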
{
Name = "entity-skill",
Context = "/document",
Categories = { EntityCategory.Organization }
},
new LanguageDetectionSkill(
inputs: new[] { new InputFieldMappingEntry("text") { Source = "/document/content" } },
outputs: new[] { new OutputFieldMappingEntry("languageCode") { TargetName = "language" } })
{
Name = "language-skill",
Context = "/document"
}
};
var skillset = new SearchIndexerSkillset("document-skillset", skills)
{
Description = "Enrichment pipeline with key phrases, entities, and language",
CognitiveServicesAccount = new CognitiveServicesAccountKey(aiKey)
};
await indexerClient.CreateOrUpdateSkillsetAsync(skillset);
Console.WriteLine("Skillset 'document-skillset' created");
curl -X PUT "https://${SEARCH_SERVICE}.search.windows.net/skillsets/document-skillset?api-version=2024-07-01" \
-H "Content-Type: application/json" \
-H "api-key: ${SEARCH_KEY}" \
-d '{
"name": "document-skillset",
"description": "Enrichment pipeline with key phrases, entities, and language",
"skills": [
{
"@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
"name": "keyphrases-skill",
"context": "/document",
"inputs": [{"name": "text", "source": "/document/content"}],
"outputs": [{"name": "keyPhrases", "targetName": "keyphrases"}]
},
{
"@odata.type": "#Microsoft.Skills.Text.V3.EntityRecognitionSkill",
"name": "entity-skill",
"context": "/document",
"categories": ["Organization"],
"inputs": [{"name": "text", "source": "/document/content"}],
"outputs": [{"name": "organizations", "targetName": "organizations"}]
},
{
"@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
"name": "language-skill",
"context": "/document",
"inputs": [{"name": "text", "source": "/document/content"}],
"outputs": [{"name": "languageCode", "targetName": "language"}]
}
],
"cognitiveServices": {
"@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
"key": "'"${AI_KEY}"'"
}
}'
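Each skill output in the definition above is written to the enrichment tree at the skill's context plus the output's `targetName`, and those paths are what the indexer's output field mappings must reference. The sketch below derives them from a REST-style skillset body; `enrichment_paths` is a hypothetical helper name, and the dictionary is a trimmed copy of the skillset above, not a complete definition.

```python
from typing import Dict, List


def enrichment_paths(skillset: Dict[str, list]) -> List[str]:
    """List the enrichment-tree path of every skill output (context + '/' + targetName)."""
    paths: List[str] = []
    for skill in skillset["skills"]:
        context = skill.get("context", "/document").rstrip("/")
        for output in skill["outputs"]:
            # targetName is optional; the output's own name is used when it is absent
            node = output.get("targetName", output["name"])
            paths.append(f"{context}/{node}")
    return paths


skillset = {
    "skills": [
        {"context": "/document",
         "outputs": [{"name": "keyPhrases", "targetName": "keyphrases"}]},
        {"context": "/document",
         "outputs": [{"name": "organizations", "targetName": "organizations"}]},
        {"context": "/document",
         "outputs": [{"name": "languageCode", "targetName": "language"}]},
    ]
}
for path in enrichment_paths(skillset):
    print(path)
```

Note that the path uses the `targetName` (`keyphrases`), not the skill's native output name (`keyPhrases`).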
Task 5: Create and run the indexer
- Python SDK
- C# SDK
- REST API
import time

from azure.search.documents.indexes.models import (
    SearchIndexer,
    FieldMapping,
    FieldMappingFunction,
)

indexer = SearchIndexer(
    name="document-indexer",
    data_source_name="blob-datasource",
    target_index_name="documents-index",
    skillset_name="document-skillset",
    field_mappings=[
        # Blob paths contain characters that are invalid in document keys, so encode them
        FieldMapping(
            source_field_name="metadata_storage_path",
            target_field_name="id",
            mapping_function=FieldMappingFunction(name="base64Encode"),
        ),
        FieldMapping(source_field_name="metadata_storage_name", target_field_name="metadata_storage_name"),
    ],
    output_field_mappings=[
        FieldMapping(source_field_name="/document/keyphrases", target_field_name="keyphrases"),
        FieldMapping(source_field_name="/document/organizations", target_field_name="organizations"),
        FieldMapping(source_field_name="/document/language", target_field_name="language"),
    ],
)

result = indexer_client.create_or_update_indexer(indexer)
print(f"Indexer '{result.name}' created")

# Run the indexer
indexer_client.run_indexer(indexer.name)
print("Indexer running...")

# Check status (a fixed wait is enough for one small file; larger sets need polling)
time.sleep(10)
status = indexer_client.get_indexer_status(indexer.name)
print(f"Status: {status.last_result.status if status.last_result else 'running'}")
var indexer = new SearchIndexer("document-indexer", "blob-datasource", "documents-index")
{
SkillsetName = "document-skillset",
FieldMappings =
{
new FieldMapping("metadata_storage_path") { TargetFieldName = "id", MappingFunction = new FieldMappingFunction("base64Encode") },
new FieldMapping("metadata_storage_name") { TargetFieldName = "metadata_storage_name" },
},
OutputFieldMappings =
{
new FieldMapping("/document/keyphrases") { TargetFieldName = "keyphrases" },
new FieldMapping("/document/organizations") { TargetFieldName = "organizations" },
new FieldMapping("/document/language") { TargetFieldName = "language" },
}
};
await indexerClient.CreateOrUpdateIndexerAsync(indexer);
Console.WriteLine("Indexer 'document-indexer' created");
// Run the indexer
await indexerClient.RunIndexerAsync("document-indexer");
Console.WriteLine("Indexer running...");
// Check status
await Task.Delay(10000);
var status = await indexerClient.GetIndexerStatusAsync("document-indexer");
Console.WriteLine($"Status: {status.Value.LastResult?.Status}");
# Create indexer
curl -X PUT "https://${SEARCH_SERVICE}.search.windows.net/indexers/document-indexer?api-version=2024-07-01" \
-H "Content-Type: application/json" \
-H "api-key: ${SEARCH_KEY}" \
-d '{
"name": "document-indexer",
"dataSourceName": "blob-datasource",
"targetIndexName": "documents-index",
"skillsetName": "document-skillset",
"fieldMappings": [
{"sourceFieldName": "metadata_storage_path", "targetFieldName": "id", "mappingFunction": {"name": "base64Encode"}},
{"sourceFieldName": "metadata_storage_name", "targetFieldName": "metadata_storage_name"}
],
"outputFieldMappings": [
{"sourceFieldName": "/document/keyphrases", "targetFieldName": "keyphrases"},
{"sourceFieldName": "/document/organizations", "targetFieldName": "organizations"},
{"sourceFieldName": "/document/language", "targetFieldName": "language"}
]
}'
# Run the indexer
curl -X POST "https://${SEARCH_SERVICE}.search.windows.net/indexers/document-indexer/run?api-version=2024-07-01" \
-H "api-key: ${SEARCH_KEY}"
# Check indexer status
curl -s "https://${SEARCH_SERVICE}.search.windows.net/indexers/document-indexer/status?api-version=2024-07-01" \
-H "api-key: ${SEARCH_KEY}" | python -m json.tool
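A single status check after a fixed delay can miss the result when the run takes longer. A polling loop is one way around that; the sketch below is SDK-agnostic and assumes only that the caller supplies a function returning the last run's status as a string (or `None` when no run has been recorded yet). `wait_for_indexer` is a hypothetical helper, and the status source here is simulated so the example runs without an Azure resource.

```python
import time
from typing import Callable, Optional


def wait_for_indexer(get_last_status: Callable[[], Optional[str]],
                     timeout_s: float = 300.0,
                     interval_s: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the last run reports something other than 'inProgress'."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_last_status()
        if status is not None and status != "inProgress":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("indexer did not finish before the timeout")
        sleep(interval_s)


# Simulated status source: no result yet, two in-progress polls, then success.
responses = iter([None, "inProgress", "inProgress", "success"])
final = wait_for_indexer(lambda: next(responses), interval_s=0.0)
print(final)
```

With the Python SDK from this task, the callable could wrap `indexer_client.get_indexer_status("document-indexer")` and return the `last_result.status` value as a string when `last_result` is present.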
Expected Output
After the indexer completes, querying the index should return enriched documents:
{
"value": [
{
"id": "aHR0cHM6Ly9...",
"content": "Azure AI services provide cloud-based AI capabilities...",
"metadata_storage_name": "sample-doc.txt",
"keyphrases": ["cloud-based AI capabilities", "cognitive services", "Azure AI services"],
"organizations": ["Microsoft"],
"language": "en"
}
]
}
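One way to produce a response shaped like this is the Search Documents REST operation. The sketch below only builds the request with the standard library so it can be inspected without network access; `build_search_request`, the service name, and the key are placeholders chosen for illustration, and the API version matches the one used in the curl commands of this lab.

```python
import json
import urllib.request


def build_search_request(service: str, index: str, api_key: str,
                         search_text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Search Documents REST operation."""
    url = (f"https://{service}.search.windows.net/indexes/{index}"
           "/docs/search?api-version=2024-07-01")
    body = {
        "search": search_text,
        "select": "id,metadata_storage_name,keyphrases,organizations,language",
        "count": True,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


# Placeholder service name and key: replace with the values from Task 1.
request = build_search_request("my-search-service", "documents-index",
                               "<query-or-admin-key>", "Azure")
print(request.get_method(), request.full_url)
# To send it: json.load(urllib.request.urlopen(request))["value"]
```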
Break & fix
| # | Scenario | Symptom | Root Cause | Fix |
|---|---|---|---|---|
| 1 | Indexer fails with "Could not execute skill" | Indexer status shows transientFailure | The AI Services key is invalid or the resource is in a different region from the search service | Make sure the AI Services resource is in the same region; update the key in the skillset |
| 2 | Enriched fields are null in the index | Documents are indexed but keyphrases and organizations are empty | The output field mappings use incorrect source paths (e.g., missing /document/ prefix) | Fix the source paths in outputFieldMappings so they match the targetName of the skill output, prefixed with /document/ |
| 3 | Indexer cannot connect to Blob Storage | StorageException: Access denied | The storage connection string is invalid or the container does not exist | Check the connection string and the container name in the data source definition |
| 4 | Index creation fails with "analyzer not found" | HTTP 400 on index creation | Analyzer name mistyped (e.g., en.Microsoft instead of en.microsoft) | Use the correct analyzer name; analyzer names are case-sensitive |
| 5 | Duplicate documents in the index after a re-run | Document count doubles on each run | Document key mapping is missing or incorrect: metadata_storage_path needs Base64 encoding | Use metadata_storage_path with the base64Encode mapping function as the key |
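Row 5 relies on Base64-encoding the blob path so it becomes a valid, stable document key. The sketch below illustrates the idea with Python's URL-safe Base64 and stripped padding. It is an approximation for intuition only: the service's `base64Encode` mapping function handles padding according to its own parameters, so the last characters of a real key may differ. The blob URL is hypothetical.

```python
import base64


def encode_key(path: str) -> str:
    """URL-safe Base64 of a blob path, with '=' padding stripped."""
    return base64.urlsafe_b64encode(path.encode("utf-8")).decode("ascii").rstrip("=")


def decode_key(key: str) -> str:
    """Reverse of encode_key: restore the padding, then decode."""
    padded = key + "=" * (-len(key) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


# Hypothetical blob URL used only for illustration.
path = "https://example.blob.core.windows.net/documents/sample-doc.txt"
key = encode_key(path)
print(key[:11])                 # same prefix as the "id" in the expected output
print(decode_key(key) == path)  # the encoding is reversible
```

Because the same path always encodes to the same key, re-running the indexer updates the existing document instead of adding a new one.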
Knowledge Check
1. You need to enrich documents with key phrases and entity recognition during indexing. Which Azure AI Search component orchestrates this enrichment?
2. You define a KeyPhraseExtractionSkill in your skillset. The skill's output is 'keyPhrases' with targetName 'keyphrases'. Which path do you use in outputFieldMappings to map it to the index?
3. Which built-in skill would you use to extract text from scanned PDF documents that contain images?
4. You create an index field with the attributes searchable=true, filterable=true, facetable=true. For which field type is this configuration INVALID?
5. Your indexer needs an Azure AI Services resource to run built-in cognitive skills. What happens if you don't attach one?
Cleanup
az group delete --name rg-ai102-search --yes --no-wait