
Challenge 40: Azure AI Search — Index and Skillset

Estimated Time

60-75 min | Cost: ~$0.50 (Free tier Search + Storage) | Domain: Knowledge Mining & Extraction (15-20%)

Exam skills covered

Skill | Weight
Provision an Azure AI Search resource | High
Create a data source | High
Create an index | High
Create and run an indexer | High
Create a skillset with built-in skills | High
Map enriched fields to an index | Medium

Overview

Azure AI Search is a cloud search service that provides indexing and querying capabilities over heterogeneous content. The enrichment pipeline follows this architecture:

Data Source → Indexer → Skillset (AI enrichment) → Index (searchable store)

Key concepts:

  • Data source: Connection to content (Blob Storage, SQL Database, Cosmos DB, Table Storage)
  • Index: Schema defining searchable fields with types, attributes (searchable, filterable, sortable, facetable)
  • Skillset: Collection of AI skills that enrich content during indexing (entity recognition, key phrase extraction, language detection, OCR, image analysis)
  • Indexer: Orchestrator that pulls data from the source, runs the skillset, and populates the index
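To make the flow concrete, here is a toy, in-memory sketch of how the four components relate. It does not call Azure at all: `run_indexer`, the two `fake_*` skills, and the list-of-dicts "index" are hypothetical stand-ins chosen for illustration, not part of any SDK.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Skill = Callable[[Document], Dict[str, object]]


def fake_language_skill(doc: Document) -> Dict[str, object]:
    """Hypothetical stand-in for a language detection skill: always reports English."""
    return {"language": "en"}


def fake_keyphrase_skill(doc: Document) -> Dict[str, object]:
    """Hypothetical stand-in: treats capitalized words as 'key phrases'."""
    words = str(doc["content"]).replace(".", "").split()
    return {"keyphrases": sorted({w for w in words if w[0].isupper()})}


def run_indexer(data_source: List[Document], skillset: List[Skill],
                index_fields: List[str]) -> List[Document]:
    """Pull each document, run every skill on it, keep only the fields in the index schema."""
    index = []
    for raw in data_source:
        enriched = dict(raw)
        for skill in skillset:
            enriched.update(skill(enriched))
        index.append({name: enriched.get(name) for name in index_fields})
    return index


data_source = [{"id": "1", "content": "Azure offers cognitive services."}]
index = run_indexer(
    data_source,
    skillset=[fake_language_skill, fake_keyphrase_skill],
    index_fields=["id", "content", "language", "keyphrases"],
)
print(index)
```

The real service differs in many ways (skills run remotely, outputs land in an enrichment tree, and mappings are declared explicitly), but the division of responsibilities is the same.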

Architecture

[Figure: Challenge 40 - AI Search Indexing Pipeline]

Prerequisites

  • Azure subscription with Contributor role
  • Azure CLI 2.60+
  • Python 3.9+ with azure-search-documents>=11.4.0 and azure-identity
  • .NET 8 SDK with Azure.Search.Documents NuGet package
  • A storage account with sample PDF/text documents uploaded to a container

Implementation

Task 1: Provision Azure AI Search and upload sample data

# Variables
RG="rg-ai102-search"
LOCATION="eastus"
SEARCH_SERVICE="search-ai102-$(openssl rand -hex 4)"
STORAGE_ACCOUNT="stai102search$(openssl rand -hex 4)"
CONTAINER="documents"
AI_SERVICE="ai-services-ai102"

# Create resource group
az group create --name $RG --location $LOCATION

# Create Azure AI Search (Free tier for lab)
az search service create \
--name $SEARCH_SERVICE \
--resource-group $RG \
--location $LOCATION \
--sku free

# Create storage account and container
az storage account create \
--name $STORAGE_ACCOUNT \
--resource-group $RG \
--location $LOCATION \
--sku Standard_LRS

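# --auth-mode login authenticates with your Microsoft Entra identity, which needs a
# data-plane role (for example, Storage Blob Data Contributor) on the storage account;
# the Contributor role alone does not grant access to blob data.
az storage container create \
--name $CONTAINER \
--account-name $STORAGE_ACCOUNT \
--auth-mode login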

# Upload sample documents (create a sample text file)
echo "Azure AI services provide cloud-based AI capabilities. Microsoft Azure offers cognitive services for vision, speech, language, and decision." > sample-doc.txt
az storage blob upload \
--account-name $STORAGE_ACCOUNT \
--container-name $CONTAINER \
--name "sample-doc.txt" \
--file "sample-doc.txt" \
--auth-mode login

# Create Azure AI Services (multi-service) for skillset
az cognitiveservices account create \
--name $AI_SERVICE \
--resource-group $RG \
--location $LOCATION \
--kind AIServices \
--sku S0 \
--yes

# Get keys
SEARCH_KEY=$(az search admin-key show \
--resource-group $RG \
--service-name $SEARCH_SERVICE \
--query "primaryKey" -o tsv)

STORAGE_CONN=$(az storage account show-connection-string \
--name $STORAGE_ACCOUNT \
--resource-group $RG \
--query "connectionString" -o tsv)

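AI_KEY=$(az cognitiveservices account keys list \
--name $AI_SERVICE \
--resource-group $RG \
--query "key1" -o tsv)

# Export the values that the Python tasks read from the environment
export SEARCH_SERVICE SEARCH_KEY STORAGE_CONN AI_KEY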

Task 2: Create the search index

import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
)

# Configuration: the shell variables from Task 1 must be exported before starting
# Python (for example, `export SEARCH_SERVICE SEARCH_KEY STORAGE_CONN AI_KEY`).
SEARCH_SERVICE = os.environ["SEARCH_SERVICE"]
SEARCH_KEY = os.environ["SEARCH_KEY"]
STORAGE_CONN = os.environ["STORAGE_CONN"]  # used in Task 3
AI_KEY = os.environ["AI_KEY"]  # used in Task 4

endpoint = f"https://{SEARCH_SERVICE}.search.windows.net"
credential = AzureKeyCredential(SEARCH_KEY)

index_client = SearchIndexClient(endpoint=endpoint, credential=credential)

# Define the index schema
fields = [
    # Document key; populated from the Base64-encoded blob path
    SimpleField(name="id", type=SearchFieldDataType.String, key=True, filterable=True),
    # Full text extracted from the blob
    SearchableField(name="content", type=SearchFieldDataType.String, analyzer_name="en.microsoft"),
    SearchableField(name="metadata_storage_name", type=SearchFieldDataType.String, filterable=True, sortable=True),
    SimpleField(name="metadata_storage_path", type=SearchFieldDataType.String, filterable=True),
    # Fields filled by the skillset through output field mappings
    SearchableField(name="keyphrases", type=SearchFieldDataType.Collection(SearchFieldDataType.String), filterable=True, facetable=True),
    SearchableField(name="organizations", type=SearchFieldDataType.Collection(SearchFieldDataType.String), filterable=True, facetable=True),
    SimpleField(name="language", type=SearchFieldDataType.String, filterable=True, facetable=True),
]

index = SearchIndex(name="documents-index", fields=fields)
result = index_client.create_or_update_index(index)
print(f"Index '{result.name}' created successfully")
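For orientation, the SDK call above ultimately sends a JSON index definition to the service. The sketch below builds a dict in roughly that REST shape for a subset of the fields, so the attribute flags are visible side by side. The `field` helper is hypothetical, and setting every attribute explicitly is a choice made here to avoid relying on service defaults.

```python
import json


def field(name, edm_type="Edm.String", **attrs):
    """Hypothetical helper: one field definition with every attribute set explicitly."""
    definition = {
        "name": name,
        "type": edm_type,
        "key": False,
        "searchable": False,
        "filterable": False,
        "sortable": False,
        "facetable": False,
    }
    definition.update(attrs)
    return definition


index_definition = {
    "name": "documents-index",
    "fields": [
        field("id", key=True, filterable=True),
        field("content", searchable=True, analyzer="en.microsoft"),
        field("keyphrases", "Collection(Edm.String)",
              searchable=True, filterable=True, facetable=True),
        field("language", filterable=True, facetable=True),
    ],
}

# Exactly one field may be the document key.
key_fields = [f["name"] for f in index_definition["fields"] if f["key"]]
print(json.dumps(index_definition, indent=2))
```

Printing the definition this way can make it easier to compare against what the portal shows for the index after creation.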

Task 3: Create the data source connection

from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import SearchIndexerDataSourceConnection, SearchIndexerDataContainer

indexer_client = SearchIndexerClient(endpoint=endpoint, credential=credential)

data_source = SearchIndexerDataSourceConnection(
    name="blob-datasource",
    type="azureblob",
    connection_string=STORAGE_CONN,
    container=SearchIndexerDataContainer(name="documents"),
)

result = indexer_client.create_or_update_data_source_connection(data_source)
print(f"Data source '{result.name}' created")
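Because a malformed connection string only surfaces later as an indexer failure, a quick local sanity check can save a round trip. The sketch below assumes the account-key form of connection string (semicolon-separated `Key=Value` pairs such as `AccountName` and `AccountKey`); other forms, such as SAS or managed-identity strings, are not covered. Both function names and the sample values are hypothetical.

```python
def parse_connection_string(conn: str) -> dict:
    """Split 'Key=Value;Key=Value' pairs; only the first '=' separates key from value."""
    parts = {}
    for segment in conn.split(";"):
        if not segment:
            continue
        key, _, value = segment.partition("=")
        parts[key] = value
    return parts


def missing_keys(conn: str, required=("AccountName", "AccountKey")) -> list:
    """Return the required keys that are absent or empty."""
    parsed = parse_connection_string(conn)
    return [key for key in required if not parsed.get(key)]


# Made-up sample values; the key is not a real secret.
example = (
    "DefaultEndpointsProtocol=https;AccountName=stexample;"
    "AccountKey=ZmFrZQ==;EndpointSuffix=core.windows.net"
)
print(missing_keys(example))          # nothing missing
print(missing_keys("AccountName=x"))  # AccountKey is missing
```

In the lab you could call `missing_keys(STORAGE_CONN)` before creating the data source and stop early if the list is not empty.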

Task 4: Create a skillset with built-in skills

from azure.search.documents.indexes.models import (
    SearchIndexerSkillset,
    EntityRecognitionSkill,
    EntityRecognitionSkillVersion,
    KeyPhraseExtractionSkill,
    LanguageDetectionSkill,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    CognitiveServicesAccountKey,
)

# Define built-in skills. Each output is written to <context>/<target_name>
# in the enrichment tree, for example /document/keyphrases.
key_phrase_skill = KeyPhraseExtractionSkill(
    name="keyphrases-skill",
    description="Extract key phrases from content",
    context="/document",
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="keyPhrases", target_name="keyphrases")],
)

entity_skill = EntityRecognitionSkill(
    name="entity-skill",
    description="Recognize organizations",
    context="/document",
    categories=["Organization"],
    # Select the V3 skill explicitly; without this the SDK uses the legacy V1 skill,
    # which Microsoft has deprecated.
    skill_version=EntityRecognitionSkillVersion.V3,
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="organizations", target_name="organizations")],
)

language_skill = LanguageDetectionSkill(
    name="language-skill",
    description="Detect document language",
    context="/document",
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="languageCode", target_name="language")],
)

# Create skillset; the attached AI Services key is used to bill skill execution
skillset = SearchIndexerSkillset(
    name="document-skillset",
    description="Enrichment pipeline with key phrases, entities, and language",
    skills=[key_phrase_skill, entity_skill, language_skill],
    cognitive_services_account=CognitiveServicesAccountKey(key=AI_KEY),
)

result = indexer_client.create_or_update_skillset(skillset)
print(f"Skillset '{result.name}' created with {len(result.skills)} skills")
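Each skill writes its output to the enrichment tree at `<context>/<target_name>`, and those are the paths the indexer's output field mappings must reference. The sketch below derives the paths from plain-dict descriptions of the three skills above. The dict layout and the `enrichment_paths` helper are hypothetical and used only for illustration; they are not SDK types.

```python
def enrichment_paths(skills):
    """Return the enrichment-tree path that each skill output is written to."""
    paths = []
    for skill in skills:
        context = skill["context"].rstrip("/")
        for output in skill["outputs"]:
            paths.append(f"{context}/{output['target_name']}")
    return paths


# Plain-dict mirror of the three skills defined in this task.
skills = [
    {"context": "/document",
     "outputs": [{"name": "keyPhrases", "target_name": "keyphrases"}]},
    {"context": "/document",
     "outputs": [{"name": "organizations", "target_name": "organizations"}]},
    {"context": "/document",
     "outputs": [{"name": "languageCode", "target_name": "language"}]},
]

paths = enrichment_paths(skills)
print(paths)
# ['/document/keyphrases', '/document/organizations', '/document/language']
```

Note that the path uses the target name (`language`), not the skill's own output name (`languageCode`).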

Task 5: Create and run the indexer

from azure.search.documents.indexes.models import (
    SearchIndexer,
    FieldMapping,
    FieldMappingFunction,
)

indexer = SearchIndexer(
    name="document-indexer",
    data_source_name="blob-datasource",
    target_index_name="documents-index",
    skillset_name="document-skillset",
    field_mappings=[
        # Blob paths contain characters that are not valid in a document key,
        # so encode the path with base64Encode before using it as the id.
        FieldMapping(
            source_field_name="metadata_storage_path",
            target_field_name="id",
            mapping_function=FieldMappingFunction(name="base64Encode"),
        ),
        FieldMapping(source_field_name="metadata_storage_name", target_field_name="metadata_storage_name"),
    ],
    # Map enrichment-tree paths (<context>/<target_name>) to index fields
    output_field_mappings=[
        FieldMapping(source_field_name="/document/keyphrases", target_field_name="keyphrases"),
        FieldMapping(source_field_name="/document/organizations", target_field_name="organizations"),
        FieldMapping(source_field_name="/document/language", target_field_name="language"),
    ],
)

result = indexer_client.create_or_update_indexer(indexer)
print(f"Indexer '{result.name}' created")

# Run the indexer
indexer_client.run_indexer(indexer.name)
print("Indexer running...")

# Check status
import time
time.sleep(10)
status = indexer_client.get_indexer_status(indexer.name)
print(f"Status: {status.last_result.status if status.last_result else 'running'}")
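A fixed 10-second sleep can report too early when the container holds more than a few documents. One alternative is to poll until the last result leaves the in-progress state. The helper below is a minimal sketch: `wait_for_indexer` is a hypothetical name, it takes any callable that returns an object with a `last_result.status` attribute, and it assumes the status strings `inProgress` and `reset` mean "not finished". It is exercised here with fake status objects rather than a live service, and on a re-run the service may briefly report the previous run's result, which this sketch does not guard against.

```python
import time
from types import SimpleNamespace


def wait_for_indexer(get_status, timeout_s=300.0, interval_s=5.0, sleep=time.sleep):
    """Poll get_status() until last_result reports a finished run, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        last = get_status().last_result
        if last is not None and last.status not in ("inProgress", "reset"):
            return last
        if time.monotonic() >= deadline:
            raise TimeoutError("Indexer did not finish within the timeout")
        sleep(interval_s)


# Fake sequence of status responses standing in for the service.
responses = iter([
    SimpleNamespace(last_result=None),
    SimpleNamespace(last_result=SimpleNamespace(status="inProgress")),
    SimpleNamespace(last_result=SimpleNamespace(status="success")),
])
result = wait_for_indexer(lambda: next(responses), sleep=lambda seconds: None)
print(result.status)
```

Against the real service, the call would look like `wait_for_indexer(lambda: indexer_client.get_indexer_status("document-indexer"))`.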

Expected Output

After the indexer completes, querying the index should return enriched documents:

{
  "value": [
    {
      "id": "aHR0cHM6Ly9...",
      "content": "Azure AI services provide cloud-based AI capabilities...",
      "metadata_storage_name": "sample-doc.txt",
      "keyphrases": ["cloud-based AI capabilities", "cognitive services", "Azure AI services"],
      "organizations": ["Microsoft"],
      "language": "en"
    }
  ]
}
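The `id` value in this output is the blob URL after Base64 encoding, which is why it begins with `aHR0cHM6Ly9` (the encoding of `https://` followed by a letter). The sketch below reproduces the URL-token variant of Base64 that, as far as I understand, the `base64Encode` mapping function applies by default: URL-safe alphabet, padding removed, and a trailing digit giving the number of padding characters that were removed. The function names and the storage account name are hypothetical.

```python
import base64


def url_token_encode(text: str) -> str:
    """URL-safe Base64 with '=' padding replaced by a trailing pad-count digit."""
    if not text:
        return ""
    raw = base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")
    stripped = raw.rstrip("=")
    return stripped + str(len(raw) - len(stripped))


def url_token_decode(token: str) -> str:
    """Inverse of url_token_encode: restore the padding, then decode."""
    if not token:
        return ""
    pad = int(token[-1])
    return base64.urlsafe_b64decode(token[:-1] + "=" * pad).decode("utf-8")


# Hypothetical blob URL; the account name is made up.
blob_url = "https://example.blob.core.windows.net/documents/sample-doc.txt"
doc_key = url_token_encode(blob_url)
print(doc_key)
print(url_token_decode(doc_key))
```

Decoding a key this way is handy when you need to trace an indexed document back to its blob.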

Break & fix

# | Scenario | Symptom | Root Cause | Fix
1 | Indexer fails with "Could not execute skill" | Indexer status shows transientFailure | AI Services key is invalid or the resource is in a different region than the search service | Ensure AI Services is in the same region; update the key in the skillset
2 | Enriched fields are null in the index | Documents index but keyphrases and organizations are empty | Output field mappings use incorrect source paths (e.g., missing /document/ prefix) | Fix outputFieldMappings source paths to match skillset output targetName with /document/ prefix
3 | Indexer cannot connect to Blob Storage | StorageException: Access denied | Storage connection string is invalid or container doesn't exist | Verify connection string and container name in data source definition
4 | Index creation fails with "analyzer not found" | HTTP 400 on index creation | Analyzer name misspelled (e.g., en.Microsoft instead of en.microsoft) | Use the correct analyzer name; names are case-sensitive
5 | Duplicate documents in index after re-run | Document count doubles on each run | Missing or incorrect document key mapping; metadata_storage_path needs Base64 encoding | Use metadata_storage_path with base64Encode mapping function as the key
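When working through these scenarios, the indexer's last execution result is usually the first place to look. The sketch below flattens such a result into printable lines. `summarize_result` is a hypothetical helper, and the attribute names (`item_count`, `failed_item_count`, `errors[].error_message`, `warnings[].message`) reflect my understanding of the Python SDK's result models, so treat them as assumptions. The sample values are made up and the helper is exercised with a fake object rather than a live service.

```python
from types import SimpleNamespace


def summarize_result(last_result) -> list:
    """Flatten an indexer execution result into human-readable lines."""
    lines = [
        f"status={last_result.status} "
        f"items={last_result.item_count} "
        f"failed={last_result.failed_item_count}"
    ]
    for err in last_result.errors or []:
        lines.append(f"ERROR [{err.name}] key={err.key}: {err.error_message}")
    for warn in last_result.warnings or []:
        lines.append(f"WARNING [{warn.name}] key={warn.key}: {warn.message}")
    return lines


# Fake result with made-up values, shaped like scenario 1 in the table.
fake_result = SimpleNamespace(
    status="transientFailure",
    item_count=1,
    failed_item_count=1,
    errors=[SimpleNamespace(name="keyphrases-skill", key="doc-1",
                            error_message="Could not execute skill")],
    warnings=[],
)
lines = summarize_result(fake_result)
print("\n".join(lines))
```

With a live client, the input would be `indexer_client.get_indexer_status("document-indexer").last_result`, after checking that it is not `None`.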

Knowledge Check

1. You need to enrich documents with key phrases and entity recognition during indexing. Which component of Azure AI Search orchestrates this enrichment?

2. You define a KeyPhraseExtractionSkill in your skillset. The skill output is 'keyPhrases' with targetName 'keyphrases'. What path do you use in outputFieldMappings to map this to the index?

3. Which built-in skill would you use to extract text from scanned PDF documents containing images?

4. You create an index field with attributes: searchable=true, filterable=true, facetable=true. Which field type is this configuration INVALID for?

5. Your indexer needs an Azure AI Services resource to run built-in cognitive skills. What happens if you don't attach one?

Cleanup

az group delete --name rg-ai102-search --yes --no-wait

Learn More