Challenge 46: Custom Document Intelligence Models

Estimated Time

60-90 min | Cost: ~$2.00 (Document Intelligence S0 + custom model training) | Domain: Knowledge Mining & Extraction (15-20%)

Exam skills covered

Skill | Weight
Train a custom extraction model | High
Label training data for custom models | High
Evaluate custom model accuracy | Medium
Create a composed model from multiple custom models | High
Use custom models for inference | Medium

Overview

When prebuilt models don't match your document formats, custom models let you train extraction on YOUR specific documents. Azure Document Intelligence supports two custom model approaches:

Model type | Training approach | When to use
Custom template | Fixed layout, labeled fields | Forms with consistent structure (same layout every time)
Custom neural | Variable layout, machine learning | Documents with varied layouts (different vendor invoice formats)

Composed models

A composed model routes incoming documents to the correct sub-model automatically. For example, you might compose:

  • Invoice Model A (for Vendor X layout)
  • Invoice Model B (for Vendor Y layout)
  • Invoice Model C (for Vendor Z layout)

The composed model classifies each incoming document and routes it to the appropriate sub-model.
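The routing idea can be illustrated with a toy sketch. This is not how the service classifies documents; the keyword-based `classify` function and the `SUB_MODELS` mapping are hypothetical stand-ins that only show the classify-then-dispatch shape.

```python
# Hypothetical mapping from a document type to the sub-model that handles it.
SUB_MODELS = {
    "vendor-x": "invoice-model-a",
    "vendor-y": "invoice-model-b",
    "vendor-z": "invoice-model-c",
}


def classify(text: str) -> str:
    """Toy classifier: pick the document type whose vendor name appears in the text."""
    lowered = text.lower()
    for doc_type in SUB_MODELS:
        vendor_name = doc_type.replace("-", " ")  # "vendor-x" -> "vendor x"
        if vendor_name in lowered:
            return doc_type
    raise ValueError("no sub-model matches this document")


def route(text: str) -> str:
    """Return the ID of the sub-model that should extract fields from the text."""
    return SUB_MODELS[classify(text)]


print(route("Invoice from Vendor Y, total 120.00"))  # invoice-model-b
```

In the real service the classification step is learned from training data rather than keyword matching, which is why very similar layouts across sub-models can cause misrouting.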

Training workflow

  1. Collect 5+ sample documents (minimum 5 for template, 10+ for neural)
  2. Label fields in Document Intelligence Studio
  3. Train the model
  4. Test with new documents
  5. Deploy or compose with other models
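Before step 3, it can help to check that the training set meets the minimums listed in step 1. The sketch below is a hypothetical local helper, not part of any SDK; the thresholds are simply the numbers stated in this challenge.

```python
# Minimum labeled-document counts as stated in the workflow above.
MIN_DOCS = {"template": 5, "neural": 10}


def check_training_set(build_mode: str, labeled_docs: int) -> str:
    """Report whether a training set is large enough for the chosen build mode."""
    if build_mode not in MIN_DOCS:
        raise ValueError(f"unknown build mode: {build_mode}")
    shortfall = MIN_DOCS[build_mode] - labeled_docs
    if shortfall > 0:
        return f"add {shortfall} more labeled document(s) before training"
    return "ready to train"


print(check_training_set("template", 6))  # ready to train
print(check_training_set("neural", 6))    # add 4 more labeled document(s) before training
```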

Prerequisites

  • Completed Challenge 45 (Document Intelligence resource)
  • Azure Storage Account with sample training documents
  • Document Intelligence Studio access
  • Python 3.9+ with azure-ai-documentintelligence>=1.0.0
  • .NET 8 with Azure.AI.DocumentIntelligence

Implementation

Task 1: Prepare training data in Azure Storage

RG="rg-ai102-docintell"
STORAGE_ACCOUNT="stai102doctrain$(openssl rand -hex 4)"
CONTAINER="training-data"

# Create storage account for training data
az storage account create \
  --name "$STORAGE_ACCOUNT" \
  --resource-group "$RG" \
  --location eastus \
  --sku Standard_LRS

# Create container
az storage container create \
  --name "$CONTAINER" \
  --account-name "$STORAGE_ACCOUNT" \
  --auth-mode login

# Enable CORS for Document Intelligence Studio
az storage cors add \
  --services b \
  --methods GET PUT OPTIONS POST \
  --origins "https://documentintelligence.ai.azure.com" \
  --allowed-headers "*" \
  --exposed-headers "*" \
  --max-age 200 \
  --account-name "$STORAGE_ACCOUNT"

# Upload sample training documents (at least 5)
# These generated .txt files are placeholders that only illustrate the upload step.
# Document Intelligence analyzes PDFs and images (JPEG, PNG, BMP, TIFF, HEIF), so for
# real training upload your actual business documents in one of those formats.
for i in {1..6}; do
  echo "PurchaseOrder #PO-${i}001
Vendor: Contoso Supplies Inc.
Date: 2024-0${i}-15
Item: Widget Model ${i}
Quantity: ${i}0
Unit Price: \$${i}5.00
Total: \$${i}50.00
Ship To: 123 Main St, Seattle WA 98101" > "po-sample-${i}.txt"

  az storage blob upload \
    --account-name "$STORAGE_ACCOUNT" \
    --container-name "$CONTAINER" \
    --name "po-sample-${i}.txt" \
    --file "po-sample-${i}.txt" \
    --auth-mode login
done

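# Get a SAS token for Document Intelligence Studio
# Despite its name, SAS_URL holds only the token (the query string);
# the full container URL is assembled in the echo below.
# GNU date syntax is used here; on macOS/BSD use: date -u -v+1d '+%Y-%m-%dT%H:%MZ'
EXPIRY=$(date -u -d "1 day" '+%Y-%m-%dT%H:%MZ')
SAS_URL=$(az storage container generate-sas \
  --account-name "$STORAGE_ACCOUNT" \
  --name "$CONTAINER" \
  --permissions rl \
  --expiry "$EXPIRY" \
  --https-only \
  -o tsv)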

echo "Training data URL: https://${STORAGE_ACCOUNT}.blob.core.windows.net/${CONTAINER}?${SAS_URL}"
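The SAS token above is valid for one day, so a build attempted later fails with a 403. A minimal sketch for checking the expiry locally before starting a build follows, assuming the URL carries the standard `se` (signed expiry) query parameter. The helper names and the example URL are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs, urlparse


def sas_expiry(url: str) -> datetime:
    """Extract the 'se' (signed expiry) timestamp from a SAS URL as a UTC datetime."""
    query = parse_qs(urlparse(url).query)
    if "se" not in query:
        raise ValueError("URL has no 'se' (signed expiry) parameter")
    raw = query["se"][0]
    # Accept the minute-precision format used above and the seconds-precision variant.
    for fmt in ("%Y-%m-%dT%H:%MZ", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            return datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognized expiry format: {raw}")


def sas_is_expired(url: str, now: Optional[datetime] = None) -> bool:
    """Return True when the SAS expiry is at or before 'now' (defaults to the current UTC time)."""
    now = now or datetime.now(timezone.utc)
    return sas_expiry(url) <= now


# Placeholder URL for illustration only; the signature is not real.
example = "https://example.blob.core.windows.net/training-data?sp=rl&se=2024-03-16T10%3A30Z&sig=placeholder"
print(sas_expiry(example).isoformat())  # 2024-03-16T10:30:00+00:00
```

This only inspects the URL text; it does not confirm that the signature or permissions are valid.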

Task 2: Build and train a custom model

Studio vs API

Training with labeled data is most easily done in Document Intelligence Studio. The Studio provides a visual labeling interface. The API below shows programmatic model building for automation.

import os

from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceAdministrationClient
from azure.ai.documentintelligence.models import (
    BuildDocumentModelRequest,
    AzureBlobContentSource,
    DocumentBuildMode,
)

# Assumed environment variable names; export the values from Challenge 45 and
# Task 1 under these names, or adjust the names to match your setup.
DOC_ENDPOINT = os.environ["DOC_ENDPOINT"]
DOC_KEY = os.environ["DOC_KEY"]
STORAGE_ACCOUNT = os.environ["STORAGE_ACCOUNT"]
CONTAINER = os.environ.get("CONTAINER", "training-data")
SAS_URL = os.environ["SAS_URL"]  # the SAS token (query string) from Task 1

admin_client = DocumentIntelligenceAdministrationClient(
    endpoint=DOC_ENDPOINT,
    credential=AzureKeyCredential(DOC_KEY)
)

# Build custom model from labeled training data
# Note: Labels (.ocr.json and .labels.json) must exist in the container
# These are created by Document Intelligence Studio during labeling
poller = admin_client.begin_build_document_model(
    BuildDocumentModelRequest(
        model_id="purchase-order-model",
        description="Custom model for Contoso purchase orders",
        build_mode=DocumentBuildMode.TEMPLATE,
        azure_blob_source=AzureBlobContentSource(
            container_url=f"https://{STORAGE_ACCOUNT}.blob.core.windows.net/{CONTAINER}?{SAS_URL}"
        )
    )
)

model = poller.result()
print(f"Model ID: {model.model_id}")
# The status attribute may not be present on the returned model in every SDK version
print(f"Status: {getattr(model, 'status', 'N/A')}")
print(f"Created: {model.created_date_time}")
print(f"Doc types: {list(model.doc_types.keys())}")

# Show field schema
for doc_type, doc_type_info in model.doc_types.items():
    print(f"\nDocument type: {doc_type}")
    field_confidence = doc_type_info.field_confidence or {}
    for field_name, field_info in doc_type_info.field_schema.items():
        print(f"  {field_name}: {field_info['type']} (confidence: {field_confidence.get(field_name, 'N/A')})")
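Because the build depends on the label files that Studio writes next to each document, it can be useful to check the container listing before training. The sketch below works on a plain list of blob names and assumes the sidecar files are named by appending `.ocr.json` and `.labels.json` to the document name; `missing_label_files` is a hypothetical helper, not an SDK call.

```python
from typing import Dict, Iterable, List

# Assumed naming: "<document name>.ocr.json" and "<document name>.labels.json".
SIDECAR_SUFFIXES = (".ocr.json", ".labels.json")


def missing_label_files(blob_names: Iterable[str]) -> Dict[str, List[str]]:
    """Map each document to the sidecar files it is missing (empty dict if none)."""
    names = set(blob_names)
    # Treat anything that is not a JSON file as a training document.
    documents = sorted(name for name in names if not name.endswith(".json"))
    missing: Dict[str, List[str]] = {}
    for doc in documents:
        absent = [doc + suffix for suffix in SIDECAR_SUFFIXES if doc + suffix not in names]
        if absent:
            missing[doc] = absent
    return missing


blobs = [
    "po-1.pdf", "po-1.pdf.ocr.json", "po-1.pdf.labels.json",
    "po-2.pdf", "po-2.pdf.ocr.json",
    "fields.json",
]
print(missing_label_files(blobs))  # {'po-2.pdf': ['po-2.pdf.labels.json']}
```

The blob names could come from a container listing; here they are hard-coded so the check can be run without any Azure access.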

Task 3: Test the custom model

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest

client = DocumentIntelligenceClient(
    endpoint=DOC_ENDPOINT,
    credential=AzureKeyCredential(DOC_KEY)
)

# Analyze a document with the custom model.
# po-sample-1 is reused here for brevity; for a meaningful test, point this at a
# document that was not part of the training set.
test_url = f"https://{STORAGE_ACCOUNT}.blob.core.windows.net/{CONTAINER}/po-sample-1.txt?{SAS_URL}"

poller = client.begin_analyze_document(
    "purchase-order-model",
    AnalyzeDocumentRequest(url_source=test_url)
)
result = poller.result()

for document in result.documents or []:
    print(f"Document type: {document.doc_type}")
    print(f"Confidence: {document.confidence:.2%}")
    for field_name, field in document.fields.items():
        # Confidence can be missing for some fields, so format it defensively
        confidence = f"{field.confidence:.2%}" if field.confidence is not None else "N/A"
        print(f"  {field_name}: {field.content} (confidence: {confidence})")
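When evaluating the model, the per-field confidences printed above are often compared against a threshold to decide which values need human review. A minimal sketch follows; the 0.8 threshold and the sample values are illustrative assumptions, not service recommendations.

```python
from typing import Dict, List, Optional


def fields_needing_review(
    confidences: Dict[str, Optional[float]], threshold: float = 0.8
) -> List[str]:
    """Return field names whose confidence is missing or below the threshold, sorted by name."""
    return sorted(
        name
        for name, confidence in confidences.items()
        if confidence is None or confidence < threshold
    )


# Illustrative values only; in practice build this dict from document.fields.
sample = {"PurchaseOrderNumber": 0.95, "VendorName": 0.92, "Total": 0.61, "ShipTo": None}
print(fields_needing_review(sample))  # ['ShipTo', 'Total']
```

With an analysis result in hand, the input could be built as `{name: field.confidence for name, field in document.fields.items()}`.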

Task 4: Create a composed model

from azure.ai.documentintelligence.models import (
    ComposeDocumentModelRequest,
    DocumentTypeDetails,
)

# Compose multiple custom models into one.
# With azure-ai-documentintelligence 1.0.0 (API version 2024-11-30), a composed
# model is defined by a custom classifier plus a mapping from each document type
# to the sub-model that extracts it; the older component_models list is not
# accepted by this version.
poller = admin_client.begin_compose_model(
    ComposeDocumentModelRequest(
        model_id="composed-documents-model",
        description="Composed model routing purchase orders and invoices",
        classifier_id="document-type-classifier",  # assume this custom classifier exists
        doc_types={
            # Keys are assumed names; they must match the classifier's document types
            "purchase-order": DocumentTypeDetails(model_id="purchase-order-model"),
            "invoice": DocumentTypeDetails(model_id="invoice-custom-model"),  # assume this exists
        },
    )
)

composed_model = poller.result()
print(f"Composed model ID: {composed_model.model_id}")
print(f"Component models: {len(composed_model.doc_types)} document types")
for doc_type in composed_model.doc_types:
    print(f"  - {doc_type}")

# Use the composed model: the document is classified, then routed to a sub-model
poller = client.begin_analyze_document(
    "composed-documents-model",
    AnalyzeDocumentRequest(url_source=test_url)
)
result = poller.result()
for document in result.documents or []:
    print(f"Classified as: {document.doc_type} (confidence: {document.confidence:.2%})")
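To judge whether the composed model routes correctly, one option is to run it over a small test set with known document types and tally the mismatches. The sketch below assumes you have already collected (expected, predicted) pairs, for example from `document.doc_type`; `routing_report` and the sample pairs are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def routing_report(pairs: Iterable[Tuple[str, str]]) -> Tuple[float, Counter]:
    """Return routing accuracy and a count of each (expected, predicted) mismatch."""
    pairs = list(pairs)
    correct = sum(1 for expected, predicted in pairs if expected == predicted)
    confusions = Counter((e, p) for e, p in pairs if e != p)
    accuracy = correct / len(pairs) if pairs else 0.0
    return accuracy, confusions


# Illustrative results only, not output from a real model.
results = [
    ("purchase-order", "purchase-order"),
    ("invoice", "invoice"),
    ("invoice", "purchase-order"),
    ("purchase-order", "purchase-order"),
]
accuracy, confusions = routing_report(results)
print(f"{accuracy:.0%}")          # 75%
print(confusions.most_common(1))  # [(('invoice', 'purchase-order'), 1)]
```

A recurring mismatch between two types suggests their training layouts are too similar.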

Task 5: Model management — list, get, delete

# List all models
models = admin_client.list_models()
for model in models:
    # The status attribute may not be present on listed models in every SDK version
    print(f"  {model.model_id} | Created: {model.created_date_time} | Status: {getattr(model, 'status', 'N/A')}")

# Get model details
model_info = admin_client.get_model("purchase-order-model")
print(f"\nModel: {model_info.model_id}")
print(f"  Description: {model_info.description}")
print(f"  Build mode: {model_info.build_mode}")
print(f"  Training documents: {getattr(model_info, 'training_documents_count', 'N/A')}")

# Delete a model
admin_client.delete_model("purchase-order-model")
print("Model deleted")
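Deletion is permanent, so it can help to select candidates deliberately rather than deleting by hand. The sketch below picks custom models older than a cutoff from (model ID, creation time) pairs. It assumes the listing may also include prebuilt models whose IDs start with `prebuilt-`; `stale_custom_models` is a hypothetical helper and the inventory is made up.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Tuple


def stale_custom_models(
    models: Iterable[Tuple[str, datetime]],
    max_age_days: int,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return IDs of non-prebuilt models created more than max_age_days before 'now'."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        model_id
        for model_id, created in models
        if not model_id.startswith("prebuilt-") and created < cutoff
    ]


# Illustrative inventory; in practice build it from model_id and created_date_time.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
inventory = [
    ("purchase-order-model", datetime(2024, 3, 15, tzinfo=timezone.utc)),
    ("invoice-custom-model", datetime(2024, 5, 20, tzinfo=timezone.utc)),
    ("prebuilt-invoice", datetime(2023, 1, 1, tzinfo=timezone.utc)),
]
print(stale_custom_models(inventory, max_age_days=30, now=now))  # ['purchase-order-model']
```

Review the returned IDs before passing any of them to `delete_model`, since a composed model may still depend on one of them.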

Expected Output

Model ID: purchase-order-model
Status: ready
Created: 2024-03-15T10:30:00Z
Doc types: ['purchase-order-model']

Document type: purchase-order-model
PurchaseOrderNumber: string (confidence: 0.95)
VendorName: string (confidence: 0.92)
OrderDate: date (confidence: 0.90)
ItemDescription: string (confidence: 0.88)
Quantity: number (confidence: 0.91)
UnitPrice: number (confidence: 0.89)
Total: number (confidence: 0.93)

Break & fix

# | Scenario | Symptom | Root Cause | Fix
1 | Training fails with "Not enough documents" | Build operation fails | Template models need 5+ labeled documents; neural needs 10+ | Add more training samples with consistent labeling
2 | Model classification wrong in composed model | Document routed to wrong sub-model | Training data between sub-models is too similar or labels overlap | Ensure distinct document layouts; add more diverse training samples
3 | Custom model returns no fields | Analyze succeeds but fields is empty | Test document layout differs significantly from training data | Use neural build mode for variable layouts, or add similar document layouts to training set
4 | CORS error in Document Intelligence Studio | Cannot access training data from Studio | CORS not configured on storage account for Studio domain | Add CORS rule for https://documentintelligence.ai.azure.com
5 | SAS token expired | 403 error when building model | SAS URL used for container has expired | Generate a new SAS token with sufficient expiry time

Knowledge Check

1. You receive invoices from 5 different vendors, each with a completely different layout. Which custom model build mode is most appropriate?

2. What is the minimum number of training documents required for a custom template model?

3. You have separate custom models for purchase orders, invoices, and receipts. You want a single endpoint that auto-classifies and extracts. What should you create?

4. During custom model training, where do you perform field labeling?

5. What is the maximum number of custom models that can be composed into a single composed model?

Cleanup

az group delete --name rg-ai102-docintell --yes --no-wait
# Also remove local sample files
rm -f po-sample-*.txt

Learn More