Challenge 39: Custom Translation Models
60 min | Cost: $5-15 (estimated) | Domain: Implement NLP Solutions (15-20%)
Exam skills covered
- Implement custom text translation models
- Train and evaluate custom translation with parallel data
- Publish and consume custom translation models
- Implement multi-language question answering
Overview
Custom Translator trains domain-specific translation models on your own parallel data (aligned source-target sentence pairs), which improves translation accuracy for specialized terminology. Key concepts:
| Concept | Description |
|---|---|
| Parallel data | Aligned sentence pairs in source and target languages |
| BLEU score | Translation quality metric (0-100, higher = better) |
| Category ID | Identifier used to route requests to your custom model |
| Baseline | Microsoft's general translation model (comparison point) |
| Training | Fine-tuning the baseline with your parallel data |
Multi-language Question Answering allows a single knowledge base to serve answers in multiple languages.
Some Custom Translator operations, such as project creation and file upload, are done mainly in the Custom Translator portal. This challenge documents that portal workflow and shows how to consume trained models programmatically.
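To make the BLEU row in the concept table more concrete, here is a minimal sentence-level sketch of the idea behind the metric: clipped n-gram precision combined with a brevity penalty, scaled to 0-100. It is a simplification for illustration only and is not the scorer Custom Translator runs. The function names, whitespace tokenization, and lack of smoothing are all assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Illustrative sentence-level BLEU on a 0-100 scale (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)


reference = "El paciente presenta bronquitis aguda ."
print(simple_bleu(reference, reference))
print(simple_bleu("El paciente presenta bronquitis", reference))
```

An exact match scores 100. The truncated candidate matches every n-gram it contains but is penalized for being shorter than the reference.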
Prerequisites
- Azure subscription
- Azure Translator resource (S1 tier for custom translation)
- Parallel training data (TMX, XLIFF, TSV, or TXT files)
- Custom Translator portal access
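Before starting, it can help to confirm that the environment variables read by the later tasks are set. The sketch below uses the variable names that appear in this challenge; the helper `missing_settings` and the split into required and optional names are assumptions made for illustration.

```python
import os

# Variable names are the ones read by the code later in this challenge.
REQUIRED_SETTINGS = [
    "AZURE_TRANSLATOR_KEY",
    "AZURE_TRANSLATOR_REGION",
    "AZURE_AI_ENDPOINT",
    "AZURE_AI_KEY",
]
# Optional: the tasks fall back to a default when this one is missing.
OPTIONAL_SETTINGS = ["CUSTOM_TRANSLATOR_CATEGORY_ID"]


def missing_settings(env, names=REQUIRED_SETTINGS):
    """Return the names that are unset or blank in the given mapping."""
    return [name for name in names if not env.get(name, "").strip()]


missing = missing_settings(os.environ)
if missing:
    print("Missing required settings:", ", ".join(missing))
else:
    print("All required settings are present.")

for name in missing_settings(os.environ, OPTIONAL_SETTINGS):
    print(f"Optional setting not set: {name}")
```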
Implementation
Task 1: Prepare Parallel Training Data
- Python SDK
import os
# Custom translation requires parallel data - aligned sentences
# Format: Tab-separated source and target (or separate aligned files)
# Example: Medical domain English-to-Spanish parallel data
training_data_tsv = """The patient presents with acute bronchitis.\tEl paciente presenta bronquitis aguda.
Administer 500mg amoxicillin three times daily.\tAdministrar 500mg de amoxicilina tres veces al día.
Blood pressure reading is 120 over 80.\tLa lectura de presión arterial es 120 sobre 80.
The MRI shows no abnormalities.\tLa resonancia magnética no muestra anomalías.
Schedule a follow-up appointment in two weeks.\tProgramar una cita de seguimiento en dos semanas.
Patient reports chest pain and shortness of breath.\tEl paciente reporta dolor en el pecho y dificultad para respirar.
Prescribe ibuprofen 400mg as needed for pain.\tRecetar ibuprofeno 400mg según sea necesario para el dolor.
The biopsy results are benign.\tLos resultados de la biopsia son benignos.
Apply topical antibiotic ointment twice daily.\tAplicar ungüento antibiótico tópico dos veces al día.
Refer patient to cardiology for further evaluation.\tReferir al paciente a cardiología para evaluación adicional."""
# Save training file
with open("medical-training-en-es.tsv", "w", encoding="utf-8") as f:
    f.write(training_data_tsv)
# Tuning data (separate set for validation)
tuning_data = """Patient exhibits symptoms of type 2 diabetes.\tEl paciente exhibe síntomas de diabetes tipo 2.
Recommend physical therapy twice a week.\tRecomendar fisioterapia dos veces por semana.
Lab results indicate elevated cholesterol.\tLos resultados del laboratorio indican colesterol elevado."""
with open("medical-tuning-en-es.tsv", "w", encoding="utf-8") as f:
    f.write(tuning_data)
# Testing data (for BLEU evaluation)
test_data = """Administer insulin injection before meals.\tAdministrar inyección de insulina antes de las comidas.
The X-ray reveals a hairline fracture.\tLa radiografía revela una fractura capilar."""
with open("medical-test-en-es.tsv", "w", encoding="utf-8") as f:
    f.write(test_data)
print("Training data files created:")
print(f" medical-training-en-es.tsv ({training_data_tsv.count(chr(10))+1} sentence pairs)")
print(f" medical-tuning-en-es.tsv ({tuning_data.count(chr(10))+1} sentence pairs)")
print(f" medical-test-en-es.tsv ({test_data.count(chr(10))+1} sentence pairs)")
print("\nNote: Production models need 10,000+ sentence pairs for significant improvement.")
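Alignment problems are easier to catch before upload than after a failed training run. The following is a minimal sketch of a pre-upload check for the tab-separated format used above: each non-blank line must contain exactly one tab with non-empty text on both sides. The function name `validate_parallel_tsv` is hypothetical, and the check only covers structure, not translation quality.

```python
def validate_parallel_tsv(text):
    """Return (pairs, problems) for tab-separated source/target lines."""
    pairs, problems = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(f"line {line_no}: expected 2 fields, found {len(fields)}")
        elif not fields[0].strip() or not fields[1].strip():
            problems.append(f"line {line_no}: empty source or target")
        else:
            pairs.append((fields[0].strip(), fields[1].strip()))
    return pairs, problems


# Sample with one good pair, one line missing its target, one missing its source.
sample = (
    "The biopsy results are benign.\tLos resultados de la biopsia son benignos.\n"
    "The MRI shows no abnormalities.\n"
    "\tLa radiografía revela una fractura capilar.\n"
)
pairs, problems = validate_parallel_tsv(sample)
print(len(pairs), "valid pair(s)")
for problem in problems:
    print(problem)
```

The same function can be applied to `training_data_tsv`, `tuning_data`, and `test_data` before writing them to disk.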
Task 2: Custom Translator Workflow (Portal + API)
- Python SDK
import os
# Custom Translator portal workflow:
# 1. Create a workspace at https://portal.customtranslator.azure.ai
# 2. Create a project (specify language pair: en → es)
# 3. Upload parallel documents (training, tuning, testing)
# 4. Train the model
# 5. Publish the model (get Category ID)
# After training in the portal, you'll receive a Category ID
# Use this to route translation requests to your custom model
CATEGORY_ID = os.environ.get("CUSTOM_TRANSLATOR_CATEGORY_ID", "your-category-id")
# The BLEU score comparison after training:
print(f"""
Custom Translation Training Results (example):
================================================
Model: Medical-EN-ES-v1
Language pair: English → Spanish
Training sentences: 10,000
BLEU Score (baseline): 42.5
BLEU Score (custom): 58.3 (+15.8 improvement)
Status: Published
Category ID: {CATEGORY_ID}
Interpretation:
- BLEU < 30: Low quality (general model may be better for this pair)
- BLEU 30-40: Reasonable quality
- BLEU 40-60: Good quality
- BLEU > 60: Excellent quality
""")
Task 3: Consume Custom Translation Model
- Python SDK
- REST API
import os
import uuid

import requests
key = os.environ["AZURE_TRANSLATOR_KEY"]
region = os.environ["AZURE_TRANSLATOR_REGION"]
endpoint = "https://api.cognitive.microsofttranslator.com"
category_id = os.environ.get("CUSTOM_TRANSLATOR_CATEGORY_ID", "general")
def translate_with_custom_model(texts, source_lang, target_lang, category=None):
    """Translate texts, optionally routing to a custom model via its Category ID."""
    path = "/translate"
    params = {
        "api-version": "3.0",
        "from": source_lang,
        "to": target_lang,
    }
    if category:
        params["category"] = category  # Routes to custom model
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Ocp-Apim-Subscription-Region": region,
        "Content-Type": "application/json",
        "X-ClientTraceId": str(uuid.uuid4()),
    }
    body = [{"text": t} for t in texts]
    response = requests.post(
        endpoint + path, params=params, headers=headers, json=body, timeout=30
    )
    response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error body
    return response.json()
# Test sentences with medical terminology
medical_texts = [
    "The patient presents with acute myocardial infarction.",
    "Administer epinephrine 0.3mg intramuscularly immediately.",
    "Schedule an echocardiogram to assess ventricular function."
]
# Compare general vs custom model
print("=== General Model (baseline) ===")
general_results = translate_with_custom_model(medical_texts, "en", "es", category="general")
for i, result in enumerate(general_results):
    print(f" EN: {medical_texts[i]}")
    print(f" ES: {result['translations'][0]['text']}\n")
print("=== Custom Model (medical domain) ===")
custom_results = translate_with_custom_model(medical_texts, "en", "es", category=category_id)
for i, result in enumerate(custom_results):
    print(f" EN: {medical_texts[i]}")
    print(f" ES: {result['translations'][0]['text']}\n")
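To see at a glance which sentences the custom model actually changed, the two result lists can be compared side by side. This is a sketch under stated assumptions: the helper names are hypothetical, and the sample responses are hand-written stand-ins shaped like the `/translate` response (a list of items, each with a `translations` list), not real service output.

```python
def extract_translations(response_json):
    """Pull the first translated text out of each item in a /translate response."""
    return [item["translations"][0]["text"] for item in response_json]


def compare_models(sources, general_json, custom_json):
    """Return (source, general, custom) tuples where the two outputs differ."""
    general = extract_translations(general_json)
    custom = extract_translations(custom_json)
    return [
        (src, gen, cus)
        for src, gen, cus in zip(sources, general, custom)
        if gen != cus
    ]


# Hand-written stand-ins shaped like the API response; not real service output.
sources = ["The patient presents with acute myocardial infarction."]
general_json = [{"translations": [
    {"text": "El paciente se presenta con infarto agudo de miocardio.", "to": "es"}]}]
custom_json = [{"translations": [
    {"text": "El paciente presenta infarto agudo al miocardio.", "to": "es"}]}]

differences = compare_models(sources, general_json, custom_json)
if not differences:
    print("No differences: check that the category parameter matches the published Category ID.")
for src, gen, cus in differences:
    print(f"EN:      {src}")
    print(f"General: {gen}")
    print(f"Custom:  {cus}")
```

If every sentence comes back identical, the request is probably still being served by the general model.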
# REST API equivalent (bash; requires curl and jq)
TRANSLATOR_KEY="<key>"
REGION="eastus2"
CATEGORY_ID="<your-category-id>"
# Translate with custom model (add category parameter)
curl -s "https://api.cognitive.microsofttranslator.com/translate?api-version=3.0&from=en&to=es&category=${CATEGORY_ID}" \
  -H "Ocp-Apim-Subscription-Key: ${TRANSLATOR_KEY}" \
  -H "Ocp-Apim-Subscription-Region: ${REGION}" \
  -H "Content-Type: application/json" \
  -d '[{"text": "The patient presents with acute myocardial infarction."}]' \
  | jq '.[0].translations[0].text'
# Compare with general model (category=general or omit)
curl -s "https://api.cognitive.microsofttranslator.com/translate?api-version=3.0&from=en&to=es&category=general" \
  -H "Ocp-Apim-Subscription-Key: ${TRANSLATOR_KEY}" \
  -H "Ocp-Apim-Subscription-Region: ${REGION}" \
  -H "Content-Type: application/json" \
  -d '[{"text": "The patient presents with acute myocardial infarction."}]' \
  | jq '.[0].translations[0].text'
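When a translation call fails, the response body usually says more than the status code alone. The sketch below summarizes a failed response in one line. It assumes the error body is JSON with an `error` object holding `code` and `message` fields; treat that shape, the sample body, and the helper name `describe_translator_error` as assumptions to verify against a real failure.

```python
import json


def describe_translator_error(status_code, body_text):
    """Summarize a failed Translator response as a single line."""
    try:
        error = json.loads(body_text).get("error", {})
    except (ValueError, AttributeError):
        error = {}  # body was not JSON, or not a JSON object
    if not isinstance(error, dict):
        error = {}
    code = error.get("code", "unknown")
    message = error.get("message", body_text.strip() or "no response body")
    return f"HTTP {status_code}, error code {code}: {message}"


# Illustrative body only; the assumed shape is {"error": {"code": ..., "message": ...}}.
sample_body = '{"error": {"code": 400000, "message": "One of the request inputs is not valid."}}'
print(describe_translator_error(400, sample_body))
print(describe_translator_error(500, ""))
```

With `requests`, this can be called from an `except requests.HTTPError as exc:` block, passing `exc.response.status_code` and `exc.response.text`.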
Task 4: Multi-Language Question Answering
- Python SDK
- REST API
import os

from azure.ai.language.questionanswering import QuestionAnsweringClient
from azure.core.credentials import AzureKeyCredential
# Multi-language QA: one knowledge base serving multiple languages
# The project must be created with multilingualResource=True
qa_client = QuestionAnsweringClient(
    endpoint=os.environ["AZURE_AI_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_AI_KEY"])
)
# Query in different languages against the same knowledge base
multilingual_queries = [
    ("What is Azure AI?", "en"),
    ("¿Qué es Azure AI?", "es"),
    ("Azure AIとは何ですか?", "ja"),
    ("Qu'est-ce qu'Azure AI?", "fr")
]
print("=== Multi-Language QA ===")
for question, lang in multilingual_queries:
    response = qa_client.get_answers(
        question=question,
        project_name="faq-knowledge-base",
        deployment_name="production",
        language=lang
    )
    if response.answers:
        top_answer = response.answers[0]
        print(f"\n[{lang}] Q: {question}")
        print(f" A: {top_answer.answer[:80]}...")
        print(f" Confidence: {top_answer.confidence:.3f}")
# REST API equivalent (bash; requires curl and jq).
# Assumes ENDPOINT (without a trailing slash) and KEY are set for your Language resource.
# Query in Spanish against English knowledge base
curl -s "${ENDPOINT}/language/:query-knowledgebases?projectName=faq-knowledge-base&deploymentName=production&api-version=2023-04-01" \
  -H "Ocp-Apim-Subscription-Key: ${KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "¿Qué es Azure AI?",
    "top": 1,
    "language": "es"
  }' | jq '.answers[0] | {answer: .answer[0:100], confidence: .confidenceScore}'
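Cross-language queries tend to return lower confidence scores, so it is worth deciding what to do when the best answer is weak. The sketch below works on a REST-style response dictionary, using the `answers`, `answer`, and `confidenceScore` fields referenced in the `jq` filter above. The helper name `pick_answer`, the 0.5 threshold, and the fallback text are assumptions, and the sample response is hand-written.

```python
def pick_answer(response_json, threshold=0.5, fallback="No confident answer found."):
    """Return (answer_text, confidence), using the fallback below the threshold."""
    answers = response_json.get("answers", [])
    if not answers:
        return fallback, 0.0
    # Choose the highest-scoring answer rather than relying on list order.
    best = max(answers, key=lambda a: a.get("confidenceScore", 0.0))
    score = best.get("confidenceScore", 0.0)
    if score < threshold:
        return fallback, score
    return best["answer"], score


# Hand-written stand-in shaped like the REST response; not real service output.
sample_response = {
    "answers": [
        {"answer": "Azure AI Services is a collection of cloud-based AI APIs...",
         "confidenceScore": 0.891},
        {"answer": "An unrelated low-confidence match.", "confidenceScore": 0.12},
    ]
}

text, score = pick_answer(sample_response)
print(f"A: {text}")
print(f"Confidence: {score:.3f}")
```

Raising the threshold trades fewer wrong answers for more fallbacks. A suitable value depends on the knowledge base and the languages queried.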
Expected Output
Training data files created:
medical-training-en-es.tsv (10 sentence pairs)
medical-tuning-en-es.tsv (3 sentence pairs)
medical-test-en-es.tsv (2 sentence pairs)
Custom Translation Training Results (example):
================================================
Model: Medical-EN-ES-v1
BLEU Score (baseline): 42.5
BLEU Score (custom): 58.3 (+15.8 improvement)
=== General Model (baseline) ===
EN: The patient presents with acute myocardial infarction.
ES: El paciente se presenta con infarto agudo de miocardio.
=== Custom Model (medical domain) ===
EN: The patient presents with acute myocardial infarction.
ES: El paciente presenta infarto agudo al miocardio.
=== Multi-Language QA ===
[en] Q: What is Azure AI?
A: Azure AI Services is a collection of cloud-based AI APIs that help developers...
Confidence: 0.953
[es] Q: ¿Qué es Azure AI?
A: Azure AI Services is a collection of cloud-based AI APIs...
Confidence: 0.891
[ja] Q: Azure AIとは何ですか?
A: Azure AI Services is a collection of cloud-based AI APIs...
Confidence: 0.845
Break & fix
| Scenario | Symptom | Root Cause | Fix |
|---|---|---|---|
| Custom model not used | General translations returned | Category ID not specified or incorrect | Verify category parameter matches published model's Category ID |
| Low BLEU score | No improvement over baseline | Insufficient training data or poor alignment | Provide 10,000+ aligned sentence pairs and verify alignment quality |
| Training fails | Upload rejected | File format incorrect | Use supported formats: TMX, XLIFF, TSV, aligned TXT |
| Category not found | 400 error on translation | Model not published or expired | Publish model in Custom Translator portal; check expiration |
| Multi-language QA poor | Low confidence cross-language | Project not configured as multilingual | Enable multilingualResource: true when creating project |
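For the "Low BLEU score" row, the score bands printed in Task 2 can be turned into a small helper that labels a score and reports the gain over the baseline. This is a sketch: the function names are hypothetical, and because the bands share their endpoints, the handling of exactly 40 and 60 (both treated as "Good quality") is an assumption.

```python
def interpret_bleu(score):
    """Map a BLEU score (0-100) to the quality bands from Task 2."""
    if not 0 <= score <= 100:
        raise ValueError("BLEU score must be between 0 and 100")
    if score < 30:
        return "Low quality"
    if score < 40:
        return "Reasonable quality"
    if score <= 60:
        return "Good quality"
    return "Excellent quality"


def bleu_improvement(baseline, custom):
    """Return the custom model's gain over the baseline, in BLEU points."""
    return round(custom - baseline, 1)


baseline, custom = 42.5, 58.3  # example figures from Task 2
print(f"Baseline: {baseline} ({interpret_bleu(baseline)})")
print(f"Custom:   {custom} ({interpret_bleu(custom)})")
print(f"Improvement: {bleu_improvement(baseline, custom):+.1f} BLEU points")
```

A gain near zero, or a negative one, points back to the root causes in the table: too little training data or poorly aligned sentence pairs.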
Knowledge Check
1. How do you route a translation request to your custom model?
2. What is the BLEU score used for in Custom Translator?
3. What type of training data does Custom Translator require?
4. How does multi-language Question Answering work?
5. What is the minimum recommended amount of parallel training data for meaningful improvement?
Cleanup
az group delete --name rg-ai102-translator --yes --no-wait