Challenge 36: Speech Translation
Estimated Time
45 min | Cost: $2-5 (estimated) | Domain: Implement NLP Solutions (15-20%)
Exam skills covered
- Translate speech-to-text in multiple languages
- Implement speech-to-speech translation
- Configure continuous translation sessions
Overview
Azure Speech Translation combines speech recognition and text translation in a single pipeline:
Audio Input → Speech Recognition → Translation → Text/Speech Output
Key differences from separate STT + Translator:
- Single API call — lower latency
- Streaming — real-time partial results
- Speech-to-speech — direct translated audio output
- Supports 70+ languages for speech-to-text translation
Classes: SpeechTranslationConfig, TranslationRecognizer
Prerequisites
- Azure subscription
- Azure Speech resource
- Python 3.9+ or .NET 8
- Package: azure-cognitiveservices-speech (v1.38+)
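The samples in this challenge read the key and region from two environment variables, so a quick check up front can save a confusing failure later. The following is a minimal sketch: the helper name `missing_speech_settings` is hypothetical, and only the variable names come from the code in this challenge.

```python
import os

# Variable names taken from the samples in this challenge.
REQUIRED_VARS = ("AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION")


def missing_speech_settings(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_speech_settings()
    if missing:
        print("Missing settings: " + ", ".join(missing))
    else:
        print("Speech settings found.")
```

Passing a plain dict instead of the real environment keeps the helper easy to exercise without touching actual credentials.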
Implementation
Task 1: Single-Shot Speech Translation
The Python SDK version comes first, followed by the C# SDK version.
import os
import azure.cognitiveservices.speech as speechsdk
# Configure translation
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)
# Set source language (speech input)
translation_config.speech_recognition_language = "en-US"
# Add target languages (text output)
translation_config.add_target_language("es")
translation_config.add_target_language("fr")
translation_config.add_target_language("de")
translation_config.add_target_language("ja")
# Configure audio input from file
audio_config = speechsdk.audio.AudioConfig(filename="english-speech.wav")
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config,
    audio_config=audio_config
)
# Single utterance translation
print("Translating speech...")
result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print(f"Recognized (en): {result.text}")
    print("\nTranslations:")
    for lang, translation in result.translations.items():
        print(f" [{lang}] {translation}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech recognized")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation = result.cancellation_details
    print(f"Canceled: {cancellation.reason} - {cancellation.error_details}")
// C# SDK version of the single-shot translation above
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio; // AudioConfig lives in this namespace
using Microsoft.CognitiveServices.Speech.Translation;

var translationConfig = SpeechTranslationConfig.FromSubscription(
    Environment.GetEnvironmentVariable("AZURE_SPEECH_KEY"),
    Environment.GetEnvironmentVariable("AZURE_SPEECH_REGION"));
translationConfig.SpeechRecognitionLanguage = "en-US";
translationConfig.AddTargetLanguage("es");
translationConfig.AddTargetLanguage("fr");
translationConfig.AddTargetLanguage("de");

using var audioConfig = AudioConfig.FromWavFileInput("english-speech.wav");
using var recognizer = new TranslationRecognizer(translationConfig, audioConfig);

var result = await recognizer.RecognizeOnceAsync();
if (result.Reason == ResultReason.TranslatedSpeech)
{
    Console.WriteLine($"Recognized: {result.Text}");
    foreach (var (lang, text) in result.Translations)
        Console.WriteLine($" [{lang}] {text}");
}
Task 2: Continuous Speech Translation
- Python SDK
import os
import threading
import azure.cognitiveservices.speech as speechsdk
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("es")
translation_config.add_target_language("fr")
audio_config = speechsdk.audio.AudioConfig(filename="conversation.wav")
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config,
    audio_config=audio_config
)
translations_log = []
done = threading.Event()
def recognizing_handler(evt):
    """Partial/interim results (streaming)"""
    print(f" [Partial] {evt.result.text}")

def recognized_handler(evt):
    """Final results"""
    if evt.result.reason == speechsdk.ResultReason.TranslatedSpeech:
        print(f"\n[Final] EN: {evt.result.text}")
        for lang, text in evt.result.translations.items():
            print(f" {lang.upper()}: {text}")
        translations_log.append({
            "source": evt.result.text,
            "translations": dict(evt.result.translations)
        })

def canceled_handler(evt):
    """Cancellation, which also covers reaching the end of the audio file"""
    details = evt.cancellation_details
    print(f"Canceled: {details.reason}")
    if details.reason == speechsdk.CancellationReason.Error:
        print(f"Error details: {details.error_details}")
    done.set()

def stopped_handler(evt):
    """Session ended; unblock the main thread"""
    done.set()
# Wire up events
recognizer.recognizing.connect(recognizing_handler)
recognizer.recognized.connect(recognized_handler)
recognizer.canceled.connect(canceled_handler)
recognizer.session_stopped.connect(stopped_handler)
# Start continuous translation
print("Starting continuous translation...\n")
recognizer.start_continuous_recognition()
done.wait()
recognizer.stop_continuous_recognition()
print(f"\n{'='*50}")
print(f"Translated {len(translations_log)} segments")
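Once the session ends, `translations_log` holds one dictionary per final segment, so it can be printed or saved for review. The sketch below assumes that shape; the helper names `format_segments` and `save_log` are hypothetical, and the sample sentences are taken from the expected output of this challenge.

```python
import json


def format_segments(log):
    """Render log entries as readable text, one block per segment."""
    lines = []
    for index, entry in enumerate(log, start=1):
        lines.append(f"{index}. {entry['source']}")
        for lang in sorted(entry["translations"]):
            lines.append(f"   [{lang}] {entry['translations'][lang]}")
    return "\n".join(lines)


def save_log(log, path):
    """Write the log as UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(log, f, ensure_ascii=False, indent=2)


# Sample data shaped like translations_log from the continuous session.
sample_log = [{
    "source": "The quarterly results exceeded expectations.",
    "translations": {
        "es": "Los resultados trimestrales superaron las expectativas.",
        "fr": "Les résultats trimestriels ont dépassé les attentes.",
    },
}]

print(format_segments(sample_log))
```

In the real script, calling `save_log(translations_log, "translations.json")` after `stop_continuous_recognition()` would persist the session; the file name here is only an example.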
Task 3: Speech-to-Speech Translation (with voice synthesis)
- Python SDK
# Configure speech-to-speech: translate and synthesize output
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("es")
# Set voice for synthesized translation output
translation_config.voice_name = "es-ES-ElviraNeural"
audio_config = speechsdk.audio.AudioConfig(filename="english-speech.wav")
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config,
    audio_config=audio_config
)
# Handle synthesized audio
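def synthesis_handler(evt):
    """Handle translated speech audio output"""
    # Audio bytes arrive while the reason is SynthesizingAudio; the closing
    # SynthesizingAudioCompleted event carries no audio
    audio_data = evt.result.audio
    if evt.result.reason == speechsdk.ResultReason.SynthesizingAudio and audio_data:
        print(f" Synthesized audio: {len(audio_data)} bytes")
        # Append to file (delete any old copy before re-running)
        with open("translated-speech-es.wav", "ab") as f:
            f.write(audio_data)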
recognizer.synthesizing.connect(synthesis_handler)
# Translate and synthesize
result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print(f"Source (EN): {result.text}")
    print(f"Target (ES): {result.translations['es']}")
    print("Audio output saved to: translated-speech-es.wav")
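Because the synthesis voice has to match the target language, it can help to pick the voice from a small lookup rather than hard-coding it. The sketch below is illustrative only: the helper `voice_for` and the `VOICES` table are hypothetical, and apart from the Spanish voice used above, the voice names are assumptions to verify against the voice list for your region.

```python
# Example voices keyed by target language. Only the Spanish entry appears in
# this challenge; treat the others as assumptions to verify before use.
VOICES = {
    "es": "es-ES-ElviraNeural",
    "fr": "fr-FR-DeniseNeural",
    "de": "de-DE-KatjaNeural",
    "ja": "ja-JP-NanamiNeural",
}


def voice_for(target_language, voices=VOICES):
    """Return a voice whose locale starts with the target language code."""
    primary = target_language.split("-")[0].lower()
    voice = voices.get(primary)
    if voice is None:
        raise ValueError(f"No voice configured for '{target_language}'")
    if not voice.lower().startswith(primary + "-"):
        raise ValueError(f"Voice '{voice}' does not match '{target_language}'")
    return voice


print(voice_for("es"))
```

With the configuration from this task, the assignment would then read `translation_config.voice_name = voice_for("es")`.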
Expected Output
Translating speech...
Recognized (en): The quarterly results exceeded expectations with a fifteen percent increase.
Translations:
[es] Los resultados trimestrales superaron las expectativas con un aumento del quince por ciento.
[fr] Les résultats trimestriels ont dépassé les attentes avec une augmentation de quinze pour cent.
[de] Die Quartalsergebnisse übertrafen die Erwartungen mit einem Anstieg von fünfzehn Prozent.
[ja] 四半期の結果は15パーセントの増加で期待を上回りました。
Starting continuous translation...
[Partial] The quarterly
[Partial] The quarterly results
[Final] EN: The quarterly results exceeded expectations.
ES: Los resultados trimestrales superaron las expectativas.
FR: Les résultats trimestriels ont dépassé les attentes.
Translated 3 segments
Break & fix
| Scenario | Symptom | Root Cause | Fix |
|---|---|---|---|
| No translations returned | Empty translations dict | Target language not added to config | Call add_target_language() before creating recognizer |
| Wrong source language | Garbled recognition | Source language mismatch | Set correct speech_recognition_language |
| Synthesis not working | No audio output | Voice name not set or mismatched language | Set voice_name matching target language |
| Partial results missing | No interim feedback | recognizing event not connected | Connect to recognizing event for streaming results |
| Language code error | Invalid language | Using wrong code format | Use BCP-47 codes: "es" not "spanish", "zh-Hans" not "zh" |
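The language code row in the table above can be caught early with a simple shape check before calling add_target_language(). This is a rough sketch under stated assumptions: the helper `check_target_language` is hypothetical, the pattern only tests whether a string looks like a BCP-47 tag, and the single suggestion is the one given in the table. It cannot tell whether the service actually supports a given code.

```python
import re

# Shape of a simple BCP-47 tag: a 2-3 letter language subtag, optionally
# followed by script or region subtags such as "Hans" or "US".
_TAG = re.compile(r"^[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*$")

# Replacement recommended by the Break & fix table.
_SUGGESTIONS = {"zh": "zh-Hans"}


def check_target_language(code):
    """Return (ok, message) for a candidate target language code."""
    if not _TAG.match(code):
        return False, f"'{code}' is not shaped like a BCP-47 code (for example 'es')"
    suggestion = _SUGGESTIONS.get(code.lower())
    if suggestion:
        return False, f"use '{suggestion}' instead of '{code}'"
    return True, "ok"


for candidate in ("es", "spanish", "zh", "zh-Hans"):
    print(candidate, check_target_language(candidate))
```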
Knowledge Check
1. What class is used for speech translation instead of SpeechConfig?
2. How do you specify multiple target languages for speech translation?
3. What is the difference between the 'recognizing' and 'recognized' events?
4. How do you enable speech-to-speech translation (synthesized output)?
5. What advantage does speech translation have over separate STT + Translator API?
Cleanup
az group delete --name rg-ai102-speech --yes --no-wait