Challenge 36: Speech Translation

Estimated Time

45 min | Cost: $2-5 (estimated) | Domain: Implement NLP Solutions (15-20%)

Exam skills covered

  • Translate speech-to-text in multiple languages
  • Implement speech-to-speech translation
  • Configure continuous translation sessions

Overview

Azure Speech Translation combines speech recognition and text translation in a single pipeline:

Audio Input → Speech Recognition → Translation → Text/Speech Output

Key differences from separate STT + Translator:

  • Single API call — lower latency
  • Streaming — real-time partial results
  • Speech-to-speech — direct translated audio output
  • Supports 70+ languages for speech-to-text translation

Classes: SpeechTranslationConfig, TranslationRecognizer

Prerequisites

  • Azure subscription
  • Azure Speech resource
  • Python 3.9+ or .NET 8
  • Package: azure-cognitiveservices-speech (v1.38+)

Implementation

Task 1: Single-Shot Speech Translation

import os
import azure.cognitiveservices.speech as speechsdk

# Configure translation
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)

# Set source language (speech input)
translation_config.speech_recognition_language = "en-US"

# Add target languages (text output)
translation_config.add_target_language("es")
translation_config.add_target_language("fr")
translation_config.add_target_language("de")
translation_config.add_target_language("ja")

# Configure audio input from file
audio_config = speechsdk.audio.AudioConfig(filename="english-speech.wav")
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config,
    audio_config=audio_config
)

# Single utterance translation
print("Translating speech...")
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print(f"Recognized (en): {result.text}")
    print("\nTranslations:")
    for lang, translation in result.translations.items():
        print(f" [{lang}] {translation}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech recognized")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation = result.cancellation_details
    print(f"Canceled: {cancellation.reason} - {cancellation.error_details}")

Task 2: Continuous Speech Translation

import os
import threading
import azure.cognitiveservices.speech as speechsdk

translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("es")
translation_config.add_target_language("fr")

audio_config = speechsdk.audio.AudioConfig(filename="conversation.wav")
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config,
    audio_config=audio_config
)

translations_log = []
done = threading.Event()  # set by a callback when the session ends

def recognizing_handler(evt):
    """Partial/interim results (streaming)"""
    print(f" [Partial] {evt.result.text}")

def recognized_handler(evt):
    """Final results"""
    if evt.result.reason == speechsdk.ResultReason.TranslatedSpeech:
        print(f"\n[Final] EN: {evt.result.text}")
        for lang, text in evt.result.translations.items():
            print(f" {lang.upper()}: {text}")
        translations_log.append({
            "source": evt.result.text,
            "translations": dict(evt.result.translations)
        })

def canceled_handler(evt):
    print(f"Canceled: {evt.cancellation_details.reason}")
    done.set()

def stopped_handler(evt):
    done.set()

# Wire up events
recognizer.recognizing.connect(recognizing_handler)
recognizer.recognized.connect(recognized_handler)
recognizer.canceled.connect(canceled_handler)
recognizer.session_stopped.connect(stopped_handler)

# Start continuous translation and block until a callback signals completion
print("Starting continuous translation...\n")
recognizer.start_continuous_recognition()
done.wait()
recognizer.stop_continuous_recognition()

print(f"\n{'='*50}")
print(f"Translated {len(translations_log)} segments")
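The entries collected in translations_log are plain dictionaries, so they can be serialized once the session ends. The following is a minimal sketch, not part of the original lab: the function name translations_log_to_json is hypothetical, and the sample entry only mimics the shape that recognized_handler appends.

```python
import json

def translations_log_to_json(log):
    """Serialize a list of {"source": ..., "translations": {...}} dicts to JSON text."""
    # ensure_ascii=False keeps accented and CJK characters readable in the output
    return json.dumps(log, ensure_ascii=False, indent=2)

# Sample entry in the same shape as the items appended by recognized_handler
sample_log = [
    {
        "source": "The quarterly results exceeded expectations.",
        "translations": {
            "es": "Los resultados trimestrales superaron las expectativas.",
            "fr": "Les résultats trimestriels ont dépassé les attentes.",
        },
    }
]

json_text = translations_log_to_json(sample_log)
print(json_text)
```

In the Task 2 script, the same function could be called with translations_log after stop_continuous_recognition() returns, and the resulting text written to a file opened with encoding="utf-8".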

Task 3: Speech-to-Speech Translation (with voice synthesis)

import os
import azure.cognitiveservices.speech as speechsdk

# Configure speech-to-speech: translate and synthesize output
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("es")

# Set voice for synthesized translation output (must match the target language)
translation_config.voice_name = "es-ES-ElviraNeural"

audio_config = speechsdk.audio.AudioConfig(filename="english-speech.wav")
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config,
    audio_config=audio_config
)

# Handle synthesized audio
def synthesis_handler(evt):
    """Handle translated speech audio output"""
    # The synthesis result exposes its bytes as `audio`;
    # an empty payload signals that synthesis has finished
    audio_data = evt.result.audio
    if len(audio_data) > 0:
        print(f" Synthesized audio: {len(audio_data)} bytes")
        # Append so that audio delivered across several events lands in one file
        with open("translated-speech-es.wav", "ab") as f:
            f.write(audio_data)

recognizer.synthesizing.connect(synthesis_handler)

# Translate and synthesize
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print(f"Source (EN): {result.text}")
    print(f"Target (ES): {result.translations['es']}")
    print("Audio output saved to: translated-speech-es.wav")
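Because the handler above opens the output file in append mode, running the script twice adds new audio to the old file. One possible alternative is to buffer the chunks in memory and write them once. The sketch below is illustrative only: AudioCollector is a hypothetical helper, not an SDK class, and it assumes that each synthesizing event carries its bytes in evt.result.audio and that an empty payload marks the end of synthesis. The demonstration uses stand-in event objects, so no Azure call is made.

```python
from types import SimpleNamespace

class AudioCollector:
    """Hypothetical helper that buffers audio chunks from `synthesizing` events."""

    def __init__(self):
        self.chunks = []
        self.completed = False

    def __call__(self, evt):
        audio = evt.result.audio  # assumed: bytes payload of the synthesis result
        if len(audio) == 0:
            # Assumed: an empty payload means synthesis is finished
            self.completed = True
        else:
            self.chunks.append(audio)

    def save(self, path):
        """Write all buffered chunks to `path`, replacing any existing file."""
        data = b"".join(self.chunks)
        with open(path, "wb") as f:
            f.write(data)
        return len(data)

# Offline demonstration with stand-in events instead of real SDK events
collector = AudioCollector()
for payload in (b"chunk-one", b"chunk-two", b""):
    collector(SimpleNamespace(result=SimpleNamespace(audio=payload)))

print(collector.completed, len(collector.chunks))
```

With a real recognizer, an instance would presumably be passed to recognizer.synthesizing.connect(...) before recognize_once() is called, followed by collector.save("translated-speech-es.wav"). Whether concatenated chunks form a single valid WAV file depends on how the service frames the audio, so it is worth checking the result in a player.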

Expected Output

Translating speech...
Recognized (en): The quarterly results exceeded expectations with a fifteen percent increase.

Translations:
[es] Los resultados trimestrales superaron las expectativas con un aumento del quince por ciento.
[fr] Les résultats trimestriels ont dépassé les attentes avec une augmentation de quinze pour cent.
[de] Die Quartalsergebnisse übertrafen die Erwartungen mit einem Anstieg von fünfzehn Prozent.
[ja] 四半期の結果は15パーセントの増加で期待を上回りました。

Starting continuous translation...
[Partial] The quarterly
[Partial] The quarterly results

[Final] EN: The quarterly results exceeded expectations.
ES: Los resultados trimestrales superaron las expectativas.
FR: Les résultats trimestriels ont dépassé les attentes.

==================================================
Translated 3 segments

Break & fix

Scenario | Symptom | Root Cause | Fix
No translations returned | Empty translations dict | Target language not added to config | Call add_target_language() before creating recognizer
Wrong source language | Garbled recognition | Source language mismatch | Set correct speech_recognition_language
Synthesis not working | No audio output | Voice name not set or mismatched language | Set voice_name matching target language
Partial results missing | No interim feedback | recognizing event not connected | Connect to recognizing event for streaming results
Language code error | Invalid language | Using wrong code format | Use BCP-47 codes: "es" not "spanish", "zh-Hans" not "zh"
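
The language code row above can be caught before any call to the service with a simple shape check. This is a rough sketch under stated assumptions: the function names are hypothetical, and the pattern is a simplified language[-Script][-REGION] shape rather than a full BCP-47 parser. It cannot tell which tags the service accepts, so "zh" passes the check even though the table recommends "zh-Hans".

```python
import re

# Simplified shape: language[-Script][-REGION], e.g. "es", "zh-Hans", "en-US".
# This is an approximation, not a complete BCP-47 validator.
_TAG_SHAPE = re.compile(r"^[a-z]{2,3}(-[A-Z][a-z]{3})?(-[A-Z]{2})?$")

def looks_like_language_tag(code):
    """Return True if `code` has the rough shape of a language tag."""
    return bool(_TAG_SHAPE.match(code))

def find_suspect_codes(codes):
    """Return the entries that do not look like language tags."""
    return [c for c in codes if not looks_like_language_tag(c)]

print(find_suspect_codes(["es", "spanish", "zh-Hans", "fr"]))  # ['spanish']
```

A check like this could run over the intended target languages before they are passed to add_target_language(), so that a name such as "spanish" fails fast with a clear local message.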

Knowledge Check

1. What class is used for speech translation instead of SpeechConfig?

2. How do you specify multiple target languages for speech translation?

3. What is the difference between the 'recognizing' and 'recognized' events?

4. How do you enable speech-to-speech translation (synthesized output)?

5. What advantage does speech translation have over separate STT + Translator API?

Cleanup

az group delete --name rg-ai102-speech --yes --no-wait

Learn More