Challenge 36: Speech Translation

Estimated Time

45 min | Cost: $2-5 (estimated) | Domain: Implement NLP Solutions (15-20%)

Exam skills covered

Translate speech-to-text in multiple languages
Implement speech-to-speech translation
Configure continuous translation sessions

Overview

Azure Speech Translation combines speech recognition and text translation in a single pipeline:

Audio Input → Speech Recognition → Translation → Text/Speech Output

Key differences from separate STT + Translator:

Single API call — lower latency
Streaming — real-time partial results
Speech-to-speech — direct translated audio output
Supports 70+ languages for speech-to-text translation

Classes: SpeechTranslationConfig, TranslationRecognizer

Prerequisites

Azure subscription
Azure Speech resource
Python 3.9+ or .NET 8
Package: azure-cognitiveservices-speech (v1.38+)
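
The task code reads the key and region from the AZURE_SPEECH_KEY and AZURE_SPEECH_REGION environment variables, so a missing variable surfaces as a bare KeyError. A minimal sketch of a pre-flight check, assuming those two variable names; load_speech_settings is a hypothetical helper for this lab and not part of the Speech SDK:

```python
import os


def load_speech_settings(env=None):
    """Return (key, region) or raise with a list of what is missing.

    Hypothetical helper for this lab; not part of the Speech SDK.
    """
    env = os.environ if env is None else env
    required = ("AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION")
    # Treat unset and blank values the same way
    missing = [name for name in required if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["AZURE_SPEECH_KEY"].strip(), env["AZURE_SPEECH_REGION"].strip()


if __name__ == "__main__":
    try:
        key, region = load_speech_settings()
        print(f"Speech settings found for region: {region}")
    except RuntimeError as err:
        print(err)
```

The returned pair can be passed as the subscription and region arguments of SpeechTranslationConfig.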

Implementation

Task 1: Single-Shot Speech Translation

Python SDK

import os
import azure.cognitiveservices.speech as speechsdk

# Configure translation
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)

# Set source language (speech input)
translation_config.speech_recognition_language = "en-US"

# Add target languages (text output)
translation_config.add_target_language("es")
translation_config.add_target_language("fr")
translation_config.add_target_language("de")
translation_config.add_target_language("ja")

# Configure audio input from file
audio_config = speechsdk.audio.AudioConfig(filename="english-speech.wav")
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config,
    audio_config=audio_config
)

# Single utterance translation
print("Translating speech...")
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print(f"Recognized (en): {result.text}")
    print("\nTranslations:")
    for lang, translation in result.translations.items():
        print(f"  [{lang}] {translation}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech recognized")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation = result.cancellation_details
    print(f"Canceled: {cancellation.reason} - {cancellation.error_details}")

C# SDK

using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;  // AudioConfig lives here
using Microsoft.CognitiveServices.Speech.Translation;

var translationConfig = SpeechTranslationConfig.FromSubscription(
    Environment.GetEnvironmentVariable("AZURE_SPEECH_KEY"),
    Environment.GetEnvironmentVariable("AZURE_SPEECH_REGION"));

translationConfig.SpeechRecognitionLanguage = "en-US";
translationConfig.AddTargetLanguage("es");
translationConfig.AddTargetLanguage("fr");
translationConfig.AddTargetLanguage("de");

using var audioConfig = AudioConfig.FromWavFileInput("english-speech.wav");
using var recognizer = new TranslationRecognizer(translationConfig, audioConfig);

var result = await recognizer.RecognizeOnceAsync();

if (result.Reason == ResultReason.TranslatedSpeech)
{
    Console.WriteLine($"Recognized: {result.Text}");
    foreach (var (lang, text) in result.Translations)
        Console.WriteLine($"  [{lang}] {text}");
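}
else if (result.Reason == ResultReason.NoMatch)
{
    Console.WriteLine("No speech recognized");
}
else if (result.Reason == ResultReason.Canceled)
{
    var cancellation = CancellationDetails.FromResult(result);
    Console.WriteLine($"Canceled: {cancellation.Reason} - {cancellation.ErrorDetails}");
}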

Task 2: Continuous Speech Translation

Python SDK

import os
import threading
import azure.cognitiveservices.speech as speechsdk

translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("es")
translation_config.add_target_language("fr")

audio_config = speechsdk.audio.AudioConfig(filename="conversation.wav")
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config,
    audio_config=audio_config
)

translations_log = []
done = threading.Event()

def recognizing_handler(evt):
    """Partial/interim results (streaming)"""
    print(f"  [Partial] {evt.result.text}")

def recognized_handler(evt):
    """Final results"""
    if evt.result.reason == speechsdk.ResultReason.TranslatedSpeech:
        print(f"\n[Final] EN: {evt.result.text}")
        for lang, text in evt.result.translations.items():
            print(f"        {lang.upper()}: {text}")
        translations_log.append({
            "source": evt.result.text,
            "translations": dict(evt.result.translations)
        })

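def canceled_handler(evt):
    """Fires on errors and when the audio file reaches end of stream"""
    details = evt.cancellation_details
    print(f"Canceled: {details.reason}")
    if details.reason == speechsdk.CancellationReason.Error:
        print(f"Error details: {details.error_details}")
    done.set()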

def stopped_handler(evt):
    done.set()

# Wire up events
recognizer.recognizing.connect(recognizing_handler)
recognizer.recognized.connect(recognized_handler)
recognizer.canceled.connect(canceled_handler)
recognizer.session_stopped.connect(stopped_handler)

# Start continuous translation
print("Starting continuous translation...\n")
recognizer.start_continuous_recognition()
done.wait()
recognizer.stop_continuous_recognition()

print(f"\n{'='*50}")
print(f"Translated {len(translations_log)} segments")
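
The translations_log list built by recognized_handler holds one dictionary per final segment, with "source" and "translations" keys. A minimal sketch of persisting it as JSON Lines for later review; the function name save_translations_log and the output filename are illustrative assumptions, not part of the lab:

```python
import json


def save_translations_log(log, path):
    """Write one JSON object per line and return the number of segments written."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in log:
            # ensure_ascii=False keeps accented and CJK text readable in the file
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return len(log)


if __name__ == "__main__":
    # Sample entry in the same shape recognized_handler appends
    sample_log = [
        {
            "source": "The quarterly results exceeded expectations.",
            "translations": {
                "es": "Los resultados trimestrales superaron las expectativas.",
                "fr": "Les résultats trimestriels ont dépassé les attentes.",
            },
        }
    ]
    count = save_translations_log(sample_log, "translations-log.jsonl")
    print(f"Saved {count} segments")
```

In the continuous session, the call would go after stop_continuous_recognition(), once no more handlers can append to the list.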

Task 3: Speech-to-Speech Translation (with voice synthesis)

Python SDK

import os
import azure.cognitiveservices.speech as speechsdk

# Configure speech-to-speech: translate and synthesize output
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("es")

# Set voice for synthesized translation output
translation_config.voice_name = "es-ES-ElviraNeural"

audio_config = speechsdk.audio.AudioConfig(filename="english-speech.wav")
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config,
    audio_config=audio_config
)

# Handle synthesized audio
def synthesis_handler(evt):
    """Handle translated speech audio output"""
    # Audio chunks arrive with reason SynthesizingAudio; the closing
    # SynthesizingAudioCompleted event carries no audio bytes.
    if evt.result.reason == speechsdk.ResultReason.SynthesizingAudio:
        audio_data = evt.result.audio
        print(f"  Synthesized audio: {len(audio_data)} bytes")
        # Append to file (delete any old copy before re-running)
        with open("translated-speech-es.wav", "ab") as f:
            f.write(audio_data)

recognizer.synthesizing.connect(synthesis_handler)

# Translate and synthesize
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print(f"Source (EN): {result.text}")
    print(f"Target (ES): {result.translations['es']}")
    print(f"Audio output saved to: translated-speech-es.wav")
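
Voice names such as es-ES-ElviraNeural begin with a locale, so a mismatch between voice_name and the target language can be caught before the recognizer is created. A rough sketch, assuming voice names follow that locale-prefix pattern; voice_matches_target is a hypothetical helper, and it compares only the primary language subtag:

```python
def voice_matches_target(voice_name, target_language):
    """Return True when the voice's language subtag equals the target's.

    Assumes voice names look like "<language>-<REGION>-<Name>Neural".
    Only the primary subtag is compared, so script or region differences
    are not detected.
    """
    voice_language = voice_name.split("-", 1)[0].lower()
    target_primary = target_language.split("-", 1)[0].lower()
    return bool(voice_language) and voice_language == target_primary


if __name__ == "__main__":
    print(voice_matches_target("es-ES-ElviraNeural", "es"))  # True
    print(voice_matches_target("es-ES-ElviraNeural", "fr"))  # False
```

A check like this only guards against obvious mix-ups; whether a given voice exists in your region still has to be confirmed against the service.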

Expected Output

Translating speech...
Recognized (en): The quarterly results exceeded expectations with a fifteen percent increase.

Translations:
  [es] Los resultados trimestrales superaron las expectativas con un aumento del quince por ciento.
  [fr] Les résultats trimestriels ont dépassé les attentes avec une augmentation de quinze pour cent.
  [de] Die Quartalsergebnisse übertrafen die Erwartungen mit einem Anstieg von fünfzehn Prozent.
  [ja] 四半期の結果は15パーセントの増加で期待を上回りました。

Starting continuous translation...
  [Partial] The quarterly
  [Partial] The quarterly results

[Final] EN: The quarterly results exceeded expectations.
        ES: Los resultados trimestrales superaron las expectativas.
        FR: Les résultats trimestriels ont dépassé les attentes.

==================================================
Translated 3 segments

Break & fix

Scenario	Symptom	Root Cause	Fix
No translations returned	Empty translations dict	Target language not added to config	Call `add_target_language()` before creating recognizer
Wrong source language	Garbled recognition	Source language mismatch	Set correct `speech_recognition_language`
Synthesis not working	No audio output	Voice name not set or mismatched language	Set `voice_name` matching target language
Partial results missing	No interim feedback	`recognizing` event not connected	Connect to `recognizing` event for streaming results
Language code error	Invalid language	Using wrong code format	Use BCP-47 codes: "es" not "spanish", "zh-Hans" not "zh"
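
The last row of the table can be turned into a quick check before calling add_target_language(). A minimal sketch that tests only the shape of a code (language, optional script, optional region); looks_like_language_code is a hypothetical helper, the pattern is a simplification of BCP-47, and passing it does not mean the service supports the language:

```python
import re

# Simplified BCP-47 shape: language[-Script][-REGION], e.g. "es", "zh-Hans", "en-US"
_LANGUAGE_CODE = re.compile(r"^[a-z]{2,3}(-[A-Z][a-z]{3})?(-[A-Z]{2})?$")


def looks_like_language_code(code):
    """Return True if the string has the shape of a language code."""
    return _LANGUAGE_CODE.fullmatch(code) is not None


if __name__ == "__main__":
    for candidate in ["es", "zh-Hans", "spanish", "ES"]:
        print(candidate, looks_like_language_code(candidate))
```

A shape check cannot catch a case like "zh", which is well-formed but, per the table, should be written as "zh-Hans".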

Knowledge Check

1. What class is used for speech translation instead of SpeechConfig?

2. How do you specify multiple target languages for speech translation?

3. What is the difference between the 'recognizing' and 'recognized' events?

4. How do you enable speech-to-speech translation (synthesized output)?

5. What advantage does speech translation have over separate STT + Translator API?

Cleanup

az group delete --name rg-ai102-speech --yes --no-wait
