Challenge 16: Speech Recognition and Synthesis

Estimated Time: 20-30 min | Cost: Free | Domain: Natural Language Processing (15-20%)

Exam skills covered

  • Identify features and uses for speech recognition
  • Identify features and uses for speech synthesis
  • Identify Azure AI Speech service capabilities

Overview

Speech recognition (speech-to-text) converts spoken audio into written text. This powers applications like meeting transcription, voice assistants, closed captioning, and voice commands. Azure AI Speech supports real-time transcription (processing audio as it streams) and batch transcription (processing pre-recorded audio files). It recognizes natural speech patterns including hesitations, filler words, and different speaking styles.

Speech synthesis (text-to-speech) converts written text into natural-sounding spoken audio. Modern neural text-to-speech voices sound remarkably human, with natural intonation, emphasis, and rhythm. Azure AI Speech offers 500+ neural voices across 140+ languages and variants. Use cases include virtual assistants, audiobook narration, accessibility features for visually impaired users, and automated phone systems.

Both capabilities are part of the Azure AI Speech service, which also includes speech translation (real-time translation of spoken audio) and speaker recognition (identifying who is speaking). Together, these enable natural voice-based human-computer interaction.

Explore

Task 1: Understand speech-to-text capabilities

Speech-to-text converts audio into text. Review the key variations:

Feature | Description | Use Case
Real-time transcription | Converts speech to text as it's spoken | Live captions, voice commands
Batch transcription | Processes pre-recorded audio files | Meeting recordings, call center logs
Custom Speech | Trains models for specific vocabulary/accents | Medical terminology, product names
Conversation transcription | Multi-speaker recognition | Meeting notes with speaker labels

Key capabilities:

  • Automatic punctuation and capitalization
  • Profanity filtering options
  • Word-level timestamps
  • Speaker diarization (identifying different speakers)
  • Support for 100+ languages and dialects
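
To make word-level timestamps and speaker diarization concrete, here is a minimal sketch of how labeled words could be merged into a speaker-by-speaker transcript. The `Word` class, its field names, and the `Guest-1`/`Guest-2` labels are hypothetical and chosen for illustration; they are not the result types returned by Azure AI Speech.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One recognized word. Hypothetical shape, not an Azure SDK type."""
    text: str
    start_s: float  # offset from the start of the audio, in seconds
    speaker: str    # label assigned by diarization (assumed format)


def speaker_turns(words: List[Word]) -> List[str]:
    """Merge consecutive words from the same speaker into one labeled line."""
    turns: List[str] = []
    current_speaker = None
    buffer: List[str] = []
    turn_start = 0.0
    for word in words:
        if word.speaker != current_speaker:
            # Speaker changed: flush the previous turn before starting a new one.
            if buffer:
                turns.append(f"[{turn_start:.1f}s] {current_speaker}: {' '.join(buffer)}")
            current_speaker = word.speaker
            buffer = []
            turn_start = word.start_s
        buffer.append(word.text)
    if buffer:
        turns.append(f"[{turn_start:.1f}s] {current_speaker}: {' '.join(buffer)}")
    return turns


sample = [
    Word("Good", 0.0, "Guest-1"),
    Word("morning.", 0.4, "Guest-1"),
    Word("Hi,", 1.2, "Guest-2"),
    Word("thanks.", 1.5, "Guest-2"),
]
for line in speaker_turns(sample):
    print(line)
```

Each printed line starts with the timestamp of the first word in the turn, which is roughly how meeting notes with speaker labels are laid out.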

Task 2: Explore Azure AI Speech Studio

Navigate to: speech.microsoft.com

  1. Browse the Speech Studio interface
  2. Look at the available demos:
    • Real-time speech-to-text — Try speaking or upload audio
    • Text-to-speech — Enter text and hear it spoken
    • Pronunciation assessment — Evaluate pronunciation quality
  3. Under Text to Speech, explore:
    • Different voice options (neural voices)
    • Different languages and regional variants
    • Voice styles (cheerful, sad, angry, etc. for some voices)

Task 3: Understand text-to-speech features

Text-to-speech (TTS) converts text into natural-sounding audio. Review the options:

Feature | Description
Neural voices | AI-generated voices with natural intonation (500+ available)
SSML control | Speech Synthesis Markup Language for fine-tuning pronunciation, speed, pitch
Voice styles | Emotional variations (cheerful, empathetic, angry) for select voices
Custom Neural Voice | Create a unique branded voice from training audio
Audio format options | WAV, MP3, OGG and other formats

SSML Example — controlling speech output:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <prosody rate="slow" pitch="low">
      Welcome to Azure AI Speech services.
    </prosody>
  </voice>
</speak>
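
When the text to be spoken comes from user input, hand-writing SSML becomes fragile, because characters such as `&` and `<` must be escaped. The sketch below generates the same document with Python's standard `xml.etree.ElementTree` module. The function name `build_ssml` and its defaults are assumptions made for illustration, not part of any Azure SDK.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(text, voice="en-US-JennyNeural", rate="slow", pitch="low", lang="en-US"):
    """Return an SSML string equivalent to the hand-written example (hypothetical helper)."""
    # Serialize the SSML namespace as the default xmlns instead of an ns0: prefix.
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    voice_el = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice})
    prosody = ET.SubElement(voice_el, f"{{{SSML_NS}}}prosody", {"rate": rate, "pitch": pitch})
    prosody.text = text  # ElementTree escapes &, < and > on output
    return ET.tostring(speak, encoding="unicode")


print(build_ssml("Welcome to Azure AI Speech services."))
```

Changing `rate` or `pitch` in the call is enough to experiment with different prosody settings without editing the XML by hand.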

Task 4: Compare real-time vs batch processing

Aspect | Real-time | Batch
Input | Streaming audio (microphone) | Audio files (WAV, MP3, etc.)
Latency | Immediate results | Minutes to hours
Best for | Live captioning, voice assistants | Processing recordings, archives
Duration | Continuous or short utterances | Up to hundreds of hours
Output | Streaming text results | JSON/text files with timestamps

Your task: Consider these scenarios and decide which mode fits:

  1. A doctor dictating patient notes during an appointment → Real-time
  2. A company processing 1,000 recorded customer service calls → Batch
  3. Adding subtitles to a live webinar → Real-time
  4. Transcribing a library of podcast episodes → Batch
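
The reasoning behind these answers can be written down as a simple rule of thumb: live audio, or any need for immediate text, points to real-time; pre-recorded audio points to batch. The sketch below encodes that rule. The function `choose_mode` and its parameters are hypothetical and only illustrate the decision; they are not an Azure API.

```python
def choose_mode(audio_is_live: bool, needs_immediate_text: bool = False) -> str:
    """Rule of thumb from the comparison table (illustrative only)."""
    if audio_is_live or needs_immediate_text:
        return "real-time"
    return "batch"


# (scenario description, is the audio live?)
SCENARIOS = [
    ("Doctor dictating patient notes during an appointment", True),
    ("1,000 recorded customer service calls", False),
    ("Subtitles for a live webinar", True),
    ("Library of podcast episodes", False),
]

for description, is_live in SCENARIOS:
    print(f"{description}: {choose_mode(is_live)}")
```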

Azure CLI Alternative

# Create an Azure AI Speech resource (Free tier)
az cognitiveservices account create \
  --name my-speech-resource \
  --resource-group myResourceGroup \
  --kind SpeechServices \
  --sku F0 \
  --location eastus

# List the access keys for the Speech resource
az cognitiveservices account keys list \
  --name my-speech-resource \
  --resource-group myResourceGroup
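
Once the resource exists, one of its keys and its region are what a client needs to call the service. As a rough sketch, the code below assembles (but does not send) a text-to-speech request for the Speech REST endpoint using only the Python standard library. The helper name `build_tts_request`, the `SPEECH_KEY` environment variable, the `User-Agent` value, and the chosen output format are assumptions for illustration; check the current REST reference before relying on the exact header values.

```python
import os
import urllib.request


def build_tts_request(ssml: str, key: str, region: str = "eastus") -> urllib.request.Request:
    """Assemble a POST request for the regional TTS endpoint (hypothetical helper)."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,           # a key from `keys list`
        "Content-Type": "application/ssml+xml",     # the request body is SSML
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",  # WAV output (assumed choice)
        "User-Agent": "speech-challenge-demo",      # placeholder application name
    }
    return urllib.request.Request(url, data=ssml.encode("utf-8"), headers=headers, method="POST")


ssml = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    '<voice name="en-US-JennyNeural">Welcome to Azure AI Speech services.</voice>'
    "</speak>"
)
request = build_tts_request(ssml, key=os.environ.get("SPEECH_KEY", "<your-key>"))
print(request.get_method(), request.full_url)

# Sending requires a real key, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     with open("welcome.wav", "wb") as audio_file:
#         audio_file.write(response.read())
```

Reading the key from an environment variable rather than hard-coding it keeps the secret out of source control.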

Key Concepts

Concept | Definition
Speech-to-text (STT) | Converts spoken audio into written text (also called speech recognition)
Text-to-speech (TTS) | Converts written text into natural-sounding spoken audio (also called speech synthesis)
Neural voice | AI-generated voice that uses deep neural networks for natural-sounding speech
SSML | Speech Synthesis Markup Language, an XML-based format for controlling speech output
Speaker diarization | Identifying and labeling different speakers in an audio recording
Custom Speech | Training a speech recognition model on domain-specific vocabulary or acoustic conditions

Common Misconceptions

Misconception | Reality
Speech-to-text requires silence/studio conditions | Modern models handle background noise, accents, and natural speech patterns well
Text-to-speech always sounds robotic | Neural voices are nearly indistinguishable from human speech in many cases
You need a custom model for basic transcription | The pre-built models work well for general speech; custom models are for specialized vocabulary
Speech services only work in English | Azure AI Speech supports 100+ languages for STT and 140+ languages for TTS
Real-time transcription is always better than batch | Batch is better for large volumes of pre-recorded audio and provides richer metadata

Knowledge Check

1. A call center wants to transcribe thousands of recorded customer calls to analyze them later. Which speech capability should they use?

2. What technology makes modern text-to-speech voices sound natural and human-like?

3. Which feature of speech-to-text identifies different speakers in a conversation?

4. A hospital needs speech recognition that accurately transcribes medical terminology like drug names and procedures. What should they use?

5. What is SSML used for in Azure AI Speech?

Learn More