Challenge 16: Speech Recognition and Synthesis
20-30 min | Cost: Free | Domain: Natural Language Processing (15-20%)
Exam skills covered
- Identify features and uses for speech recognition
- Identify features and uses for speech synthesis
- Identify Azure AI Speech service capabilities
Overview
Speech recognition (speech-to-text) converts spoken audio into written text. This powers applications like meeting transcription, voice assistants, closed captioning, and voice commands. Azure AI Speech supports real-time transcription (processing audio as it streams) and batch transcription (processing pre-recorded audio files). It recognizes natural speech patterns including hesitations, filler words, and different speaking styles.
Speech synthesis (text-to-speech) converts written text into natural-sounding spoken audio. Modern neural text-to-speech voices sound remarkably human, with natural intonation, emphasis, and rhythm. Azure AI Speech offers 500+ neural voices across 140+ languages and variants. Use cases include virtual assistants, audiobook narration, accessibility features for visually impaired users, and automated phone systems.
Both capabilities are part of the Azure AI Speech service, which also includes speech translation (real-time translation of spoken audio) and speaker recognition (identifying who is speaking). Together, these enable natural voice-based human-computer interaction.
Explore
Task 1: Understand speech-to-text capabilities
Speech-to-text converts audio into text. Review the key variations:
| Feature | Description | Use Case |
|---|---|---|
| Real-time transcription | Converts speech to text as it's spoken | Live captions, voice commands |
| Batch transcription | Processes pre-recorded audio files | Meeting recordings, call center logs |
| Custom Speech | Trains models for specific vocabulary/accents | Medical terminology, product names |
| Conversation transcription | Multi-speaker recognition | Meeting notes with speaker labels |
Key capabilities:
- Automatic punctuation and capitalization
- Profanity filtering options
- Word-level timestamps
- Speaker diarization (identifying different speakers)
- Support for 100+ languages and dialects
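Word-level timestamps and speaker diarization are easier to picture with data in hand. The sketch below is only an illustration: the `phrases` tuples and the `to_transcript` helper are hypothetical names with a simplified shape invented here, not the actual response format of Azure AI Speech. It shows how speaker labels and start offsets could be turned into meeting notes.

```python
from itertools import groupby

# Hypothetical recognized phrases: (speaker label, start offset in seconds, text).
# This shape is an assumption for illustration, not the service's real output schema.
phrases = [
    ("Speaker 1", 0.0, "Good morning, everyone."),
    ("Speaker 1", 1.8, "Let's get started."),
    ("Speaker 2", 3.5, "Thanks. I have one update."),
    ("Speaker 1", 6.2, "Go ahead."),
]


def to_transcript(items):
    """Merge consecutive phrases from the same speaker into one labeled line."""
    lines = []
    for speaker, group in groupby(items, key=lambda p: p[0]):
        group = list(group)
        start = group[0][1]                      # offset of the first phrase in the turn
        text = " ".join(p[2] for p in group)     # join the turn's phrases
        lines.append(f"[{start:.1f}s] {speaker}: {text}")
    return lines


for line in to_transcript(phrases):
    print(line)
```

Each printed line corresponds to one speaker turn, which is the kind of output the "Meeting notes with speaker labels" use case relies on.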
Task 2: Explore Azure AI Speech Studio
Navigate to: speech.microsoft.com
- Browse the Speech Studio interface
- Look at the available demos:
  - Real-time speech-to-text: try speaking or upload audio
  - Text-to-speech: enter text and hear it spoken
  - Pronunciation assessment: evaluate pronunciation quality
- Under Text to Speech, explore:
  - Different voice options (neural voices)
  - Different languages and regional variants
  - Voice styles (cheerful, sad, angry, etc. for some voices)
Task 3: Understand text-to-speech features
Text-to-speech (TTS) converts text into natural-sounding audio. Review the options:
| Feature | Description |
|---|---|
| Neural voices | AI-generated voices with natural intonation (500+ available) |
| SSML control | Speech Synthesis Markup Language for fine-tuning pronunciation, speed, pitch |
| Voice styles | Emotional variations (cheerful, empathetic, angry) for select voices |
| Custom Neural Voice | Create a unique branded voice from training audio |
| Audio format options | WAV, MP3, OGG and other formats |
SSML Example — controlling speech output:
```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <prosody rate="slow" pitch="low">
      Welcome to Azure AI Speech services.
    </prosody>
  </voice>
</speak>
```
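When SSML is generated from user-supplied text, characters such as `&` and `<` must be escaped or the document stops being valid XML. The following is a minimal sketch using only the Python standard library; `build_ssml` is a hypothetical helper name, and its defaults simply mirror the example above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice="en-US-JennyNeural", rate="slow", pitch="low", lang="en-US"):
    """Return an SSML string; the text and attribute values are XML-escaped."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}"
        "</prosody></voice></speak>"
    )


ssml = build_ssml("Q&A starts at 3 pm.")
ET.fromstring(ssml)  # raises ParseError if the document is not well-formed
print(ssml)
```

Parsing the result with `ET.fromstring` is a cheap well-formedness check before the SSML is sent to a synthesizer; it does not validate voice names or prosody values.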
Task 4: Compare real-time vs batch processing
| Aspect | Real-time | Batch |
|---|---|---|
| Input | Streaming audio (microphone or audio stream) | Audio files (WAV, MP3, etc.) |
| Latency | Immediate results | Minutes to hours |
| Best for | Live captioning, voice assistants | Processing recordings, archives |
| Duration | Continuous or short utterances | Up to hundreds of hours |
| Output | Streaming text results | JSON/text files with timestamps |
Your task: Consider these scenarios and decide which mode fits:
- A doctor dictating patient notes during an appointment → Real-time
- A company processing 1,000 recorded customer service calls → Batch
- Adding subtitles to a live webinar → Real-time
- Transcribing a library of podcast episodes → Batch
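The reasoning behind these answers can be written down as a simple rule: if the audio is live or the text is needed immediately, use real-time; otherwise use batch. The sketch below encodes that rule. `choose_mode` and its two flags are hypothetical names for illustration, and a real decision may also weigh cost and volume.

```python
def choose_mode(audio_is_live: bool, needs_immediate_text: bool) -> str:
    """Return "real-time" for live audio or immediate results, otherwise "batch"."""
    return "real-time" if (audio_is_live or needs_immediate_text) else "batch"


# The four scenarios above, as (audio_is_live, needs_immediate_text) flags.
scenarios = {
    "Doctor dictating notes during an appointment": (True, True),
    "1,000 recorded customer service calls": (False, False),
    "Subtitles for a live webinar": (True, True),
    "Library of podcast episodes": (False, False),
}

for name, flags in scenarios.items():
    print(f"{name}: {choose_mode(*flags)}")
```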
Optional: to try the service with your own resource, create one on the Free tier with the Azure CLI and list its keys:

```bash
# Create an Azure AI Speech resource (Free tier)
az cognitiveservices account create \
  --name my-speech-resource \
  --resource-group myResourceGroup \
  --kind SpeechServices \
  --sku F0 \
  --location eastus

# List the keys for the Speech resource
az cognitiveservices account keys list \
  --name my-speech-resource \
  --resource-group myResourceGroup
```
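A resource key and its region are what a client needs to call the service. As a rough sketch, the code below prepares (but does not send) a text-to-speech REST request with the Python standard library. The endpoint pattern, header names, and output format string reflect the Text to speech REST API as best understood here and should be checked against the current documentation; `build_tts_request` and the `User-Agent` value are hypothetical.

```python
import urllib.request


def build_tts_request(ssml, key, region="eastus",
                      output_format="audio-16khz-128kbitrate-mono-mp3"):
    """Prepare a POST request carrying SSML; nothing is sent until urlopen is called."""
    # Assumed endpoint pattern for the regional text-to-speech REST API.
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,          # key from `az ... keys list`
        "Content-Type": "application/ssml+xml",    # request body is SSML
        "X-Microsoft-OutputFormat": output_format, # requested audio format
        "User-Agent": "speech-challenge-demo",     # hypothetical application name
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


req = build_tts_request("<speak/>", key="YOUR-KEY-HERE")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` with a real key and a complete SSML document would return the audio bytes; that step is left out so the sketch runs without credentials.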
Key Concepts
| Concept | Definition |
|---|---|
| Speech-to-text (STT) | Converts spoken audio into written text (also called speech recognition) |
| Text-to-speech (TTS) | Converts written text into natural-sounding spoken audio (also called speech synthesis) |
| Neural voice | AI-generated voice that uses deep neural networks for natural-sounding speech |
| SSML | Speech Synthesis Markup Language — XML-based format for controlling speech output |
| Speaker diarization | Identifying and labeling different speakers in an audio recording |
| Custom Speech | Training a speech recognition model on domain-specific vocabulary or acoustic conditions |
Common Misconceptions
| Misconception | Reality |
|---|---|
| Speech-to-text requires silent, studio-quality recording conditions | Modern models handle background noise, accents, and natural speech patterns well |
| Text-to-speech always sounds robotic | Neural voices are nearly indistinguishable from human speech in many cases |
| You need a custom model for basic transcription | The pre-built models work well for general speech; custom models are for specialized vocabulary |
| Speech services only work in English | Azure AI Speech supports 100+ languages and dialects for STT and 140+ languages and variants for TTS |
| Real-time transcription is always better than batch | Batch is the better fit for large volumes of pre-recorded audio and returns results as files with timestamps |
Knowledge Check
1. A call center wants to transcribe thousands of recorded customer calls to analyze them later. Which speech capability should they use?
2. What technology makes modern text-to-speech voices sound natural and human-like?
3. Which feature of speech-to-text identifies different speakers in a conversation?
4. A hospital needs speech recognition that accurately transcribes medical terminology like drug names and procedures. What should they use?
5. What is SSML used for in Azure AI Speech?