Challenge 16: Speech Recognition and Synthesis

Estimated Time

20-30 min | Cost: Free | Domain: Natural Language Processing (15-20%)

Exam skills covered

Identify features and uses for speech recognition
Identify features and uses for speech synthesis
Identify Azure AI Speech service capabilities

Overview

Speech recognition (speech-to-text) converts spoken audio into written text. This powers applications like meeting transcription, voice assistants, closed captioning, and voice commands. Azure AI Speech supports real-time transcription (processing audio as it streams) and batch transcription (processing pre-recorded audio files). It recognizes natural speech patterns including hesitations, filler words, and different speaking styles.

Speech synthesis (text-to-speech) converts written text into natural-sounding spoken audio. Modern neural text-to-speech voices sound remarkably human, with natural intonation, emphasis, and rhythm. Azure AI Speech offers 500+ neural voices across 140+ languages and variants. Use cases include virtual assistants, audiobook narration, accessibility features for visually impaired users, and automated phone systems.

Both capabilities are part of the Azure AI Speech service, which also includes speech translation (real-time translation of spoken audio) and speaker recognition (identifying who is speaking). Together, these enable natural voice-based human-computer interaction.

Explore

Task 1: Understand speech-to-text capabilities

Speech-to-text converts audio into text. Review the key variations:

Feature	Description	Use Case
Real-time transcription	Converts speech to text as it's spoken	Live captions, voice commands
Batch transcription	Processes pre-recorded audio files	Meeting recordings, call center logs
Custom Speech	Trains models for specific vocabulary/accents	Medical terminology, product names
Conversation transcription	Multi-speaker recognition	Meeting notes with speaker labels

Key capabilities:

Automatic punctuation and capitalization
Profanity filtering options
Word-level timestamps
Speaker diarization (identifying different speakers)
Support for 100+ languages and dialects
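
To make word-level timing and speaker diarization more concrete, here is a minimal Python sketch that turns a diarized result into a speaker-labeled transcript. The dictionary shape (recognizedPhrases, speaker, offsetInTicks, nBest, display) is a simplified assumption loosely modeled on batch transcription output, and the sample phrases are invented for illustration; check the actual result files from your own resource before relying on these field names.

```python
TICKS_PER_SECOND = 10_000_000  # assumes one tick = 100 nanoseconds

# Hypothetical, simplified result used only for this illustration.
sample_result = {
    "recognizedPhrases": [
        {
            "speaker": 2,
            "offsetInTicks": 25_000_000,
            "nBest": [{"display": "Hi, I need help with my order."}],
        },
        {
            "speaker": 1,
            "offsetInTicks": 0,
            "nBest": [{"display": "Hello, thanks for calling."}],
        },
    ]
}


def format_transcript(result):
    """Return one 'time + speaker + text' line per phrase, in spoken order."""
    lines = []
    phrases = sorted(result["recognizedPhrases"], key=lambda p: p["offsetInTicks"])
    for phrase in phrases:
        seconds = phrase["offsetInTicks"] / TICKS_PER_SECOND
        text = phrase["nBest"][0]["display"]  # take the top-ranked hypothesis
        lines.append(f"[{seconds:.1f}s] Speaker {phrase['speaker']}: {text}")
    return lines


for line in format_transcript(sample_result):
    print(line)
```

The speaker numbers are labels only: diarization separates voices, but it does not tell you who each person is.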

Task 2: Explore Azure AI Speech Studio

Navigate to: speech.microsoft.com

1. Browse the Speech Studio interface
2. Look at the available demos:
- Real-time speech-to-text: try speaking or upload audio
- Text-to-speech: enter text and hear it spoken
- Pronunciation assessment: evaluate pronunciation quality
3. Under Text to Speech, explore:
- Different voice options (neural voices)
- Different languages and regional variants
- Voice styles (cheerful, sad, angry, etc. for some voices)

Task 3: Understand text-to-speech features

Text-to-speech (TTS) converts text into natural-sounding audio. Review the options:

Feature	Description
Neural voices	AI-generated voices with natural intonation (500+ available)
SSML control	Speech Synthesis Markup Language for fine-tuning pronunciation, speed, pitch
Voice styles	Emotional variations (cheerful, empathetic, angry) for select voices
Custom Neural Voice	Create a unique branded voice from training audio
Audio format options	WAV, MP3, OGG and other formats

SSML Example — controlling speech output:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <prosody rate="slow" pitch="low">
      Welcome to Azure AI Speech services.
    </prosody>
  </voice>
</speak>
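
If you generate SSML from user-supplied text, characters such as & and < must be escaped or the markup becomes invalid. The following Python sketch builds the same document as the example above; the function name build_ssml and its default arguments are illustrative choices, not part of any Azure SDK.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", rate="slow", pitch="low", lang="en-US"):
    """Wrap plain text in speak/voice/prosody elements, escaping XML special characters."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}"
        "</prosody></voice></speak>"
    )


print(build_ssml("Welcome to Azure AI Speech services."))
print(build_ssml("Terms & conditions for <Contoso>", rate="medium"))
```

You can paste the printed output into the Text to Speech demo in Speech Studio to hear how different rate and pitch values change the result.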

Task 4: Compare real-time vs batch processing

Aspect	Real-time	Batch
Input	Streaming audio (microphone)	Audio files (WAV, MP3, etc.)
Latency	Immediate results	Minutes to hours
Best for	Live captioning, voice assistants	Processing recordings, archives
Duration	Continuous or short utterances	Up to hundreds of hours
Output	Streaming text results	JSON/text files with timestamps

Your task: Consider these scenarios and decide which mode fits:

A doctor dictating patient notes during an appointment → Real-time
A company processing 1,000 recorded customer service calls → Batch
Adding subtitles to a live webinar → Real-time
Transcribing a library of podcast episodes → Batch
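
The reasoning behind these answers can be written as a simple rule of thumb: if the audio is still being spoken, or someone is waiting on the text right now, use real-time; otherwise use batch. The sketch below encodes that rule in Python. The Scenario class and choose_mode function are hypothetical helpers for this exercise, not an official Azure decision rule.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    description: str
    audio_is_live: bool         # speech is still being spoken while we transcribe
    needs_immediate_text: bool  # a person or app is waiting on the words right now


def choose_mode(scenario: Scenario) -> str:
    """Apply the rule of thumb from the comparison table."""
    if scenario.audio_is_live or scenario.needs_immediate_text:
        return "real-time"
    return "batch"


scenarios = [
    Scenario("Doctor dictating patient notes during an appointment", True, True),
    Scenario("1,000 recorded customer service calls", False, False),
    Scenario("Subtitles for a live webinar", True, True),
    Scenario("Library of podcast episodes", False, False),
]

for s in scenarios:
    print(f"{s.description}: {choose_mode(s)}")
```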

Azure CLI Alternative

# Create an Azure AI Speech resource (Free tier)
az cognitiveservices account create \
  --name my-speech-resource \
  --resource-group myResourceGroup \
  --kind SpeechServices \
  --sku F0 \
  --location eastus

# List the access keys for the Speech resource
az cognitiveservices account keys list \
  --name my-speech-resource \
  --resource-group myResourceGroup
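
Once you have a key and region, you can call text-to-speech over REST. The Python sketch below only builds the request and does not send it, so it runs without a real resource. It assumes the regional endpoint pattern and header names from the text-to-speech REST API; the key placeholder, the User-Agent value, and the function name build_tts_request are illustrative. Confirm the endpoint and output format against the current documentation before using it.

```python
import urllib.request


def build_tts_request(region, key, ssml, output_format="riff-24khz-16bit-mono-pcm"):
    """Prepare (but do not send) a text-to-speech REST request."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,          # one of the keys listed by the CLI
        "Content-Type": "application/ssml+xml",    # the request body is SSML
        "X-Microsoft-OutputFormat": output_format, # audio format of the response
        "User-Agent": "speech-challenge-sketch",   # illustrative client name
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


ssml = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    '<voice name="en-US-JennyNeural">Welcome to Azure AI Speech services.</voice>'
    "</speak>"
)

request = build_tts_request("eastus", "<your-speech-key>", ssml)
print(request.get_method(), request.full_url)

# To send it with a real key, uncomment the lines below:
# with urllib.request.urlopen(request) as response:
#     with open("welcome.wav", "wb") as audio_file:
#         audio_file.write(response.read())
```

Keep the key out of source control; reading it from an environment variable is a common choice.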

Key Concepts

Concept	Definition
Speech-to-text (STT)	Converts spoken audio into written text (also called speech recognition)
Text-to-speech (TTS)	Converts written text into natural-sounding spoken audio (also called speech synthesis)
Neural voice	AI-generated voice that uses deep neural networks for natural-sounding speech
SSML	Speech Synthesis Markup Language — XML-based format for controlling speech output
Speaker diarization	Identifying and labeling different speakers in an audio recording
Custom Speech	Training a speech recognition model on domain-specific vocabulary or acoustic conditions

Common Misconceptions

Misconception	Reality
Speech-to-text requires silent, studio-quality conditions	Modern models handle background noise, accents, and natural speech patterns well
Text-to-speech always sounds robotic	Neural voices are nearly indistinguishable from human speech in many cases
You need a custom model for basic transcription	The pre-built models work well for general speech; custom models are for specialized vocabulary
Speech services only work in English	Azure AI Speech supports 100+ languages and dialects for STT and 140+ languages and variants for TTS
Real-time transcription is always better than batch	Batch is the better fit for large volumes of pre-recorded audio, where results are not needed immediately and are delivered as files with timestamps

Knowledge Check

1. A call center wants to transcribe thousands of recorded customer calls to analyze them later. Which speech capability should they use?

2. What technology makes modern text-to-speech voices sound natural and human-like?

3. Which feature of speech-to-text identifies different speakers in a conversation?

4. A hospital needs speech recognition that accurately transcribes medical terminology like drug names and procedures. What should they use?

5. What is SSML used for in Azure AI Speech?
