Challenge 34: Speech-to-Text

Estimated Time

50 min | Cost: $2-5 (estimated) | Domain: Implement NLP Solutions (15-20%)

Exam skills covered

  • Implement speech-to-text transcription
  • Configure real-time and batch transcription
  • Implement custom speech models for domain-specific vocabulary

Overview

Azure Speech service provides speech-to-text (STT) capabilities:

Mode | Description | Use Case
Real-time | Continuous recognition from mic/stream | Live captions, voice commands
Batch | Async transcription of audio files | Meeting recordings, call centers
Custom Speech | Models trained on your vocabulary | Medical, legal, technical domains

Key classes: SpeechConfig, SpeechRecognizer, AudioConfig

Prerequisites

  • Azure subscription
  • Azure Speech resource
  • Python 3.9+ or .NET 8
  • Package: azure-cognitiveservices-speech (v1.38+)
  • Microphone (for real-time) or audio file (.wav)

Implementation

Task 1: Create Speech Resource

az group create --name rg-ai102-speech --location eastus2

az cognitiveservices account create \
  --name speech-ai102 \
  --resource-group rg-ai102-speech \
  --kind SpeechServices \
  --sku S0 \
  --location eastus2

# Export the key and region under the names the Python tasks read
export AZURE_SPEECH_KEY=$(az cognitiveservices account keys list --name speech-ai102 --resource-group rg-ai102-speech --query key1 -o tsv)
export AZURE_SPEECH_REGION="eastus2"
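The Python tasks read the key and region from the AZURE_SPEECH_KEY and AZURE_SPEECH_REGION environment variables. A minimal sketch of a loader that fails early with a readable message when either is missing; SpeechSettings and load_speech_settings are hypothetical helper names, not part of the Speech SDK:

```python
import os
from typing import Mapping, NamedTuple


class SpeechSettings(NamedTuple):
    """Hypothetical container for the two values SpeechConfig needs."""
    key: str
    region: str


def load_speech_settings(env: Mapping[str, str] = os.environ) -> SpeechSettings:
    """Read the Speech key and region, raising if either is unset or blank."""
    names = ("AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION")
    missing = [name for name in names if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return SpeechSettings(
        key=env["AZURE_SPEECH_KEY"].strip(),
        region=env["AZURE_SPEECH_REGION"].strip(),
    )
```

The returned values could then be passed as subscription= and region= when constructing SpeechConfig.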

Task 2: Real-Time Speech Recognition

import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)
speech_config.speech_recognition_language = "en-US"

# Recognize from an audio file (omit audio_config to use the default microphone)
audio_config = speechsdk.audio.AudioConfig(filename="meeting-recording.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# Single-utterance recognition: stops at the first pause in speech
print("Recognizing from file...")
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"Recognized: {result.text}")
    # duration is reported in 100-nanosecond ticks
    print(f"Duration: {result.duration / 10_000_000:.2f} seconds")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print(f"No speech recognized: {result.no_match_details}")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation = result.cancellation_details
    print(f"Canceled: {cancellation.reason}")
    if cancellation.reason == speechsdk.CancellationReason.Error:
        print(f"Error: {cancellation.error_details}")
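The duration above is divided by 10,000,000 because the SDK reports result.duration and result.offset in ticks of 100 nanoseconds. A small sketch of that conversion; ticks_to_seconds and format_timestamp are hypothetical helper names, not SDK functions:

```python
TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds
TICKS_PER_MILLISECOND = 10_000


def ticks_to_seconds(ticks: int) -> float:
    """Convert a tick count (100 ns units) to seconds."""
    return ticks / TICKS_PER_SECOND


def format_timestamp(ticks: int) -> str:
    """Render a tick count as MM:SS.mmm for captions or logs."""
    total_ms = round(ticks / TICKS_PER_MILLISECOND)
    minutes, remainder_ms = divmod(total_ms, 60_000)
    seconds, ms = divmod(remainder_ms, 1_000)
    return f"{minutes:02d}:{seconds:02d}.{ms:03d}"


print(ticks_to_seconds(34_500_000))   # 3.45
print(format_timestamp(725_000_000))  # 01:12.500
```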

Task 3: Continuous Recognition (Full Meeting Transcription)

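import os
import threading

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)
speech_config.speech_recognition_language = "en-US"
# Note: this property only affects intermediate results from ConversationTranscriber
# (and needs an SDK version that defines it). SpeechRecognizer returns no speaker labels.
speech_config.set_property(
    speechsdk.PropertyId.SpeechServiceResponse_DiarizeIntermediateResults, "true"
)

audio_config = speechsdk.audio.AudioConfig(filename="long-meeting.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

transcript = []
done = threading.Event()

def recognized_handler(evt):
    # Fires once per finalized phrase; offset is in 100-nanosecond ticks
    if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
        transcript.append(evt.result.text)
        print(f" [{evt.result.offset / 10_000_000:.1f}s] {evt.result.text}")

def session_stopped_handler(evt):
    done.set()

def canceled_handler(evt):
    # Reaching the end of the file also arrives as a cancellation (EndOfStream);
    # only real errors are reported here
    details = evt.cancellation_details
    if details.reason == speechsdk.CancellationReason.Error:
        print(f"Canceled: {details.reason}")
        print(f"Error: {details.error_details}")
    done.set()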

# Connect event handlers
recognizer.recognized.connect(recognized_handler)
recognizer.session_stopped.connect(session_stopped_handler)
recognizer.canceled.connect(canceled_handler)

# Start continuous recognition
print("Starting continuous recognition...")
recognizer.start_continuous_recognition()
done.wait()
recognizer.stop_continuous_recognition()

# Full transcript
print(f"\n{'='*50}")
print(f"Full transcript ({len(transcript)} segments):")
print(" ".join(transcript))
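The handler above keeps only the text of each segment. If the offsets are stored as well, the transcript can be rendered with timestamps after recognition ends. A minimal sketch, assuming each recognized event is saved as a (offset in ticks, text) pair; Segment and render_transcript are hypothetical names introduced for illustration:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Segment:
    """One recognized phrase: start offset in 100 ns ticks plus its text."""
    offset_ticks: int
    text: str


def render_transcript(segments: Iterable[Segment]) -> str:
    """Sort segments by start time and format one timestamped line per phrase."""
    ordered = sorted(segments, key=lambda segment: segment.offset_ticks)
    lines = [
        f"[{segment.offset_ticks / 10_000_000:.1f}s] {segment.text}"
        for segment in ordered
        if segment.text.strip()  # skip empty results
    ]
    return "\n".join(lines)


segments = [
    Segment(42_000_000, "Today we'll discuss our progress on key initiatives."),
    Segment(5_000_000, "Welcome to the quarterly business review meeting."),
]
print(render_transcript(segments))
```

Inside recognized_handler, the equivalent of transcript.append(evt.result.text) would be appending Segment(evt.result.offset, evt.result.text).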

Task 4: Batch Transcription API

SPEECH_KEY="<your-key>"
REGION="eastus2"

# Create batch transcription job
curl -s "https://${REGION}.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions" \
  -H "Ocp-Apim-Subscription-Key: ${SPEECH_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "contentUrls": [
      "https://storage.blob.core.windows.net/audio/meeting1.wav?sv=...&sig=..."
    ],
    "locale": "en-US",
    "displayName": "Meeting Transcription",
    "properties": {
      "wordLevelTimestampsEnabled": true,
      "diarizationEnabled": true,
      "diarization": { "speakers": { "minCount": 1, "maxCount": 5 } },
      "punctuationMode": "DictatedAndAutomatic"
    }
  }' | jq '{id: .self, status: .status}'

# Check status (replace <id> with the transcription ID at the end of the "self" URL returned above)
curl -s "https://${REGION}.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions/<id>" \
  -H "Ocp-Apim-Subscription-Key: ${SPEECH_KEY}" | jq '.status'

# Get results (available once the status is "Succeeded")
curl -s "https://${REGION}.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions/<id>/files" \
  -H "Ocp-Apim-Subscription-Key: ${SPEECH_KEY}" | jq '.values[].links.contentUrl'
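Batch jobs run asynchronously, so the status check is usually repeated until the job reports Succeeded or Failed. A minimal polling sketch using only the standard library; fetch_status and wait_for_transcription are hypothetical helper names, and the 30-second interval and one-hour timeout are arbitrary assumptions rather than service recommendations:

```python
import json
import time
import urllib.request
from typing import Callable

TERMINAL_STATES = {"Succeeded", "Failed"}


def fetch_status(transcription_url: str, key: str) -> str:
    """GET the transcription resource (its "self" URL) and return its status."""
    request = urllib.request.Request(
        transcription_url, headers={"Ocp-Apim-Subscription-Key": key}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["status"]


def wait_for_transcription(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status until it returns a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Transcription still {status} after {timeout_seconds}s")
        sleep(poll_seconds)
```

Usage would look like wait_for_transcription(lambda: fetch_status(self_url, key)), where self_url is the "self" value returned when the job was created. Passing the status function in as an argument keeps the loop testable without network access.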

Expected Output

Recognizing from file...
Recognized: Welcome to the quarterly business review meeting.
Duration: 3.45 seconds

Starting continuous recognition...
[0.5s] Welcome to the quarterly business review meeting.
[4.2s] Today we'll discuss our progress on key initiatives.
[8.1s] Let's start with the revenue numbers from last quarter.
[12.5s] We exceeded our target by fifteen percent.

==================================================
Full transcript (4 segments):
Welcome to the quarterly business review meeting. Today we'll discuss our progress on key initiatives. Let's start with the revenue numbers from last quarter. We exceeded our target by fifteen percent.

Break & fix

Scenario | Symptom | Root Cause | Fix
NoMatch result | No speech recognized | Audio is silence, wrong format, or wrong language | Verify WAV format (16 kHz, 16-bit, mono PCM); check language setting
Canceled with auth error | 401 Unauthorized | Wrong key or region | Verify key matches region; check resource is active
Truncated recognition | Only first sentence | Used recognize_once instead of continuous | Use start_continuous_recognition for long audio
Missing words | Incomplete transcript | Domain-specific vocabulary | Train Custom Speech model with your terminology
High latency | Slow results | Network or large audio chunks | Use streaming/push audio; check network connectivity
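The WAV check from the first row can be done locally before any audio is sent to the service. A minimal sketch using Python's standard wave module; check_wav_format is a hypothetical helper name, and its defaults mirror the 16 kHz, 16-bit, mono format listed in the table:

```python
import wave
from typing import List


def check_wav_format(
    path: str,
    expected_rate: int = 16_000,
    expected_width_bytes: int = 2,  # 16-bit samples
    expected_channels: int = 1,     # mono
) -> List[str]:
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    # wave.open raises wave.Error for files that are not uncompressed PCM WAV
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {expected_rate}")
        if wav.getsampwidth() != expected_width_bytes:
            problems.append(f"sample width is {wav.getsampwidth() * 8}-bit, expected {expected_width_bytes * 8}-bit")
        if wav.getnchannels() != expected_channels:
            problems.append(f"channel count is {wav.getnchannels()}, expected {expected_channels}")
    return problems
```

Calling check_wav_format("meeting-recording.wav") and printing any returned problems gives a quick first diagnosis when recognition returns NoMatch.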

Knowledge Check

1. What is the difference between recognize_once and continuous recognition?

2. What audio format does the Speech SDK expect for file input?

3. When should you use batch transcription instead of real-time recognition?

4. What does diarization provide in speech-to-text?

5. How do you handle the CancellationReason.Error in speech recognition?

Cleanup

az group delete --name rg-ai102-speech --yes --no-wait
