Challenge 34: Speech-to-Text
Estimated Time
50 min | Cost: $2-5 (estimated) | Domain: Implement NLP Solutions (15-20%)
Exam skills covered
- Implement speech-to-text transcription
- Configure real-time and batch transcription
- Implement custom speech models for domain-specific vocabulary
Overview
Azure Speech service offers three speech-to-text (STT) options:
| Mode | Description | Use Case |
|---|---|---|
| Real-time | Continuous recognition from mic/stream | Live captions, voice commands |
| Batch | Async transcription of audio files | Meeting recordings, call centers |
| Custom Speech | Models trained on your vocabulary | Medical, legal, technical domains |
Key classes: SpeechConfig, SpeechRecognizer, AudioConfig
Prerequisites
- Azure subscription
- Azure Speech resource
- Python 3.9+ or .NET 8
- Package: azure-cognitiveservices-speech (v1.38+)
- Microphone (for real-time) or audio file (.wav)
Implementation
Task 1: Create Speech Resource
az group create --name rg-ai102-speech --location eastus2
az cognitiveservices account create \
  --name speech-ai102 \
  --resource-group rg-ai102-speech \
  --kind SpeechServices \
  --sku S0 \
  --location eastus2
# Export the key and region under the names the SDK samples in Tasks 2 and 3 read
export AZURE_SPEECH_KEY=$(az cognitiveservices account keys list --name speech-ai102 --resource-group rg-ai102-speech --query key1 -o tsv)
export AZURE_SPEECH_REGION="eastus2"
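The SDK samples in the following tasks read the key and region from the AZURE_SPEECH_KEY and AZURE_SPEECH_REGION environment variables, and a missing variable otherwise surfaces as a bare KeyError. A minimal sketch of a helper that fails early with a readable message is shown here. The names `SpeechSettings` and `load_speech_settings` are hypothetical and not part of the Speech SDK.

```python
import os
from typing import Mapping, NamedTuple


class SpeechSettings(NamedTuple):
    """Key and region of the Speech resource (hypothetical helper type)."""
    key: str
    region: str


def load_speech_settings(env: Mapping[str, str] = os.environ) -> SpeechSettings:
    """Read the Speech key and region, raising a clear error if either is unset."""
    required = ("AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION")
    # Treat empty or whitespace-only values as missing
    missing = [name for name in required if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return SpeechSettings(
        key=env["AZURE_SPEECH_KEY"].strip(),
        region=env["AZURE_SPEECH_REGION"].strip(),
    )
```

The returned values could then be passed to `speechsdk.SpeechConfig(subscription=settings.key, region=settings.region)`.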
Task 2: Real-Time Speech Recognition
- Python SDK
import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)
speech_config.speech_recognition_language = "en-US"

# Option 1: Recognize from audio file
audio_config = speechsdk.audio.AudioConfig(filename="meeting-recording.wav")
# Option 2: Recognize from the default microphone instead
# audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# Single utterance recognition: returns after the first utterance
print("Recognizing from file...")
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"Recognized: {result.text}")
    # duration is reported in 100-nanosecond ticks
    print(f"Duration: {result.duration / 10_000_000:.2f} seconds")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print(f"No speech recognized: {result.no_match_details}")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation = result.cancellation_details
    print(f"Canceled: {cancellation.reason}")
    if cancellation.reason == speechsdk.CancellationReason.Error:
        print(f"Error: {cancellation.error_details}")
- C# SDK
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

var speechConfig = SpeechConfig.FromSubscription(
    Environment.GetEnvironmentVariable("AZURE_SPEECH_KEY"),
    Environment.GetEnvironmentVariable("AZURE_SPEECH_REGION"));
speechConfig.SpeechRecognitionLanguage = "en-US";

using var audioConfig = AudioConfig.FromWavFileInput("meeting-recording.wav");
using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

var result = await recognizer.RecognizeOnceAsync();
switch (result.Reason)
{
    case ResultReason.RecognizedSpeech:
        Console.WriteLine($"Recognized: {result.Text}");
        break;
    case ResultReason.NoMatch:
        Console.WriteLine("No speech recognized.");
        break;
    case ResultReason.Canceled:
        var cancellation = CancellationDetails.FromResult(result);
        Console.WriteLine($"Canceled: {cancellation.Reason}, Error: {cancellation.ErrorDetails}");
        break;
}
Task 3: Continuous Recognition (Full Meeting Transcription)
- Python SDK
import os
import threading
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)
speech_config.speech_recognition_language = "en-US"
# Note: this property targets ConversationTranscriber; SpeechRecognizer results carry no speaker labels
speech_config.set_property(
    speechsdk.PropertyId.SpeechServiceResponse_DiarizeIntermediateResults, "true"
)

audio_config = speechsdk.audio.AudioConfig(filename="long-meeting.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

transcript = []
done = threading.Event()

def recognized_handler(evt):
    # Fired once per finalized utterance; offset is in 100-nanosecond ticks
    if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
        transcript.append(evt.result.text)
        print(f" [{evt.result.offset / 10_000_000:.1f}s] {evt.result.text}")

def session_stopped_handler(evt):
    done.set()

def canceled_handler(evt):
    # Also fired with reason EndOfStream when the file has been fully read
    print(f"Canceled: {evt.cancellation_details.reason}")
    if evt.cancellation_details.reason == speechsdk.CancellationReason.Error:
        print(f"Error: {evt.cancellation_details.error_details}")
    done.set()

# Connect event handlers
recognizer.recognized.connect(recognized_handler)
recognizer.session_stopped.connect(session_stopped_handler)
recognizer.canceled.connect(canceled_handler)

# Start continuous recognition and block until the session ends
print("Starting continuous recognition...")
recognizer.start_continuous_recognition()
done.wait()
recognizer.stop_continuous_recognition()

# Full transcript
print(f"\n{'='*50}")
print(f"Full transcript ({len(transcript)} segments):")
print(" ".join(transcript))
- C# SDK
// Reuses the usings and speechConfig from Task 2
using var audioConfig = AudioConfig.FromWavFileInput("long-meeting.wav");
using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

var transcript = new List<string>();
var stopRecognition = new TaskCompletionSource<int>();

recognizer.Recognized += (s, e) =>
{
    if (e.Result.Reason == ResultReason.RecognizedSpeech)
    {
        transcript.Add(e.Result.Text);
        // OffsetInTicks is in 100-nanosecond ticks, the same unit as TimeSpan ticks
        var offsetSeconds = TimeSpan.FromTicks(e.Result.OffsetInTicks).TotalSeconds;
        Console.WriteLine($" [{offsetSeconds:F1}s] {e.Result.Text}");
    }
};
recognizer.SessionStopped += (s, e) => stopRecognition.TrySetResult(0);
recognizer.Canceled += (s, e) => stopRecognition.TrySetResult(0);

await recognizer.StartContinuousRecognitionAsync();
await stopRecognition.Task;
await recognizer.StopContinuousRecognitionAsync();

Console.WriteLine($"\nFull transcript ({transcript.Count} segments):");
Console.WriteLine(string.Join(" ", transcript));
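The SDK reports offsets and durations in ticks of 100 nanoseconds, which is why the samples above divide by 10_000_000 to get seconds. For caption-style output it can help to turn those ticks into HH:MM:SS.mmm timestamps. The following is a small sketch of that conversion; the helper names `ticks_to_seconds`, `format_timestamp`, and `format_segments` are hypothetical, and the segments are assumed to be (offset in ticks, text) pairs collected in the recognized handler.

```python
from typing import Iterable, Tuple

TICKS_PER_SECOND = 10_000_000   # one tick is 100 nanoseconds
TICKS_PER_MILLISECOND = 10_000


def ticks_to_seconds(ticks: int) -> float:
    """Convert a tick count to seconds."""
    return ticks / TICKS_PER_SECOND


def format_timestamp(ticks: int) -> str:
    """Format a tick offset as HH:MM:SS.mmm (sub-millisecond part is dropped)."""
    total_ms = int(ticks) // TICKS_PER_MILLISECOND
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"


def format_segments(segments: Iterable[Tuple[int, str]]) -> str:
    """Render (offset_ticks, text) pairs as one timestamped line per segment."""
    return "\n".join(f"[{format_timestamp(offset)}] {text}" for offset, text in segments)


if __name__ == "__main__":
    # Illustrative values only, not real service output
    print(format_segments([(5_000_000, "Welcome."), (42_000_000, "Let's begin.")]))
```

Storing `(evt.result.offset, evt.result.text)` instead of only the text keeps the timing information available after recognition ends.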
Task 4: Batch Transcription API
- REST API
SPEECH_KEY="<your-key>"
REGION="eastus2"
# Create batch transcription job
curl -s "https://${REGION}.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions" \
  -H "Ocp-Apim-Subscription-Key: ${SPEECH_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "contentUrls": [
      "https://storage.blob.core.windows.net/audio/meeting1.wav?sv=...&sig=..."
    ],
    "locale": "en-US",
    "displayName": "Meeting Transcription",
    "properties": {
      "wordLevelTimestampsEnabled": true,
      "diarizationEnabled": true,
      "diarization": { "speakers": { "minCount": 1, "maxCount": 5 } },
      "punctuationMode": "DictatedAndAutomatic"
    }
  }' | jq '{id: .self, status: .status}'
# Check status (replace <id> with the transcription ID at the end of the URL returned above)
curl -s "https://${REGION}.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions/<id>" \
  -H "Ocp-Apim-Subscription-Key: ${SPEECH_KEY}" | jq '.status'
# Get results once the status is Succeeded (same <id>)
curl -s "https://${REGION}.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions/<id>/files" \
  -H "Ocp-Apim-Subscription-Key: ${SPEECH_KEY}" | jq '.values[].links.contentUrl'
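Batch transcription is asynchronous, so the status call above usually has to be repeated until the job reaches a final state. The sketch below shows one way to poll in Python using only the standard library. It assumes the job reports the statuses NotStarted, Running, Succeeded, or Failed; the function names, the 30-second interval, and the one-hour timeout are illustrative choices, not service requirements.

```python
import json
import time
import urllib.request
from typing import Callable

TERMINAL_STATUSES = {"Succeeded", "Failed"}


def make_status_fetcher(transcription_url: str, speech_key: str) -> Callable[[], str]:
    """Return a function that GETs the transcription and returns its status field."""
    def fetch() -> str:
        request = urllib.request.Request(
            transcription_url,
            headers={"Ocp-Apim-Subscription-Key": speech_key},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)["status"]
    return fetch


def wait_for_transcription(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until the job succeeds or fails, or the timeout would be exceeded."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        # Stop before sleeping past the deadline
        if time.monotonic() + poll_seconds > deadline:
            raise TimeoutError(f"Transcription still '{status}' after {timeout_seconds}s")
        sleep(poll_seconds)
```

A possible call, where `transcription_url` is the `self` URL returned when the job was created: `wait_for_transcription(make_status_fetcher(transcription_url, speech_key))`. Passing the status function as a parameter keeps the polling loop testable without network access.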
Expected Output
Recognizing from file...
Recognized: Welcome to the quarterly business review meeting.
Duration: 3.45 seconds
Starting continuous recognition...
[0.5s] Welcome to the quarterly business review meeting.
[4.2s] Today we'll discuss our progress on key initiatives.
[8.1s] Let's start with the revenue numbers from last quarter.
[12.5s] We exceeded our target by fifteen percent.
==================================================
Full transcript (4 segments):
Welcome to the quarterly business review meeting. Today we'll discuss our progress on key initiatives. Let's start with the revenue numbers from last quarter. We exceeded our target by fifteen percent.
Break & fix
| Scenario | Symptom | Root Cause | Fix |
|---|---|---|---|
| NoMatch result | No speech recognized | Audio is silence, wrong format, or wrong language | Verify WAV format (16kHz, 16-bit, mono PCM); check language setting |
| Canceled with auth error | 401 Unauthorized | Wrong key or region | Verify key matches region; check resource is active |
| Truncated recognition | Only first sentence | Used recognize_once instead of continuous | Use start_continuous_recognition for long audio |
| Missing words | Incomplete transcript | Domain-specific vocabulary | Train Custom Speech model with your terminology |
| High latency | Slow results | Network or large audio chunks | Use streaming/push audio; check network connectivity |
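The WAV format check from the NoMatch row can be done locally before any audio is sent to the service. This is a minimal sketch using Python's standard `wave` module; the function name `check_wav_format` is hypothetical, and the defaults simply mirror the 16 kHz, 16-bit, mono PCM format named in the table rather than every format the service may accept.

```python
import wave
from typing import List


def check_wav_format(
    path: str,
    expected_rate: int = 16_000,
    expected_width_bytes: int = 2,   # 2 bytes per sample = 16-bit
    expected_channels: int = 1,      # mono
) -> List[str]:
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    try:
        with wave.open(path, "rb") as wav:
            if wav.getframerate() != expected_rate:
                problems.append(f"sample rate is {wav.getframerate()} Hz, expected {expected_rate}")
            if wav.getsampwidth() != expected_width_bytes:
                problems.append(f"sample width is {wav.getsampwidth() * 8}-bit, expected {expected_width_bytes * 8}-bit")
            if wav.getnchannels() != expected_channels:
                problems.append(f"channel count is {wav.getnchannels()}, expected {expected_channels}")
            if wav.getnframes() == 0:
                problems.append("file contains no audio frames")
    except (wave.Error, EOFError) as exc:
        # Raised for files the wave module cannot parse as WAV
        problems.append(f"not a readable WAV file: {exc}")
    return problems
```

An empty result does not rule out silence or a wrong language setting, so those causes still need to be checked separately.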
Knowledge Check
1. What is the difference between recognize_once and continuous recognition?
2. What audio format does the Speech SDK expect for file input?
3. When should you use batch transcription instead of real-time recognition?
4. What does diarization provide in speech-to-text?
5. How do you handle the CancellationReason.Error in speech recognition?
Cleanup
az group delete --name rg-ai102-speech --yes --no-wait