Challenge 35: Text-to-Speech and SSML
Estimated Time
45 min | Cost: $2-5 (estimated) | Domain: Implement NLP Solutions (15-20%)
Exam skills covered
- Implement text-to-speech synthesis
- Improve speech output with SSML (Speech Synthesis Markup Language)
- Configure voice selection and audio output formats
Overview
Azure Text-to-Speech (TTS) converts text into natural-sounding audio:
| Feature | Description |
|---|---|
| Neural voices | AI-generated voices (400+ across 140 languages) |
| SSML | XML markup for controlling prosody, emphasis, pauses |
| Audio formats | WAV, MP3, OGG, raw PCM |
| Viseme | Mouth position data for avatar animation |
| Custom Neural Voice | Train a unique voice (requires approval) |
SSML elements: <speak>, <voice>, <prosody>, <emphasis>, <break>, <say-as>, <phoneme>
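The elements listed above nest inside one another, and any text placed in them must be XML-escaped. A minimal sketch of assembling such a document with the Python standard library is shown here; `build_ssml` and its parameters are hypothetical names for illustration, not part of the Speech SDK.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US", rate=None, pause_ms=0):
    """Assemble a small SSML document (hypothetical helper, not an SDK call)."""
    body = escape(text)  # escapes &, < and > so the text cannot break the XML
    if rate is not None:
        body = f"<prosody rate={quoteattr(rate)}>{body}</prosody>"
    if pause_ms > 0:
        body += f'<break time="{int(pause_ms)}ms"/>'
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{body}</voice></speak>"
    )


ssml = build_ssml("Terms & conditions apply: price < $5.", rate="-10%", pause_ms=300)
ET.fromstring(ssml)  # raises ParseError if the result is not well-formed
print(ssml)
```

The resulting string could then be passed to an SSML synthesis call such as the one in Task 2.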
Prerequisites
- Azure subscription
- Azure Speech resource
- Python 3.9+ or .NET 8
- Package: azure-cognitiveservices-speech (v1.38+)
- Audio output device (speaker) or file output
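The listings in the tasks read the key and region from the `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` environment variables. A small pre-flight check, sketched here with a hypothetical `missing_settings` helper, can report unset variables before any SDK call fails with a `KeyError`.

```python
import os

# Variable names taken from the Task 1 listing
REQUIRED_VARS = ("AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION")


def missing_settings(env=os.environ, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_settings()
if missing:
    print("Set these environment variables first:", ", ".join(missing))
else:
    print("Speech key and region are configured.")
```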
Implementation
Task 1: Basic Text-to-Speech
The Python SDK listing comes first, followed by the C# SDK equivalent.
import os
import azure.cognitiveservices.speech as speechsdk
# Read the key and region from the environment rather than hard-coding them
speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"],
)
# Set voice
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
# Set output format
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Audio16Khz32KBitRateMonoMp3
)
# Synthesize to audio file
audio_config = speechsdk.audio.AudioOutputConfig(filename="output.mp3")
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=audio_config,
)
text = "Welcome to Azure AI Services. Today we'll explore text-to-speech capabilities."
result = synthesizer.speak_text_async(text).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized successfully!")
    # audio_duration is a datetime.timedelta in the Python SDK, not a tick count
    print(f"Audio duration: {result.audio_duration.total_seconds():.2f} seconds")
    print(f"Audio length: {len(result.audio_data)} bytes")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation = result.cancellation_details
    print(f"Synthesis canceled: {cancellation.reason}")
    print(f"Error: {cancellation.error_details}")
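using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio; // AudioConfig lives in this namespace
var speechConfig = SpeechConfig.FromSubscription(
    Environment.GetEnvironmentVariable("AZURE_SPEECH_KEY"),
    Environment.GetEnvironmentVariable("AZURE_SPEECH_REGION"));
speechConfig.SpeechSynthesisVoiceName = "en-US-JennyNeural";
using var audioConfig = AudioConfig.FromWavFileOutput("output.wav");
using var synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
var result = await synthesizer.SpeakTextAsync("Welcome to Azure AI text-to-speech.");
if (result.Reason == ResultReason.SynthesizingAudioCompleted)
{
    Console.WriteLine($"Audio synthesized: {result.AudioData.Length} bytes");
}
else if (result.Reason == ResultReason.Canceled)
{
    var cancellation = SpeechSynthesisCancellationDetails.FromResult(result);
    Console.WriteLine($"Synthesis canceled: {cancellation.Reason}");
    Console.WriteLine($"Error: {cancellation.ErrorDetails}");
}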
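Duration values can arrive in different shapes depending on the SDK. To the best of my knowledge the Python SDK exposes `audio_duration` as a `datetime.timedelta`, while .NET `TimeSpan.Ticks` counts 100-nanosecond units. The sketch below normalizes both to seconds; `duration_seconds` is a hypothetical helper, not an SDK function.

```python
from datetime import timedelta

TICKS_PER_SECOND = 10_000_000  # one tick = 100 nanoseconds


def duration_seconds(value):
    """Convert a timedelta or an integer tick count to seconds."""
    if isinstance(value, timedelta):
        return value.total_seconds()
    return value / TICKS_PER_SECOND


print(f"{duration_seconds(timedelta(milliseconds=4820)):.2f} seconds")  # 4.82 seconds
print(f"{duration_seconds(48_200_000):.2f} seconds")                    # 4.82 seconds
```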
Task 2: SSML for Advanced Speech Control
The Python SDK listing comes first, followed by the C# SDK equivalent.
# Synthesize using SSML for fine control
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <prosody rate="-10%" pitch="+5%">
            Welcome to the Azure AI certification prep.
        </prosody>
        <break time="500ms"/>
        <emphasis level="strong">
            This is an important concept.
        </emphasis>
        <break time="300ms"/>
        <prosody volume="soft" rate="slow">
            Let me explain it step by step.
        </prosody>
        <break time="500ms"/>
        <!-- Pronunciation control -->
        The API version is <say-as interpret-as="characters">3.0</say-as>.
        <break time="200ms"/>
        <!-- Date and number formatting -->
        The release date is <say-as interpret-as="date" format="mdy">01/15/2024</say-as>.
        The cost is <say-as interpret-as="currency" language="en-US">$2.50</say-as>.
    </voice>
</speak>
"""
# Use SSML synthesis
audio_config = speechsdk.audio.AudioOutputConfig(filename="ssml-output.mp3")
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=audio_config
)
result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print(f"SSML synthesis complete: {len(result.audio_data)} bytes")
    # audio_duration is a datetime.timedelta in the Python SDK
    print(f"Duration: {result.audio_duration.total_seconds():.2f}s")
string ssml = @"
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'
       xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'>
    <voice name='en-US-JennyNeural'>
        <prosody rate='-10%' pitch='+5%'>
            Welcome to Azure AI certification prep.
        </prosody>
        <break time='500ms'/>
        <emphasis level='strong'>This is important.</emphasis>
    </voice>
</speak>";
var result = await synthesizer.SpeakSsmlAsync(ssml);
Console.WriteLine($"SSML result: {result.Reason}, {result.AudioData.Length} bytes");
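Malformed SSML only surfaces as a canceled synthesis, so it can help to run a local check before sending the request. The sketch below uses the standard-library XML parser; `check_ssml` and its rules are hypothetical and far looser than the service's own validation.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def check_ssml(ssml):
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        root = ET.fromstring(ssml.strip())
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    problems = []
    if local_name(root.tag) != "speak":
        problems.append("root element should be <speak>")
    if root.get("version") != "1.0":
        problems.append('<speak> should carry version="1.0"')
    voices = [el for el in root.iter() if local_name(el.tag) == "voice"]
    if not voices:
        problems.append("no <voice> element found")
    problems += ["<voice> is missing a name attribute" for v in voices if not v.get("name")]
    return problems


good = '<speak version="1.0" xml:lang="en-US"><voice name="en-US-JennyNeural">Hello</voice></speak>'
bad = '<speak version="1.0"><voice name="en-US-JennyNeural">Hello</speak>'
print(check_ssml(good))  # []
print(check_ssml(bad))   # a single "not well-formed XML" entry
```

Passing this check does not guarantee the service accepts the document; it only catches structural mistakes such as unclosed tags early.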
Task 3: List Available Voices
The Python SDK listing comes first, followed by the REST API equivalent using curl.
# List available en-US voices (an empty locale string returns every voice)
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
voices_result = synthesizer.get_voices_async("en-US").get()
if voices_result.reason == speechsdk.ResultReason.VoicesListRetrieved:
    print(f"Available en-US voices ({len(voices_result.voices)}):")
    for voice in voices_result.voices[:10]:
        print(f"  {voice.short_name}: {voice.local_name} "
              f"({voice.gender.name}, {voice.voice_type.name})")
        if voice.style_list:
            print(f"    Styles: {', '.join(voice.style_list)}")
SPEECH_KEY="<key>"
REGION="eastus2"
# Get access token (the empty body makes curl send Content-Length: 0)
TOKEN=$(curl -s -X POST "https://${REGION}.api.cognitive.microsoft.com/sts/v1.0/issueToken" \
    -H "Ocp-Apim-Subscription-Key: ${SPEECH_KEY}" -d "")
# List voices
curl -s "https://${REGION}.tts.speech.microsoft.com/cognitiveservices/voices/list" \
    -H "Authorization: Bearer ${TOKEN}" \
    | jq '[.[] | select(.Locale=="en-US")] | .[0:5] | .[] | {ShortName, Gender, VoiceType}'
# Synthesize with SSML
curl -s "https://${REGION}.tts.speech.microsoft.com/cognitiveservices/v1" \
    -H "Authorization: Bearer ${TOKEN}" \
    -H "Content-Type: application/ssml+xml" \
    -H "X-Microsoft-OutputFormat: audio-16khz-32kbitrate-mono-mp3" \
    -d '<speak version="1.0" xml:lang="en-US"><voice name="en-US-JennyNeural">Hello world</voice></speak>' \
    --output output.mp3
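The voices list endpoint returns a JSON array, so the same filtering that `jq` performs can be done in Python. The sketch below works on a hand-written sample that mimics a few fields of that response (`ShortName`, `Locale`, `Gender`, `VoiceType`, `StyleList`); the sample entries and the `find_voices` helper are illustrative assumptions, not real service output.

```python
import json

# Illustrative sample shaped like the voices/list response; values are made up.
SAMPLE = json.loads("""
[
  {"ShortName": "en-US-JennyNeural", "Locale": "en-US", "Gender": "Female",
   "VoiceType": "Neural", "StyleList": ["chat", "cheerful"]},
  {"ShortName": "en-US-GuyNeural", "Locale": "en-US", "Gender": "Male",
   "VoiceType": "Neural", "StyleList": ["newscast"]},
  {"ShortName": "de-DE-KatjaNeural", "Locale": "de-DE", "Gender": "Female",
   "VoiceType": "Neural"}
]
""")


def find_voices(voices, locale=None, style=None):
    """Return short names of voices matching an optional locale and style."""
    matches = []
    for voice in voices:
        if locale and voice.get("Locale") != locale:
            continue
        # Voices without styles may omit StyleList entirely
        if style and style not in voice.get("StyleList", []):
            continue
        matches.append(voice["ShortName"])
    return matches


print(find_voices(SAMPLE, locale="en-US"))
print(find_voices(SAMPLE, style="cheerful"))
```

With a live response, the parsed body of the list request would take the place of `SAMPLE`.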
Expected Output
Speech synthesized successfully!
Audio duration: 4.82 seconds
Audio length: 77120 bytes
SSML synthesis complete: 98304 bytes
Duration: 8.15s
Available en-US voices (148):
  en-US-JennyNeural: Jenny (Female, OnlineNeural)
    Styles: assistant, chat, customerservice, newscast, angry, cheerful, sad
  en-US-GuyNeural: Guy (Male, OnlineNeural)
    Styles: newscast, angry, cheerful, sad
  en-US-AriaNeural: Aria (Female, OnlineNeural)
    Styles: chat, customerservice, narration-professional
Break & fix
| Scenario | Symptom | Root Cause | Fix |
|---|---|---|---|
| No audio output | 0 bytes generated | Invalid voice name | Use exact short_name from voices list |
| SSML parse error | Synthesis canceled | Malformed XML | Validate SSML structure; check namespace URIs |
| Voice not found | Cancellation error | Voice not available in region | Check voice availability per region |
| Audio quality poor | Robotic sound | Using old standard voices | Switch to Neural voices (*Neural suffix) |
| Large file size | Excessive audio bytes | Wrong output format | Use compressed format (MP3/OGG) instead of raw PCM |
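The file-size row comes down to simple arithmetic: uncompressed PCM costs sample rate × bytes per sample × channels per second, while a constant-bitrate format costs bitrate ÷ 8. A rough sketch follows, assuming 16 kHz 16-bit mono PCM versus a 32 kbit/s stream and ignoring container headers and frame padding; both function names are hypothetical.

```python
def pcm_bytes(seconds, sample_rate_hz=16_000, bits_per_sample=16, channels=1):
    """Approximate size of uncompressed PCM audio, without any file header."""
    return round(seconds * sample_rate_hz * (bits_per_sample / 8) * channels)


def constant_bitrate_bytes(seconds, kbit_per_second=32):
    """Approximate size of a constant-bitrate stream (1 kbit = 1000 bits)."""
    return round(seconds * kbit_per_second * 1000 / 8)


for seconds in (10, 60):
    raw = pcm_bytes(seconds)
    mp3 = constant_bitrate_bytes(seconds)
    print(f"{seconds:>3}s  PCM ~{raw:,} bytes  MP3 ~{mp3:,} bytes  ({raw / mp3:.0f}x smaller)")
```

Under these assumptions the compressed output is about eight times smaller than raw PCM.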
Knowledge Check
1. What is the purpose of the <prosody> SSML element?
2. Which method do you use to synthesize speech from SSML?
3. What does the <say-as> SSML element do?
4. What audio output formats are available for text-to-speech?
5. How do you insert a pause in synthesized speech?
Cleanup
az group delete --name rg-ai102-speech --yes --no-wait