
Challenge 35: Text-to-Speech and SSML

Estimated Time

45 min | Cost: $2-5 (estimated) | Domain: Implement NLP Solutions (15-20%)

Exam skills covered

  • Implement text-to-speech synthesis
  • Improve speech output with SSML (Speech Synthesis Markup Language)
  • Configure voice selection and audio output formats

Overview

Azure Text-to-Speech (TTS) converts text into natural-sounding audio:

Feature | Description
Neural voices | AI-generated voices (400+ across 140 languages)
SSML | XML markup for controlling prosody, emphasis, and pauses
Audio formats | WAV, MP3, OGG, raw PCM
Viseme | Mouth position data for avatar animation
Custom Neural Voice | Train a unique voice (requires approval)

SSML elements: <speak>, <voice>, <prosody>, <emphasis>, <break>, <say-as>, <phoneme>
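
To make the element list concrete, here is a minimal sketch that wraps plain text in <speak>, <voice>, <prosody>, and an optional <break>. The helper name build_ssml and its parameters are hypothetical (they are not part of the Speech SDK); the sketch uses only the Python standard library and escapes characters such as & and < so the markup stays well-formed.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice="en-US-JennyNeural", rate="default",
               pitch="default", pause_ms=0, lang="en-US"):
    """Hypothetical helper: wrap text in speak/voice/prosody, escaping XML."""
    # Optional leading pause, expressed with the <break> element
    pause = f'<break time="{int(pause_ms)}ms"/>' if pause_ms > 0 else ""
    # quoteattr() returns the value already wrapped in quotes
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        f"{pause}"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}"
        f"</prosody></voice></speak>"
    )


ssml_doc = build_ssml("Terms & conditions apply when cost < $5.",
                      rate="-10%", pause_ms=300)
# Parsing raises xml.etree.ElementTree.ParseError if the markup is malformed
root = ET.fromstring(ssml_doc)
print(ssml_doc)
```

The resulting string can be passed to speak_ssml_async, as Task 2 does with a hand-written document.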

Prerequisites

  • Azure subscription
  • Azure Speech resource
  • Python 3.9+ or .NET 8
  • Package: azure-cognitiveservices-speech (v1.38+)
  • Audio output device (speaker) or file output
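
Before starting, it can help to confirm that the key and region are visible to Python. The sketch below assumes the two environment variable names that Task 1 reads (AZURE_SPEECH_KEY and AZURE_SPEECH_REGION); the function name missing_speech_settings is hypothetical.

```python
import os

# Variable names as read by the Task 1 code
REQUIRED_SETTINGS = ("AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION")


def missing_speech_settings(env=None):
    """Return the required setting names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


missing = missing_speech_settings()
if missing:
    print("Set these environment variables first: " + ", ".join(missing))
else:
    print("Speech settings found.")
```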

Implementation

Task 1: Basic Text-to-Speech

import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)

# Set voice
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# Set output format
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Audio16Khz32KBitRateMonoMp3
)

# Synthesize to audio file
audio_config = speechsdk.audio.AudioOutputConfig(filename="output.mp3")
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=audio_config
)

text = "Welcome to Azure AI Services. Today we'll explore text-to-speech capabilities."
result = synthesizer.speak_text_async(text).get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized successfully!")
    # In the Python SDK, audio_duration is a datetime.timedelta, not a tick count
    print(f"Audio duration: {result.audio_duration.total_seconds():.2f} seconds")
    print(f"Audio length: {len(result.audio_data)} bytes")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation = result.cancellation_details
    print(f"Synthesis canceled: {cancellation.reason}")
    print(f"Error: {cancellation.error_details}")
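
The overview mentions viseme data for avatar animation. As a rough sketch, the synthesizer's viseme_received event can be connected to a callback before synthesis starts. The helper names make_viseme_collector and synthesize_with_visemes are hypothetical. The conversion assumes that the event's audio_offset is reported in 100-nanosecond ticks. The demonstration at the end uses a stand-in event object, so it makes no Azure call.

```python
from types import SimpleNamespace

TICKS_PER_SECOND = 10_000_000  # assumption: one tick is 100 nanoseconds


def make_viseme_collector():
    """Return (handler, timeline); the handler appends (seconds, viseme_id)."""
    timeline = []

    def on_viseme(evt):
        timeline.append((evt.audio_offset / TICKS_PER_SECOND, evt.viseme_id))

    return on_viseme, timeline


def synthesize_with_visemes(synthesizer, text):
    """Attach the collector to an existing SpeechSynthesizer, then synthesize."""
    handler, timeline = make_viseme_collector()
    synthesizer.viseme_received.connect(handler)
    result = synthesizer.speak_text_async(text).get()
    return result, timeline


# Offline demonstration with a stand-in event
handler, demo_timeline = make_viseme_collector()
handler(SimpleNamespace(audio_offset=5_000_000, viseme_id=12))
print(demo_timeline)  # [(0.5, 12)]
```

With the synthesizer from Task 1, synthesize_with_visemes(synthesizer, text) would return the usual result plus a list of timed viseme IDs that an animation layer could consume.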

Task 2: SSML for Advanced Speech Control

# Synthesize using SSML for fine control
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <prosody rate="-10%" pitch="+5%">
      Welcome to the Azure AI certification prep.
    </prosody>

    <break time="500ms"/>

    <emphasis level="strong">
      This is an important concept.
    </emphasis>

    <break time="300ms"/>

    <prosody volume="soft" rate="slow">
      Let me explain it step by step.
    </prosody>

    <break time="500ms"/>

    <!-- Pronunciation control -->
    The API version is <say-as interpret-as="characters">3.0</say-as>.

    <break time="200ms"/>

    <!-- Date and number formatting -->
    The release date is <say-as interpret-as="date" format="mdy">01/15/2024</say-as>.
    The cost is <say-as interpret-as="currency" language="en-US">$2.50</say-as>.
  </voice>
</speak>
"""

# Use SSML synthesis
audio_config = speechsdk.audio.AudioOutputConfig(filename="ssml-output.mp3")
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=audio_config
)

result = synthesizer.speak_ssml_async(ssml).get()

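if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print(f"SSML synthesis complete: {len(result.audio_data)} bytes")
    print(f"Duration: {result.audio_duration.total_seconds():.2f}s")
elif result.reason == speechsdk.ResultReason.Canceled:
    # Malformed SSML is reported as a cancellation
    print(f"Synthesis canceled: {result.cancellation_details.error_details}")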
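
Malformed SSML only shows up as a canceled result after a round trip to the service, so a local check before calling speak_ssml_async can save time. The sketch below uses only the standard library. The function name check_ssml is hypothetical, and the two structural rules it applies (a <speak> root in the SSML namespace with at least one <voice> child) are a small subset of what the service validates.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def check_ssml(ssml_text):
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        root = ET.fromstring(ssml_text.strip())
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    problems = []
    if root.tag != f"{{{SSML_NS}}}speak":
        problems.append("root element must be <speak> in the SSML namespace")
    if root.find(f"{{{SSML_NS}}}voice") is None:
        problems.append("<speak> needs at least one <voice> child")
    return problems


good = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xml:lang="en-US"><voice name="en-US-JennyNeural">Hello'
    '<break time="300ms"/></voice></speak>'
)
unclosed = good.replace("</voice>", "")  # drops a closing tag
print(check_ssml(good))      # []
print(check_ssml(unclosed))  # one "not well-formed XML" message
```

Passing this check does not guarantee that the service accepts the document; it only catches XML-level and top-level structural mistakes early.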

Task 3: List Available Voices

# List the voices available for the en-US locale
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
voices_result = synthesizer.get_voices_async("en-US").get()

if voices_result.reason == speechsdk.ResultReason.VoicesListRetrieved:
    print(f"Available en-US voices ({len(voices_result.voices)}):")
    # Show only the first 10 voices to keep the output short
    for voice in voices_result.voices[:10]:
        print(f" {voice.short_name}: {voice.local_name} "
              f"({voice.gender.name}, {voice.voice_type.name})")
        if voice.style_list:
            print(f" Styles: {', '.join(voice.style_list)}")
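
Once the voice list is retrieved, a common next step is to pick voices that support a particular speaking style. The following sketch filters on the short_name and style_list attributes used above. The function name voices_with_style is hypothetical, and the sample objects are stand-ins with made-up names and styles, not real service data.

```python
from types import SimpleNamespace


def voices_with_style(voices, style):
    """Return short names of voices whose style_list contains the style."""
    wanted = style.lower()
    return [
        v.short_name
        for v in voices
        if wanted in (s.lower() for s in (v.style_list or []))
    ]


# Stand-in voice objects for an offline demonstration
sample_voices = [
    SimpleNamespace(short_name="voice-a", style_list=["chat", "cheerful"]),
    SimpleNamespace(short_name="voice-b", style_list=[]),
    SimpleNamespace(short_name="voice-c", style_list=["newscast"]),
]
print(voices_with_style(sample_voices, "Cheerful"))  # ['voice-a']
```

With a real result, the call would be voices_with_style(voices_result.voices, "cheerful").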

Expected Output

Speech synthesized successfully!
Audio duration: 4.82 seconds
Audio length: 77120 bytes

SSML synthesis complete: 98304 bytes
Duration: 8.15s

Available en-US voices (148):
 en-US-JennyNeural: Jenny (Female, OnlineNeural)
 Styles: assistant, chat, customerservice, newscast, angry, cheerful, sad
 en-US-GuyNeural: Guy (Male, OnlineNeural)
 Styles: newscast, angry, cheerful, sad
 en-US-AriaNeural: Aria (Female, OnlineNeural)
 Styles: chat, customerservice, narration-professional

Break & fix

Scenario | Symptom | Root Cause | Fix
No audio output | 0 bytes generated | Invalid voice name | Use exact short_name from voices list
SSML parse error | Synthesis canceled | Malformed XML | Validate SSML structure; check namespace URIs
Voice not found | Cancellation error | Voice not available in region | Check voice availability per region
Audio quality poor | Robotic sound | Using old standard voices | Switch to Neural voices (*Neural suffix)
Large file size | Excessive audio bytes | Wrong output format | Use compressed format (MP3/OGG) instead of raw PCM
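
To put the "Large file size" row in numbers, the sketch below estimates payload size from the data rate implied by each format name. The keys are SpeechSynthesisOutputFormat member names, kept as plain strings so the example runs without the SDK. The figures are approximations that ignore container headers and encoder padding, and estimate_audio_bytes is a hypothetical helper.

```python
# Approximate audio payload in bytes per second of speech
BYTES_PER_SECOND = {
    "Audio16Khz32KBitRateMonoMp3": 32_000 // 8,  # 32 kbit/s MP3
    "Raw16Khz16BitMonoPcm": 16_000 * 2,          # 16 kHz, 2 bytes per sample
    "Riff16Khz16BitMonoPcm": 16_000 * 2,         # same PCM data plus a WAV header
}


def estimate_audio_bytes(format_name, seconds):
    """Estimate the payload size for the given format and duration."""
    return int(BYTES_PER_SECOND[format_name] * seconds)


for name in BYTES_PER_SECOND:
    print(f"{name}: ~{estimate_audio_bytes(name, 10):,} bytes for 10 s")
```

For ten seconds of speech, the 32 kbit/s MP3 format comes to roughly 40 KB against roughly 320 KB for 16 kHz 16-bit PCM, an eightfold difference.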

Knowledge Check

1. What is the purpose of the <prosody> SSML element?

2. Which method do you use to synthesize speech from SSML?

3. What does the <say-as> SSML element do?

4. What audio output formats are available for text-to-speech?

5. How do you insert a pause in synthesized speech?

Cleanup

az group delete --name rg-ai102-speech --yes --no-wait
