Challenge 35: Text-to-Speech and SSML
Estimated Time
45 min | Cost: $2-5 (estimated) | Domain: Implement NLP Solutions (15-20%)
Exam skills covered
- Implement text-to-speech synthesis
- Improve speech output with SSML (Speech Synthesis Markup Language)
- Configure voice selection and audio output formats
Overview
Azure Text-to-Speech (TTS) converts text into natural-sounding audio:
| Feature | Description |
|---|---|
| Neural voices | AI-generated voices (400+ across 140 languages and locales) |
| SSML | XML markup for controlling prosody, emphasis, and pauses |
| Audio formats | WAV, MP3, OGG, raw PCM |
| Viseme | Mouth-position data for avatar animation |
| Custom Neural Voice | Train a unique voice (requires approval) |
SSML elements: <speak>, <voice>, <prosody>, <emphasis>, <break>, <say-as>, <phoneme>
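The elements listed above nest in a fixed order: `<speak>` is the root, `<voice>` sits inside it, and elements such as `<prosody>` and `<break>` go inside the voice. The following is a minimal sketch of that nesting, assuming a hypothetical helper named `build_ssml` (not part of the Speech SDK). It uses Python's standard `xml.etree.ElementTree`, so characters such as `&` in the spoken text are escaped automatically.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize the SSML namespace as the default namespace (no prefix).
ET.register_namespace("", SSML_NS)


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US",
               rate=None, pause_ms=None):
    """Hypothetical helper: wrap text in <speak>/<voice>, optionally with
    a <prosody rate=...> wrapper and a trailing <break time=...>."""
    speak = ET.Element(f"{{{SSML_NS}}}speak",
                       {"version": "1.0", f"{{{XML_NS}}}lang": lang})
    voice_el = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice})
    target = voice_el
    if rate is not None:
        target = ET.SubElement(voice_el, f"{{{SSML_NS}}}prosody", {"rate": rate})
    target.text = text  # ElementTree escapes &, <, > on serialization
    if pause_ms is not None:
        ET.SubElement(voice_el, f"{{{SSML_NS}}}break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


print(build_ssml("Questions & answers", rate="-10%", pause_ms=500))
```

The returned string can be passed to an SSML synthesis call. Building the markup from elements rather than string concatenation avoids malformed XML when the text comes from user input.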
Prerequisites
- Azure subscription
- Azure Speech resource
- Python 3.9+ or .NET 8
- Package: azure-cognitiveservices-speech (v1.38+)
- Audio output device (speaker) or file output
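The code in this challenge reads the Speech key and region from the `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` environment variables. The following is a small preflight sketch, assuming a hypothetical helper named `missing_speech_settings`, that reports which of the two are unset before any SDK call is made.

```python
import os

REQUIRED_VARS = ("AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION")


def missing_speech_settings(env=None):
    """Hypothetical helper: return the names of required variables that are
    missing or blank in the given mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment
print(missing_speech_settings({"AZURE_SPEECH_KEY": "abc"}))
# ['AZURE_SPEECH_REGION']
```

Calling `missing_speech_settings()` with no argument checks the real environment. An empty list means both values are present.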
Implementation
Task 1: Basic Text-to-Speech
Python SDK:
import os
import azure.cognitiveservices.speech as speechsdk

# Read the Speech resource key and region from the environment
speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)
# Set voice
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
# Set output format
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Audio16Khz32KBitRateMonoMp3
)
# Synthesize to audio file
audio_config = speechsdk.audio.AudioOutputConfig(filename="output.mp3")
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=audio_config
)
text = "Welcome to Azure AI Services. Today we'll explore text-to-speech capabilities."
result = synthesizer.speak_text_async(text).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized successfully!")
    # audio_duration is a datetime.timedelta in the Python SDK
    print(f"Audio duration: {result.audio_duration.total_seconds():.2f} seconds")
    print(f"Audio length: {len(result.audio_data)} bytes")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation = result.cancellation_details
    print(f"Synthesis canceled: {cancellation.reason}")
    print(f"Error: {cancellation.error_details}")
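C# SDK:
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;  // AudioConfig lives in this namespace

var speechConfig = SpeechConfig.FromSubscription(
    Environment.GetEnvironmentVariable("AZURE_SPEECH_KEY"),
    Environment.GetEnvironmentVariable("AZURE_SPEECH_REGION"));
speechConfig.SpeechSynthesisVoiceName = "en-US-JennyNeural";
// Write the synthesized audio to a WAV file
using var audioConfig = AudioConfig.FromWavFileOutput("output.wav");
using var synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
var result = await synthesizer.SpeakTextAsync("Welcome to Azure AI text-to-speech.");
if (result.Reason == ResultReason.SynthesizingAudioCompleted)
    Console.WriteLine($"Audio synthesized: {result.AudioData.Length} bytes");
else if (result.Reason == ResultReason.Canceled)
{
    var cancellation = SpeechSynthesisCancellationDetails.FromResult(result);
    Console.WriteLine($"Synthesis canceled: {cancellation.Reason} - {cancellation.ErrorDetails}");
}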
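In the examples above, the Python code writes MP3 and the C# code writes to a `.wav` file. The encoding is decided by the output format set on the speech configuration, not by the file extension, so the two should be kept in step. The sketch below assumes a hypothetical helper named `output_format_name`. It maps a file extension to the name of a `SpeechSynthesisOutputFormat` member, and the particular 16 kHz formats chosen are an assumption for illustration.

```python
from pathlib import Path

# Extension -> name of a SpeechSynthesisOutputFormat member (illustrative choices)
FORMAT_BY_EXTENSION = {
    ".mp3": "Audio16Khz32KBitRateMonoMp3",
    ".wav": "Riff16Khz16BitMonoPcm",
    ".ogg": "Ogg16Khz16BitMonoOpus",
    ".pcm": "Raw16Khz16BitMonoPcm",
}


def output_format_name(filename):
    """Hypothetical helper: pick an output format name from the file extension."""
    ext = Path(filename).suffix.lower()
    try:
        return FORMAT_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"No output format mapped for extension {ext!r}") from None


print(output_format_name("output.mp3"))
# Audio16Khz32KBitRateMonoMp3
```

With the SDK imported as `speechsdk`, the name can be resolved with `getattr(speechsdk.SpeechSynthesisOutputFormat, output_format_name("output.mp3"))` and passed to `set_speech_synthesis_output_format`.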
Task 2: SSML for Advanced Speech Control
Python SDK:
# Synthesize using SSML for fine control
# (reuses speechsdk and speech_config from Task 1, including the MP3 output format)
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <prosody rate="-10%" pitch="+5%">
      Welcome to the Azure AI certification prep.
    </prosody>
    <break time="500ms"/>
    <!-- Support for emphasis varies by voice; check the voice's documentation -->
    <emphasis level="strong">
      This is an important concept.
    </emphasis>
    <break time="300ms"/>
    <prosody volume="soft" rate="slow">
      Let me explain it step by step.
    </prosody>
    <break time="500ms"/>
    <!-- Pronunciation control -->
    The API version is <say-as interpret-as="characters">3.0</say-as>.
    <break time="200ms"/>
    <!-- Date and number formatting -->
    The release date is <say-as interpret-as="date" format="mdy">01/15/2024</say-as>.
    The cost is <say-as interpret-as="currency" language="en-US">$2.50</say-as>.
  </voice>
</speak>
"""
# Use SSML synthesis
audio_config = speechsdk.audio.AudioOutputConfig(filename="ssml-output.mp3")
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=audio_config
)
result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print(f"SSML synthesis complete: {len(result.audio_data)} bytes")
    # audio_duration is a datetime.timedelta in the Python SDK
    print(f"Duration: {result.audio_duration.total_seconds():.2f}s")
C# SDK:
// Reuses the synthesizer created in Task 1
string ssml = @"
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'
       xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'>
  <voice name='en-US-JennyNeural'>
    <prosody rate='-10%' pitch='+5%'>
      Welcome to Azure AI certification prep.
    </prosody>
    <break time='500ms'/>
    <emphasis level='strong'>This is important.</emphasis>
  </voice>
</speak>";
// A distinct variable name avoids clashing with `result` from Task 1
var ssmlResult = await synthesizer.SpeakSsmlAsync(ssml);
Console.WriteLine($"SSML result: {ssmlResult.Reason}, {ssmlResult.AudioData.Length} bytes");
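A malformed SSML string causes the synthesis to be canceled, so it can help to check the markup locally before sending it. The following is a minimal sketch, assuming a hypothetical helper named `check_ssml`. It only tests XML well-formedness and a few structural basics with the standard library, and it is not a full SSML schema validation.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def check_ssml(ssml):
    """Hypothetical helper: return a list of problems found.
    An empty list means the basic checks passed."""
    try:
        root = ET.fromstring(ssml.strip())
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    problems = []
    if local_name(root.tag) != "speak":
        problems.append("root element must be <speak>")
    if root.get("version") != "1.0":
        problems.append('<speak> needs version="1.0"')
    voices = [child for child in root if local_name(child.tag) == "voice"]
    if not voices:
        problems.append("no <voice> element directly under <speak>")
    for voice in voices:
        if not voice.get("name"):
            problems.append("<voice> is missing the name attribute")
    return problems


sample = ('<speak version="1.0" xml:lang="en-US">'
          '<voice name="en-US-JennyNeural">Hello<break time="500ms"/></voice>'
          '</speak>')
print(check_ssml(sample))           # []
print(check_ssml("<speak><voice>")) # one "not well-formed XML" message
```

A check like this catches unclosed tags and missing attributes early. Whether a given voice or element is supported by the service still has to be verified against the service itself.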
Task 3: List Available Voices
Python SDK:
# List the available en-US voices (reuses speechsdk and speech_config from Task 1)
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
voices_result = synthesizer.get_voices_async("en-US").get()
if voices_result.reason == speechsdk.ResultReason.VoicesListRetrieved:
    print(f"Available en-US voices ({len(voices_result.voices)}):")
    # Show only the first 10 voices to keep the output short
    for voice in voices_result.voices[:10]:
        print(f" {voice.short_name}: {voice.local_name} "
              f"({voice.gender.name}, {voice.voice_type.name})")
        if voice.style_list:
            print(f" Styles: {', '.join(voice.style_list)}")
REST API:
SPEECH_KEY="<key>"
REGION="eastus2"
# Get access token (the empty body makes curl send Content-Length: 0)
TOKEN=$(curl -s "https://${REGION}.api.cognitive.microsoft.com/sts/v1.0/issueToken" \
  -H "Ocp-Apim-Subscription-Key: ${SPEECH_KEY}" -X POST -d "")
# List voices (first five en-US entries)
curl -s "https://${REGION}.tts.speech.microsoft.com/cognitiveservices/voices/list" \
  -H "Authorization: Bearer ${TOKEN}" | jq '[.[] | select(.Locale=="en-US")] | .[0:5] | .[] | {ShortName, Gender, VoiceType}'
# Synthesize with SSML
curl -s "https://${REGION}.tts.speech.microsoft.com/cognitiveservices/v1" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/ssml+xml" \
  -H "X-Microsoft-OutputFormat: audio-16khz-32kbitrate-mono-mp3" \
  -d '<speak version="1.0" xml:lang="en-US"><voice name="en-US-JennyNeural">Hello world</voice></speak>' \
  --output output.mp3
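The same synthesis request can be assembled in Python without the SDK. The sketch below mirrors the final curl call above using only `urllib.request`. The function names and the `User-Agent` value are hypothetical, and the token is assumed to come from the `issueToken` call shown earlier.

```python
import urllib.request


def build_tts_request(region, token, ssml,
                      output_format="audio-16khz-32kbitrate-mono-mp3"):
    """Hypothetical helper: build (but do not send) the REST synthesis request."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,
        "User-Agent": "ai102-tts-lab",  # hypothetical application name
    }
    return urllib.request.Request(url, data=ssml.encode("utf-8"),
                                  headers=headers, method="POST")


def synthesize_to_file(request, path):
    """Send the request and write the returned audio bytes to a file."""
    with urllib.request.urlopen(request, timeout=30) as response, \
            open(path, "wb") as out:
        out.write(response.read())


ssml = ('<speak version="1.0" xml:lang="en-US">'
        '<voice name="en-US-JennyNeural">Hello world</voice></speak>')
request = build_tts_request("eastus2", "<token>", ssml)
print(request.get_method(), request.full_url)
# synthesize_to_file(request, "output.mp3")  # needs a valid token and network access
```

Separating request construction from sending keeps the headers easy to inspect and lets the network call be skipped when no valid token is available.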
Expected Output
The values below are illustrative; byte counts, durations, voice counts, and style lists vary with the service version and region.
Speech synthesized successfully!
Audio duration: 4.82 seconds
Audio length: 77120 bytes
SSML synthesis complete: 98304 bytes
Duration: 8.15s
Available en-US voices (148):
 en-US-JennyNeural: Jenny (Female, OnlineNeural)
 Styles: assistant, chat, customerservice, newscast, angry, cheerful, sad
 en-US-GuyNeural: Guy (Male, OnlineNeural)
 Styles: newscast, angry, cheerful, sad
 en-US-AriaNeural: Aria (Female, OnlineNeural)
 Styles: chat, customerservice, narration-professional
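The style lists in the output above can be used to pick a voice that supports a particular speaking style. The following is a minimal sketch, assuming a hypothetical helper named `voices_with_style` and voice entries shaped like the JSON objects returned by the REST `voices/list` endpoint (with `ShortName`, `Locale`, and `StyleList` keys). The sample data is hand-copied from the illustrative output and is not a live response.

```python
def voices_with_style(voices, style, locale=None):
    """Hypothetical helper: return the ShortName of each voice whose
    StyleList contains the given style, optionally filtered by locale."""
    matches = []
    for voice in voices:
        if locale is not None and voice.get("Locale") != locale:
            continue
        if style in voice.get("StyleList", []):
            matches.append(voice["ShortName"])
    return matches


# Sample entries copied from the illustrative output above
sample = [
    {"ShortName": "en-US-JennyNeural", "Locale": "en-US",
     "StyleList": ["assistant", "chat", "customerservice", "newscast",
                   "angry", "cheerful", "sad"]},
    {"ShortName": "en-US-GuyNeural", "Locale": "en-US",
     "StyleList": ["newscast", "angry", "cheerful", "sad"]},
    {"ShortName": "en-US-AriaNeural", "Locale": "en-US",
     "StyleList": ["chat", "customerservice", "narration-professional"]},
]

print(voices_with_style(sample, "newscast", locale="en-US"))
# ['en-US-JennyNeural', 'en-US-GuyNeural']
```

A selected style is applied in SSML with the `<mstts:express-as style="...">` element, which is why the SSML examples declare the `mstts` namespace.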
Break & Fix
| Scenario | Symptom | Root Cause | Fix |
|---|---|---|---|
| No audio output | 0 bytes generated | Invalid voice name | Use the exact short_name from the voice list |
| SSML parse error | Synthesis canceled | Malformed XML | Validate the SSML structure; check the namespace URIs |
| Voice not found | Cancellation error | Voice not available in the region | Check voice availability by region |
| Poor audio quality | Robotic sound | Using older standard voices | Switch to Neural voices (names ending in Neural) |
| File too large | Excessive audio bytes | Wrong output format | Use a compressed format (MP3/OGG) instead of raw PCM |
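The last row of the table recommends a compressed format over raw PCM. A rough estimate shows why: the sketch below compares the payload size of a 32 kbit/s MP3 stream with 16 kHz, 16-bit mono PCM for the same duration. The function names are hypothetical, and the figures ignore container headers and encoder padding, so real files differ slightly.

```python
def compressed_size_bytes(seconds, bitrate_kbps):
    """Approximate payload size of a constant-bitrate stream."""
    return int(seconds * bitrate_kbps * 1000 / 8)


def pcm_size_bytes(seconds, sample_rate_hz=16000, bits_per_sample=16, channels=1):
    """Approximate payload size of uncompressed PCM audio."""
    return int(seconds * sample_rate_hz * bits_per_sample / 8 * channels)


duration = 10.0  # seconds of speech
print(compressed_size_bytes(duration, 32))  # 40000
print(pcm_size_bytes(duration))             # 320000
```

Under these assumptions, uncompressed PCM is about eight times larger than the 32 kbit/s MP3 for the same audio.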
Knowledge Check
1. What is the purpose of the SSML <prosody> element?
2. Which method do you use to synthesize speech from SSML?
3. What does the SSML <say-as> element do?
4. Which audio output formats are available for text-to-speech?
5. How do you insert a pause into synthesized speech?
Cleanup
az group delete --name rg-ai102-speech --yes --no-wait