Azure OpenAI: Text-to-Speech Voice

Sofia Sondh
Feb 26, 2024

Artificial intelligence is changing fast. One big change is text-to-speech (TTS), which turns written words into spoken ones. This has made digital platforms easier to use. Azure OpenAI Text to Speech is a type of TTS that stands out because of its advanced features and high-quality voice.

In this article, you will learn more about the voices in Azure OpenAI Text to Speech. We’ll look at what makes these voices special and how they make the user experience feel real and engaging.

What is Azure OpenAI Text to Speech?

Text-to-Speech Voice Models

Use of Text-to-Speech Voices in Your Application

Azure OpenAI Text-to-Speech Features

Voice Gallery

Custom Voice

Personal Voice

Audio Content Creation

Text-to-Speech Avatar

Development and Deployment Options

Conclusion

What is Azure OpenAI Text to Speech?

Azure OpenAI Text to Speech is a powerful tool that is part of the Azure AI Services. It’s designed to convert written text into spoken words, creating a more interactive and accessible user experience.

Utilization of OpenAI Text to Speech Voices

The Azure OpenAI Text-to-Speech service uses a variety of voices for audio generation. These voices are designed to produce lifelike spoken audio from the text input.


Here are the six OpenAI voices:

Alloy
Echo
Fable
Onyx
Nova
Shimmer

You select one of these voices by name each time you request audio.

Text-to-Speech Conversion

The primary function of Azure OpenAI Text to Speech is to transform textual content into audible speech. This is useful in many applications, such as reading text aloud for visually impaired users or providing audio for interactive voice response systems.

Real-Time Synthesis

The service supports real-time synthesis. This means it can convert text to speech almost instantly, making it ideal for real-time applications.

Access and Setup

To use Azure OpenAI Text to Speech, you need an Azure subscription with access to Azure OpenAI Service. Create an Azure OpenAI resource in the North Central US or Sweden Central region and deploy the tts-1 or tts-1-hd model to it. To make a call against Azure OpenAI, you need the resource's endpoint and a key.
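As a rough illustration of how the endpoint, key, and deployed model fit together, here is a minimal Python sketch that builds (but does not send) a speech request. The URL path, the `api-version` value, the `api-key` header, and the JSON field names reflect the Azure OpenAI REST interface as I understand it at the time of writing, so treat them as assumptions and check the current API reference. The function name, the endpoint, and the deployment name are hypothetical.

```python
import json
import urllib.request

# The six voices listed above, written as lowercase request values (assumed form).
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}


def build_speech_request(endpoint, api_key, deployment, text, voice="alloy",
                         model="tts-1", api_version="2024-02-15-preview"):
    """Build, without sending, a POST request that asks for spoken audio."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    # Assumed URL shape: <endpoint>/openai/deployments/<deployment>/audio/speech
    url = (f"{endpoint.rstrip('/')}/openai/deployments/{deployment}"
           f"/audio/speech?api-version={api_version}")
    body = json.dumps({"model": model, "input": text, "voice": voice})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_speech_request(
        "https://my-resource.openai.azure.com",  # hypothetical endpoint
        "<your-key>",                            # placeholder, not a real key
        "my-tts-deployment",                     # hypothetical deployment name
        "Hello from Azure OpenAI Text to Speech.",
        voice="nova",
    )
    print(req.get_method(), req.full_url)
    # Sending is left out on purpose; urllib.request.urlopen(req) would
    # perform the call, and the response body would carry the audio.
```

Keeping the request construction separate from the network call makes it easy to inspect the URL and payload before spending any quota.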

Azure OpenAI Text-to-Speech Voice Models

Azure OpenAI Text to Speech offers two model variants: Neural and NeuralHD.

Neural: This model variant is optimized for real-time use cases with the lowest latency. It’s designed to provide quick responses, making it ideal for applications that require immediate feedback, such as interactive voice response systems.
NeuralHD: This model variant is optimized for quality. It takes a bit more time to generate the speech, but the output has higher quality, making it suitable for applications where audio quality is paramount, such as audiobooks or high-quality voiceovers.

These two model variants allow developers to choose the one that best suits their specific needs.

Use of Text-to-Speech Voices in Your Application

Azure OpenAI Text to Speech has a wide range of applications and offers various benefits for developers:

Training Videos: The service can be used to generate voiceovers for training videos. This can make the content more engaging and accessible to a wider audience.
Live-Streaming: In live-streaming scenarios, Azure OpenAI Text to Speech can be used to convert chat messages or other text into speech, allowing streamers to interact with their audience in real time.
Human-like Voices for Chatbots: Developers can use Azure OpenAI Text to Speech to give their chatbots and virtual assistants human-like voices. This can make interactions with these tools more natural and engaging.
Audiobook or Article Narration: The service can be used to convert text from books or articles into speech, creating audiobooks or spoken articles that users can listen to.
Translation Across Multiple Languages: Azure OpenAI Text to Speech supports multiple languages, allowing developers to create applications that can speak to users in their native language.
Content Creation for Games: In gaming applications, the service can be used to generate speech for characters, narration, and other in-game audio.
Assistance to the Visually Impaired: By converting text into speech, Azure OpenAI Text to Speech can make digital content more accessible to visually impaired users.

Azure OpenAI Text-to-Speech Features

Below are the capabilities offered by Azure OpenAI Text to Speech:

Voice Gallery

The Voice Gallery in Azure OpenAI Text to Speech refers to the collection of available voices that can be used to convert written text into spoken words. These voices are designed to produce lifelike spoken audio from the text input.

The Voice Gallery is available in the following regions:

North Central US
Sweden Central

Each voice has its own personality and style, allowing developers to choose the one that best suits their specific needs.

Custom Voice

The Custom Voice feature in Azure OpenAI Text to Speech allows customers to create a synthetic voice that is unique to their product or brand. Customers provide audio recordings of voice talent, which the feature uses to train a custom neural voice.

Custom voice is available in the following regions:

North Central US
Sweden Central

When a voice is paired with a text-to-speech avatar, customers can choose either a prebuilt or a custom neural voice.

This capability allows developers to give human-like voices to chatbots, narrate audiobooks and articles, speak across multiple languages, create content for games, and offer much-needed assistance to the visually impaired.

Personal Voice

Personal Voice is a feature that allows you to create an AI-generated replica of your voice (or the voices of your application's users) in a few seconds. You provide a one-minute speech sample as the audio prompt and then use it to generate speech in any of the more than 90 languages supported across more than 100 locales.

Personal Voice is available in the following regions:

West Europe
East US
Southeast Asia

The user’s voice characteristics are encoded in the speakerProfileId property that’s used for text-to-speech. Once you have a personal voice, you can use it to synthesize speech in any of the supported languages.

A locale tag isn’t required as Personal Voice uses automatic language detection at the sentence level.
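To make the role of speakerProfileId more concrete, the following Python sketch assembles an SSML document that wraps the text in an embedding element carrying the profile ID. The `mstts:ttsembedding` element and the namespace URIs follow the Azure Speech SSML conventions as I understand them. The function name is hypothetical, and the base voice name `DragonLatestNeural` is an assumption that may differ in your region or API version, so verify it against the current documentation.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "http://www.w3.org/2001/mstts"


def personal_voice_ssml(speaker_profile_id, text, base_voice="DragonLatestNeural"):
    """Return SSML that speaks `text` with the given personal voice profile.

    `base_voice` is an assumed base-model voice name, not confirmed by the article.
    """
    return (
        # xml:lang on <speak> is part of SSML itself; no per-sentence locale
        # tag is added, since the language is detected automatically.
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang="en-US">'
        f'<voice name={quoteattr(base_voice)}>'
        f'<mstts:ttsembedding speakerProfileId={quoteattr(speaker_profile_id)}>'
        f'{escape(text)}'
        '</mstts:ttsembedding>'
        '</voice>'
        '</speak>'
    )


if __name__ == "__main__":
    # The profile ID below is a made-up placeholder.
    print(personal_voice_ssml("00000000-0000-0000-0000-000000000000",
                              "Bonjour, and welcome back."))
```

Escaping the text and quoting the attributes keeps the document well-formed even when the input contains characters such as `&` or `<`.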

Audio Content Creation

Audio Content Creation is a tool that allows you to optimize text-to-speech voice output by easily adjusting and fine-tuning key speech attributes. You can use the output audio as-is, or as a starting point for further customization.

Audio content creation is available in the following regions:

North Central US
Sweden Central

This tool is based on Speech Synthesis Markup Language (SSML). It lets you adjust text-to-speech output attributes, such as voice characters, voice styles, speaking speed, pronunciation, and prosody, for both real-time and batch synthesis.

It enables you to build highly natural audio content for various scenarios, such as audiobooks, news broadcasts, video narrations, and chatbots.
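The kind of SSML that such tuning produces can be sketched by hand. The Python example below is a minimal illustration, not the tool's actual output: it wraps text in a `prosody` element to set speaking rate and pitch. The function name is hypothetical, and the voice name `en-US-JennyNeural` is only an example of an Azure prebuilt voice, so substitute whichever voice is available to you.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def prosody_ssml(text, voice="en-US-JennyNeural", rate="medium",
                 pitch="medium", lang="en-US"):
    """Return SSML that speaks `text` with the given rate and pitch.

    `rate` and `pitch` accept SSML values such as "slow", "fast", "high",
    or relative changes like "+10%".
    """
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        f'<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>'
        f'{escape(text)}'
        '</prosody>'
        '</voice>'
        '</speak>'
    )


if __name__ == "__main__":
    # Slightly faster and higher than the voice's default delivery.
    print(prosody_ssml("Here are today's headlines.", rate="+10%", pitch="high"))
```

Generating the markup from a function like this makes it straightforward to produce many variants for batch synthesis and compare them by ear.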

Text-to-Speech Avatar

The Text-to-Speech Avatar system is a text-to-speech feature with vision capabilities. It allows customers to create synthetic videos of a 2D photorealistic avatar speaking. The Neural Text to Speech Avatar models are trained by deep neural networks based on human video recording samples.

Text to speech avatar is available in the following regions:

West US 2
West Europe
Southeast Asia

There are three components in an avatar content generation workflow:

Text Analyzer
TTS Audio Synthesizer
TTS Avatar Video Synthesizer

The service offers two separate Text to Speech Avatar features:

Prebuilt Text to Speech Avatar
Custom Text to Speech Avatar

Prebuilt Text to Speech Avatar: Microsoft offers prebuilt Text to Speech Avatars as out-of-box products on Azure for its subscribers. These avatars can speak different languages and voices based on the text input.

Custom Text to Speech Avatar: This feature enables customers to create a personalized avatar for their product or brand. Customers can upload their video recordings of avatar talent, which the feature uses to train a synthetic video of the custom avatar speaking.

Development and Deployment Options

Azure OpenAI Text to Speech provides a variety of development and deployment options to cater to different use cases and requirements.

Development Options:

REST API: This is a set of HTTP endpoints that developers can use to interact with the service. Developers send HTTP requests to these endpoints to perform actions such as converting text to speech, and the service sends back responses.
Speech SDK: This is a software development kit that provides libraries and tools that help developers integrate the service into their applications. It supports multiple programming languages, so developers can often stay in the language their application is already written in.
Speech CLI: This is a command-line interface that allows you to interact with the service using commands in a terminal or command prompt. This can be particularly useful for testing or for automating tasks with scripts.
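To give a feel for the REST option, here is a minimal Python sketch that prepares (but does not send) a synthesis request to the Speech service. The host pattern, the `Ocp-Apim-Subscription-Key` and `X-Microsoft-OutputFormat` headers, and the output format string reflect the Speech service text-to-speech REST interface as I understand it, so confirm them against the current reference before relying on them. The function name, region, key, and `User-Agent` value are placeholders.

```python
import urllib.request


def build_tts_request(region, subscription_key, ssml,
                      output_format="audio-16khz-32kbitrate-mono-mp3"):
    """Build, without sending, a POST request that submits SSML for synthesis."""
    # Assumed regional host pattern for the text-to-speech endpoint.
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,
        "User-Agent": "tts-sketch",  # placeholder application name
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


if __name__ == "__main__":
    ssml = (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        '<voice name="en-US-JennyNeural">Hello from the REST API.</voice>'
        '</speak>'
    )
    req = build_tts_request("northcentralus", "<your-key>", ssml)
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would send the request; the response body
    # would then contain the audio in the requested format.
```

The Speech SDK and Speech CLI wrap this same exchange, so the REST form is mainly useful when you cannot add a dependency or need full control over the HTTP call.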

Deployment Options:

Cloud: The service can be deployed in the cloud, allowing you to access it from anywhere with an internet connection. This eliminates the need for you to manage any hardware or infrastructure, making it a convenient and scalable option.
Embedded: This option allows you to deploy the service on your own hardware, such as a local server or a device. This gives you more control over the service and its performance, but it also means that you’re responsible for managing and maintaining the hardware.
Hybrid: This option combines cloud and embedded deployment. Some components of the service are deployed in the cloud, while others are deployed on your own hardware. This gives you the flexibility to choose where each component is hosted based on your specific needs.
Containers: This option allows you to deploy the service in containers, which are standalone, executable packages that include everything needed to run the service: the code, runtime, system tools, libraries, and settings. Containers are isolated from each other and communicate only through well-defined channels. This makes it easy to move the service between environments and helps it run the same way regardless of where it’s hosted.

Conclusion

In this journey of exploring Azure OpenAI Text-to-Speech voices, we’ve seen how this technology is revolutionizing the way we interact with digital platforms. The lifelike voices, the flexibility of choosing between different voice options, and the ability to deploy in various environments make Azure OpenAI Text to Speech a powerful tool in the realm of artificial intelligence.