Contributions are welcome! Check our contribution guide.
TTS-Wrapper simplifies using text-to-speech APIs by providing a unified interface across multiple services, allowing easy integration and manipulation of TTS capabilities.
Engine | OS | Online/Offline | SSML | Rate/Volume/Pitch | onWord events |
---|---|---|---|---|---|
Polly | Linux/MacOS/Windows | Online | Yes | Yes | Yes |
Google | Linux/MacOS/Windows | Online | Yes | Yes | Yes |
Azure | Linux/MacOS/Windows | Online | Yes | Yes | Yes |
Watson | Linux/MacOS/Windows | Online | Yes | No | Yes |
ElevenLabs | Linux/MacOS/Windows | Online | No | Yes | Yes |
Wit.AI | Linux/MacOS/Windows | Online | Yes | No | No |
Sherpa-Onnx | Linux/MacOS/Windows | Offline | No | No | No |
gTTS | Linux/MacOS/Windows | Online | No | No | No |
UWP | Windows | Offline | No | Yes | No |
SAPI | Windows | Offline | Yes | Yes | Yes |
NSS | MacOS | Offline | Yes | Yes | Yes |
eSpeak | Linux/MacOS/Windows | Offline | No | Yes | No |
Method | Description | Available Engines |
---|---|---|
speak() | Plays synthesized speech directly. | All engines |
synth_to_file() | Synthesizes speech and saves it to a file. | All engines |
speak_streamed() | Streams synthesized speech. | All engines |
set_property() | Sets properties like rate, volume, pitch. | All engines |
get_voices() | Retrieves available voices. | All engines |
connect() | Connects callback functions for events. | Polly, Microsoft, Google, Watson |
pause_audio() | Pauses ongoing speech playback. | All engines |
resume_audio() | Resumes paused speech playback. | All engines |
stop_audio() | Stops ongoing speech playback. | All engines |
set_output_device('id') | Sets the audio output device by ID. | All engines |
check_credentials() | Returns True or False depending on whether the credentials are valid. | All engines |
Notes:
This project requires the following system dependencies on Linux:
sudo apt-get install portaudio19-dev
or on MacOS, using Homebrew:
brew install portaudio
For PicoTTS on Debian systems:
sudo apt-get install libttspico-utils
pip install py3-tts-wrapper[google,microsoft,sapi,sherpaonnx,googletrans]
or via git
pip install git+https://github.com/willwade/tts-wrapper#egg=tts-wrapper[google,microsoft,sapi,mms,sherpaonnx]
or (the newer way we should all use)
pip install tts-wrapper[google,microsoft,sapi,sherpaonnx,googletrans]@git+https://github.com/willwade/tts-wrapper
NB: On MacOS (zsh) you may need to use quotes:
pip install py3-tts-wrapper"[google, watson, polly, elevenlabs, microsoft, mms, sherpaonnx]"
from tts_wrapper import PollyClient
pollyClient = PollyClient(credentials=('aws_region', 'aws_key_id', 'aws_secret_access_key'))
from tts_wrapper import PollyTTS
tts = PollyTTS(pollyClient)
ssml_text = tts.ssml.add('Hello, <break time="500ms"/> world!')
tts.speak(ssml_text)
You can use SSML or plain text
from tts_wrapper import PollyClient
pollyClient = PollyClient(credentials=('aws_region', 'aws_key_id', 'aws_secret_access_key'))
from tts_wrapper import PollyTTS
tts = PollyTTS(pollyClient)
tts.speak('Hello world')
For a full demo, see the examples folder. You'll need to fill out credentials.json (or credentials-private.json), then cd into the examples folder and run the scripts from there. Tips on obtaining keys are below.
Each service uses different methods for authentication:
from tts_wrapper import PollyTTS, PollyClient
client = PollyClient(credentials=('aws_region','aws_key_id', 'aws_secret_access_key'))
tts = PollyTTS(client)
from tts_wrapper import GoogleTTS, GoogleClient
client = GoogleClient(credentials=('path/to/creds.json'))
tts = GoogleTTS(client)
or pass the auth file as a dict, so the credentials stay in memory:
import json
import os
from tts_wrapper import GoogleTTS, GoogleClient
# Load the service-account JSON into a dict
with open(os.getenv("GOOGLE_CREDS_PATH"), "r") as file:
    credentials_dict = json.load(file)
client = GoogleClient(credentials=credentials_dict)
tts = GoogleTTS(client)
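The in-memory approach above assumes the environment variable is set and points at a readable file. A minimal sketch of a more defensive loader follows; `load_credentials` is a hypothetical helper, not part of tts-wrapper, and the temporary file only stands in for a real key file.

```python
import json
import os
import tempfile


def load_credentials(env_var: str = "GOOGLE_CREDS_PATH") -> dict:
    """Read a JSON credentials file whose path is stored in an environment variable.

    Hypothetical helper for illustration; not part of tts-wrapper.
    """
    path = os.getenv(env_var)
    if not path:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    with open(path, "r", encoding="utf-8") as file:
        return json.load(file)


# Demonstration with a throwaway file standing in for a real key file
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"type": "service_account", "project_id": "demo"}, tmp)

os.environ["GOOGLE_CREDS_PATH"] = tmp.name
credentials_dict = load_credentials()
print(credentials_dict["project_id"])
os.remove(tmp.name)
```

The resulting dict could then be passed as `GoogleClient(credentials=credentials_dict)`, as in the snippet above.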
from tts_wrapper import MicrosoftTTS, MicrosoftClient
client = MicrosoftClient(credentials=('subscription_key','subscription_region'))
tts = MicrosoftTTS(client)
from tts_wrapper import WatsonTTS, WatsonClient
client = WatsonClient(credentials=('api_key', 'region', 'instance_id'))
tts = WatsonTTS(client)
Note: If you have issues with SSL certification, try:
from tts_wrapper import WatsonTTS, WatsonClient
client = WatsonClient(credentials=('api_key', 'region', 'instance_id'),disableSSLVerification=True)
tts = WatsonTTS(client)
from tts_wrapper import ElevenLabsTTS, ElevenLabsClient
client = ElevenLabsClient(credentials=('api_key'))
tts = ElevenLabsTTS(client)
from tts_wrapper import WitAiTTS, WitAiClient
client = WitAiClient(credentials=('token'))
tts = WitAiTTS(client)
from tts_wrapper import UWPTTS, UWPClient
client = UWPClient()
tts = UWPTTS(client)
from tts_wrapper import SystemTTSClient, SystemTTS
# Pick the client that matches your platform
client = SystemTTSClient('espeak')  # eSpeak
client = SystemTTSClient('sapi')  # SAPI (Windows)
client = SystemTTSClient('nsss')  # NSSS (MacOS)
# Initialize the TTS engine
tts = SystemTTS(client)
Note: word timings are not available with this engine.
Uses the gTTS library.
from tts_wrapper import GoogleTransClient, GoogleTransTTS
voice_id = "en-co.uk" # Example voice ID for UK English
client = GoogleTransClient(voice_id)
# Initialize the TTS engine
tts = GoogleTransTTS(client)
You can leave the model path and tokens path blank, in which case a default location is used. Note that this engine is currently designed for MMS models. Other models, such as Piper, will be added later; regular Piper support will be dropped in favour of sherpa-onnx, which is less of a headache.
from tts_wrapper import SherpaOnnxClient, SherpaOnnxTTS
client = SherpaOnnxClient(model_path=None, tokens_path=None)
tts = SherpaOnnxTTS(client)
Set a voice like this:
# Find the available voices/languages
voices = tts.get_voices()
print("Available voices:", voices)
# Set the voice using ISO code
iso_code = "eng" # Example ISO code for the voice - also ID in voice details
tts.set_voice(iso_code)
and then use speak, speak_streamed etc.
You can then call the following methods.
Even if you don't use SSML features much, it's wise to use the same syntax everywhere, so pass SSML rather than plain text to all engines:
ssml_text = tts.ssml.add('Hello world!')
If you want to keep things simple, each engine will convert plain text to SSML when the input isn't SSML already.
tts.speak('Hello World!')
This will use the default audio output of your device to play the audio immediately
tts.speak(ssml_text)
This checks whether the credentials are valid. It is available only on the client object, e.g.
import os
from tts_wrapper import MicrosoftClient
client = MicrosoftClient(
    credentials=(os.getenv("MICROSOFT_TOKEN"), os.getenv("MICROSOFT_REGION"))
)
if client.check_credentials():
    print("Credentials are valid.")
else:
    print("Credentials are invalid.")
NB: Each engine checks credentials in its own way. If an engine has no dedicated check, the parent class falls back to calling get_voices. To save API calls, you can simply call get_voices yourself.
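The fallback described above can be sketched as follows. This is an illustration of the pattern only: `BaseClient`, `GoodClient`, and `BadClient` are hypothetical stand-ins, and the library's real parent class may differ in its details.

```python
class BaseClient:
    """Stand-in for a parent client class (hypothetical, not the library's code)."""

    def get_voices(self) -> list:
        raise NotImplementedError

    def check_credentials(self) -> bool:
        # Fallback: treat the credentials as valid if listing voices succeeds
        try:
            self.get_voices()
            return True
        except Exception:
            return False


class GoodClient(BaseClient):
    def get_voices(self) -> list:
        return [{"id": "demo-voice"}]


class BadClient(BaseClient):
    def get_voices(self) -> list:
        # Simulates the service rejecting the request
        raise PermissionError("401 Unauthorized")


print(GoodClient().check_credentials())
print(BadClient().check_credentials())
```

Because the fallback costs one voice-listing request, calling get_voices directly and reusing its result avoids paying for the same request twice.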
pause_audio(), resume_audio(), stop_audio()
These methods manage audio playback by pausing, resuming, or stopping it. NB: They only apply to speak_streamed.
You need to make sure the optional dependency is included for this
pip install py3-tts-wrapper[controlaudio,google.. etc
then
import time
client = GoogleClient(..)
tts = GoogleTTS(client)
try:
    text = "This is a pause and resume test. The text will be longer, depending on where the pause and resume works"
    audio_bytes = tts.synth_to_bytes(text)
    tts.load_audio(audio_bytes)
    print("Play audio for 3 seconds")
    tts.play(1)
    tts.pause(8)
    tts.resume()
    time.sleep(6)
finally:
    # Always release audio resources, even if playback fails
    tts.cleanup()
NB: This uses pyaudio. If you have issues with it, you may need to install portaudio19-dev, particularly on Linux:
sudo apt-get install portaudio19-dev
tts.synth_to_file(ssml_text, 'output.mp3', format='mp3')
There is also a legacy "synth" method. Saving as mp3, wav or flac is supported.
tts.synth('<speak>Hello, world!</speak>', 'hello.mp3', format='mp3')
Note you can also stream and save at the same time. The file is only written once streaming has finished completely.
ssml_text = tts.ssml.add('Hello world!')
tts.speak_streamed(ssml_text,filepath,'wav')
voices = tts.get_voices()
print(voices)
NB: All voices have an id, a dict of language_codes, a name and a gender. Note that not all voice engines provide gender.
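Given that structure, picking voices for one language is a simple filter. The sketch below runs on hand-written sample data; `voices_for_language` is a hypothetical helper, and the exact shape of `language_codes` (assumed here to be a dict keyed by language code) may vary by engine.

```python
def voices_for_language(voices, lang_code):
    """Return the voices whose language_codes contain lang_code.

    Hypothetical helper; `in` works whether language_codes is a dict or a list.
    """
    return [v for v in voices if lang_code in v.get("language_codes", {})]


# Sample data shaped like the description above (not real API output)
sample_voices = [
    {"id": "en-US-JessaNeural", "language_codes": {"en-US": "English (US)"},
     "name": "Jessa", "gender": "Female"},
    {"id": "fr-FR-DemoVoice", "language_codes": {"fr-FR": "French (France)"},
     "name": "Demo", "gender": None},
]

matches = voices_for_language(sample_voices, "en-US")
print([voice["id"] for voice in matches])
```

With a real engine, the list returned by `tts.get_voices()` would take the place of `sample_voices`.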
tts.set_voice(voice_id, lang_code='en-US')
e.g.
tts.set_voice('en-US-JessaNeural','en-US')
Use the id - not a name
ssml_text = tts.ssml.add('Hello, <break time="500ms"/> world!')
tts.speak(ssml_text)
Set volume:
tts.set_property("volume", "90")
text_read = f"The current volume is 90"
text_with_prosody = tts.construct_prosody_tag(text_read)
ssml_text = tts.ssml.add(text_with_prosody)
Set rate:
tts.set_property("rate", "slow")
text_read = f"The current rate is SLOW"
text_with_prosody = tts.construct_prosody_tag(text_read)
ssml_text = tts.ssml.add(text_with_prosody)
Speech Rate:
Set pitch:
tts.set_property("pitch", "high")
text_read = "The current pitch is HIGH"
text_with_prosody = tts.construct_prosody_tag(text_read)
ssml_text = tts.ssml.add(text_with_prosody)
Pitch Control:
Use the tts.ssml.clear_ssml() method to clear all entries from the ssml list.
set_property()
This method allows setting properties like rate, volume, and pitch.
tts.set_property("rate", "fast")
tts.set_property("volume", "80")
tts.set_property("pitch", "high")
get_property()
This method retrieves the value of properties such as volume, rate, or pitch.
current_volume = tts.get_property("volume")
print(f"Current volume: {current_volume}")
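To make the relationship between these properties and SSML concrete, here is a rough sketch of how a prosody tag could be assembled from property values. `build_prosody_tag` is a hypothetical function written for this illustration; it is not the library's `construct_prosody_tag`, whose output may differ.

```python
from xml.sax.saxutils import escape, quoteattr


def build_prosody_tag(text, **properties):
    """Wrap text in an SSML <prosody> element (illustrative sketch only)."""
    attrs = "".join(
        f" {name}={quoteattr(str(value))}"
        for name, value in properties.items()
        if value is not None  # skip properties that were never set
    )
    # Escape the text so characters like & and < stay valid XML
    return f"<prosody{attrs}>{escape(text)}</prosody>"


tag = build_prosody_tag("The current volume is 90", volume="90", rate="slow")
print(tag)
```

Escaping the text keeps the markup well-formed when the spoken text itself contains XML special characters.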
Note: only Polly, Microsoft, Google, ElevenLabs, UWP, SAPI and Watson report word timings correctly. No other engine exposes them, so for the remaining engines (e.g. Wit.AI and Piper) we provide estimated timings.
def my_callback(word: str, start_time: float, end_time: float):
    duration = end_time - start_time
    print(f"Word: {word}, Duration: {duration:.3f}s")

def on_start():
    print('Speech started')

def on_end():
    print('Speech ended')

try:
    text = "Hello, This is a word timing test"
    ssml_text = tts.ssml.add(text)
    tts.connect('onStart', on_start)
    tts.connect('onEnd', on_end)
    tts.start_playback_with_callbacks(ssml_text, callback=my_callback)
except Exception as e:
    print(f"Error: {e}")
and it will output
Speech started
Word: Hello, Duration: 0.612s
Word: , Duration: 0.212s
Word: This, Duration: 0.364s
Word: is, Duration: 0.310s
Word: a, Duration: 0.304s
Word: word, Duration: 0.412s
Word: timing, Duration: 0.396s
Word: test, Duration: 0.424s
Speech ended
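If you need the timings after playback rather than printed as they arrive, the callback can be a small collector object. This is a sketch: `TimingCollector` is a hypothetical class, and the loop at the bottom simulates the calls an engine would make with the same (word, start_time, end_time) arguments as my_callback above.

```python
from typing import List, Tuple


class TimingCollector:
    """Callable that stores word timings for later use (hypothetical helper)."""

    def __init__(self) -> None:
        self.words: List[Tuple[str, float, float]] = []

    def __call__(self, word: str, start_time: float, end_time: float) -> None:
        self.words.append((word, start_time, end_time))

    def total_duration(self) -> float:
        # Span from the earliest start to the latest end, in seconds
        if not self.words:
            return 0.0
        starts = [start for _, start, _ in self.words]
        ends = [end for _, _, end in self.words]
        return max(ends) - min(starts)


collector = TimingCollector()
# Simulated engine callbacks; real timings come from the TTS engine
for word, start, end in [("Hello", 0.0, 0.5), ("world", 0.5, 1.25)]:
    collector(word, start, end)

print(collector.total_duration())
```

Since the object is callable, it should be usable wherever a callback function is expected, for example as `callback=collector`.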
connect()
This method allows registering callback functions for events like onStart or onEnd.
def on_start():
    print("Speech started")

tts.connect('onStart', on_start)
By default, all engines output audio in the WAV format, but can be configured to output MP3 or other formats where supported.
tts.synth('<speak>Hello, world!</speak>', 'hello.mp3', format='mp3')
The synth_to_bytestream method is designed to synthesize text into an in-memory bytestream in the specified audio format (wav, mp3, flac, etc.). It is particularly useful when you want to handle the audio data in-memory for tasks like saving it to a file, streaming the audio, or passing it to another system for processing.
def synth_to_bytestream(self, text: Any, format: Optional[str] = "wav") -> BytesIO:
    """
    Synthesizes text to an in-memory bytestream in the specified audio format.
    :param text: The text to synthesize.
    :param format: The audio format (e.g., 'wav', 'mp3', 'flac'). Default: 'wav'.
    :return: A BytesIO object containing the audio data.
    """
Parameters:
text: The text to synthesize.
format: The audio format. Defaults to wav. Supported formats include wav, mp3, and flac.
Returns:
A BytesIO object containing the audio data in the requested format. This can be used directly to save to a file or for playback in real-time.
You can use the synth_to_bytestream method to synthesize audio in any supported format and save it directly to a file.
# Synthesize text into a bytestream in MP3 format
bytestream = tts.synth_to_bytestream("Hello, this is a test", format="mp3")
# Save the audio bytestream to a file
with open("output.mp3", "wb") as f:
    f.write(bytestream.read())
print("Audio saved to output.mp3")
Explanation:
The text is synthesized into a bytestream in MP3 format.
The BytesIO object is then written to a file using the .read() method of the BytesIO class.
sounddevice
If you want to play the synthesized audio live without saving it to a file, you can use the sounddevice library to directly play the audio from the BytesIO bytestream.
import sounddevice as sd
import numpy as np
# Synthesize text into a bytestream in WAV format
bytestream = tts.synth_to_bytestream("Hello, this is a live playback test", format="wav")
# Convert the bytestream back to raw PCM audio data for playback
audio_data = np.frombuffer(bytestream.read(), dtype=np.int16)
# Play the audio using sounddevice
sd.play(audio_data, samplerate=tts.audio_rate)
sd.wait()
print("Live playback completed")
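One caveat worth sketching: if the bytestream is a complete WAV file, its header bytes would be interpreted as samples by np.frombuffer(). Assuming that is the case, the standard-library wave module can separate the PCM frames from the header and report the sample rate. `wav_bytes_to_pcm` is a hypothetical helper, and the WAV data here is generated locally rather than produced by a TTS engine.

```python
import io
import wave


def wav_bytes_to_pcm(data: bytes):
    """Split WAV file bytes into raw PCM frames, sample rate, and channel count."""
    with wave.open(io.BytesIO(data), "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        frames = wav_file.readframes(wav_file.getnframes())
    return frames, sample_rate, channels


# Build 10 ms of 16-bit mono silence at 16 kHz as stand-in audio
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(16000)
    wav_file.writeframes(b"\x00\x00" * 160)

frames, sample_rate, channels = wav_bytes_to_pcm(buffer.getvalue())
print(len(frames), sample_rate, channels)
```

The returned frames could then be passed to np.frombuffer(frames, dtype=np.int16), with the sample rate taken from the file itself.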
Explanation:
The text is synthesized into a wav bytestream.
The bytestream is converted to raw PCM audio data with np.frombuffer(), which is then fed into the sounddevice library for live playback.
sd.play() plays the audio in real-time, and sd.wait() ensures that the program waits until playback finishes.
Clone the repository:
git clone https://github.com/willwade/tts-wrapper.git
cd tts-wrapper
Install the package and system dependencies:
pip install .
To install optional dependencies, use:
pip install ".[google,watson,polly,elevenlabs,microsoft]"
This will install Python dependencies and system dependencies required for this project. Note that system dependencies will only be installed automatically on Linux.
Clone the repository:
git clone https://github.com/willwade/tts-wrapper.git
cd tts-wrapper
Install Python dependencies:
poetry install
Install system dependencies (Linux only):
poetry run postinstall
NOTE: to get a requirements.txt file for the project, use poetry export --without-hashes --format=requirements.txt --all-extras > requirements.txt
Just be warned that this will include all dependencies, including dev ones.
git tag -a v0.1.0 -m "Release 0.1.0"
git push origin v0.1.0
This guide provides a step-by-step approach to adding a new engine to the existing Text-to-Speech (TTS) wrapper system.
Create a new folder for your engine within the engines directory. Name this folder according to your engine, such as witai for Wit.ai.
Directory structure:
engines/witai/
Create necessary files within this new folder:
__init__.py - Makes the directory a Python package.
client.py - Handles all interactions with the TTS API.
engine.py - Contains the TTS class that integrates with your abstract TTS system.
ssml.py - Defines any SSML handling specific to this engine.
Final directory setup:
engines/
└── witai/
├── __init__.py
├── client.py
├── engine.py
└── ssml.py
client.py
Implement authentication and necessary setup for API connection. This file should manage tasks such as sending synthesis requests and fetching available voices.
class TTSClient:
    def __init__(self, api_key):
        self.api_key = api_key
        # Setup other necessary API connection details here

    def synth(self, text, options):
        # Code to send a synthesis request to the TTS API
        pass

    def get_voices(self):
        # Code to retrieve available voices from the TTS API
        pass
engine.py
This class should inherit from the abstract TTS class and implement required methods such as get_voices and synth_to_bytes.
from .client import TTSClient
from your_tts_module.abstract_tts import AbstractTTS

class WitTTS(AbstractTTS):
    def __init__(self, api_key):
        super().__init__()
        self.client = TTSClient(api_key)

    def get_voices(self):
        return self.client.get_voices()

    def synth_to_bytes(self, text, format='wav'):
        return self.client.synth(text, {'format': format})
ssml.py
If the engine has specific SSML requirements or supports certain SSML tags differently, implement this logic here.
from your_tts_module.abstract_ssml import BaseSSMLRoot, SSMLNode

class EngineSSML(BaseSSMLRoot):
    def add_break(self, time='500ms'):
        self.root.add(SSMLNode('break', attrs={'time': time}))
__init__.py
Make sure the __init__.py file properly imports and exposes the TTS class and any other public classes or functions from your engine.
from .engine import WitTTS
from .ssml import EngineSSML
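Before wiring a new engine to a real API, it can help to exercise the class layout against a fake client. The sketch below is self-contained so it runs on its own: `AbstractTTS` here is a local stand-in for the wrapper's abstract class, and `FakeClient` and `DemoTTS` are hypothetical names used only for this illustration.

```python
from abc import ABC, abstractmethod


class AbstractTTS(ABC):
    """Local stand-in for the wrapper's abstract TTS class (assumption)."""

    @abstractmethod
    def get_voices(self):
        ...

    @abstractmethod
    def synth_to_bytes(self, text, format="wav"):
        ...


class FakeClient:
    """Pretends to be an API client; returns canned data instead of calling a service."""

    def __init__(self, api_key):
        self.api_key = api_key

    def synth(self, text, options):
        # A real client would send the text to the TTS API here
        return f"{options['format']}:{text}".encode("utf-8")

    def get_voices(self):
        return [{"id": "demo", "language_codes": ["en-US"], "name": "Demo", "gender": None}]


class DemoTTS(AbstractTTS):
    def __init__(self, client):
        self.client = client

    def get_voices(self):
        return self.client.get_voices()

    def synth_to_bytes(self, text, format="wav"):
        return self.client.synth(text, {"format": format})


tts = DemoTTS(FakeClient("api_key"))
print(tts.synth_to_bytes("Hello"))
print(tts.get_voices()[0]["id"])
```

Passing the client into the engine, rather than constructing it inside, mirrors how the other engines in this document are set up and makes it easy to swap in a fake for testing.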
This is not straightforward.
Create a Service Account:
Bearer token. It's in the cURL example.
This project is licensed under the MIT License.