What is interesting is that the bytestream method calls synth_to_bytes (i.e. it is not actually streaming), and it's faster. I imagine this won't be the case for long passages of text.
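One way to check whether any audio actually arrives incrementally is to time the first chunk separately from the full read. A minimal sketch, assuming the object returned by `synth_to_bytestream` behaves like a standard file-like stream (the full script below only ever calls `read()` on it):

```python
import time

# Assumes `tts` and `ssml_text` are already set up as in the full script below.
start = time.time()
stream = tts.synth_to_bytestream(ssml_text)

first_chunk = stream.read(4096)  # time until the first 4 KB is available
print(f"first chunk after {time.time() - start:.2f}s ({len(first_chunk)} bytes)")

rest = stream.read()  # drain the rest of the audio
print(f"all audio after {time.time() - start:.2f}s ({len(first_chunk) + len(rest)} bytes)")
```

If both timestamps come out essentially identical, the whole clip was synthesised up front and the "stream" is really just a buffer.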
If we change the text to a longer passage, we get:
text_read = """In the age of digital transformation, technologies like artificial intelligence and machine learning
are reshaping industries at a rapid pace. The automation of processes, once considered science fiction, is now a
reality that businesses across sectors are embracing. From healthcare to finance, the impact of AI is profound,
offering solutions that improve efficiency and decision-making. However, as with any powerful technology, AI also
comes with its own set of challenges, particularly in areas of ethics, privacy, and employment. It is crucial that
we navigate these challenges thoughtfully to ensure a balanced and fair future."""
speak_streamed: 50.06 seconds
first_word_spoken: 1.00 seconds
synth_to_file: 0.92 seconds
synth_to_bytestream: 0.91 seconds
speak: 0.63 seconds
The fastest method was 'speak' with a time of 0.63 seconds.
Ah! It's the overhead of setting up the first call!
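Another way to confirm this, besides reordering the calls as in the script below, is to pay the setup cost with a throwaway warm-up synthesis before timing anything. A minimal sketch, reusing the `tts` and `ssml_text` objects from the script (the `warmup.mp3` filename is just an illustration):

```python
import time

# Throwaway call so the one-off SDK/connection setup cost is paid here,
# not by whichever timed method happens to run first.
tts.synth_to_file(ssml_text, "warmup.mp3", "mp3")

timings = {}
for name, call in [
    ("synth_to_file", lambda: tts.synth_to_file(ssml_text, "test-microsoft.mp3", "mp3")),
    ("speak", lambda: tts.speak(ssml_text)),
]:
    start = time.time()
    call()
    timings[name] = time.time() - start
    print(f"{name}: {timings[name]:.2f} seconds")
```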
So if we run this the other way round:
```python
from tts_wrapper import MicrosoftTTS, MicrosoftClient
import time
import os
from load_credentials import load_credentials
import logging
# Load credentials
load_credentials("credentials.json")
client = MicrosoftClient(
    credentials=(os.getenv("MICROSOFT_TOKEN"), os.getenv("MICROSOFT_REGION"))
)
tts = MicrosoftTTS(client)
# Set up logging
logging.basicConfig(level=logging.DEBUG)
# Global variable to capture first word time
first_word_time = None
# Define a callback for when speech starts
def on_start():
    global first_word_time
    first_word_time = time.time()
    logging.info(f"First word spoken after {first_word_time - start_time:.2f} seconds")
# Connect the on_start event to the speak_streamed method
tts.connect("onStart", on_start)
# Set TTS properties
tts.set_property("volume", "100")
tts.ssml.clear_ssml()
tts.set_property("rate", "medium")
# Prepare SSML text
text_read = f"Hello, this is a streamed test"
text_read = """In the age of digital transformation, technologies like artificial intelligence and machine learning
are reshaping industries at a rapid pace. The automation of processes, once considered science fiction, is now a
reality that businesses across sectors are embracing. From healthcare to finance, the impact of AI is profound,
offering solutions that improve efficiency and decision-making. However, as with any powerful technology, AI also
comes with its own set of challenges, particularly in areas of ethics, privacy, and employment. It is crucial that
we navigate these challenges thoughtfully to ensure a balanced and fair future."""
text_with_prosody = tts.construct_prosody_tag(text_read)
ssml_text = tts.ssml.add(text_with_prosody)
print("ssml_test: ", ssml_text)
# Dictionary to store method timings
method_timings = {}
# Measure synth_to_file time
start_time = time.time()
tts.synth_to_file(ssml_text, "test-microsoft.mp3", "mp3")
end_time = time.time()
synthfile_time = end_time - start_time
logging.info(f"synth_to_file method took {synthfile_time:.2f} seconds")
method_timings["synth_to_file"] = synthfile_time
# Measure synth_to_bytestream time
start_time = time.time()
bytestream = tts.synth_to_bytestream(ssml_text)
with open("output_testazure.mp3", "wb") as f:
    f.write(bytestream.read())
end_time = time.time()
bytestream_time = end_time - start_time
logging.info(f"synth_to_bytestream method took {bytestream_time:.2f} seconds")
method_timings["synth_to_bytestream"] = bytestream_time
# Measure speak time
start_time = time.time()
tts.speak(ssml_text)
end_time = time.time()
speak_time = end_time - start_time
logging.info(f"speak method took {speak_time:.2f} seconds")
method_timings["speak"] = speak_time
# Measure speak_streamed time
start_time = time.time()
tts.speak_streamed(ssml_text)
end_time = time.time()
speakstream_time = end_time - start_time
logging.info(f"speak_streamed method took {speakstream_time:.2f} seconds")
method_timings["speak_streamed"] = speakstream_time
# Calculate first word spoken time
if first_word_time is not None:
    first_word_spoken_time = first_word_time - start_time
    method_timings["first_word_spoken"] = first_word_spoken_time
    logging.info(f"First word spoken after {first_word_spoken_time:.2f} seconds")
# Find the fastest method
fastest_method = min(method_timings, key=method_timings.get)
fastest_time = method_timings[fastest_method]
# Pretty-print the results
print("\nMethod Timing Results:")
for method, timing in method_timings.items():
    print(f"{method}: {timing:.2f} seconds")
print(
    f"\nThe fastest method was '{fastest_method}' with a time of {fastest_time:.2f} seconds."
)
if "first_word_spoken" in method_timings:
    print(
        f"\nThe first word was spoken in the speak_streamed method after {first_word_spoken_time:.2f} seconds."
    )
```
we get
```bash
Method Timing Results:
synth_to_file: 1.03 seconds
synth_to_bytestream: 0.77 seconds
speak: 0.60 seconds
speak_streamed: 49.97 seconds
first_word_spoken: 0.70 seconds
The fastest method was 'speak' with a time of 0.60 seconds.
The first word was spoken in the speak_streamed method after 0.70 seconds.
  _     ._   __/__   _ _  _  _ _/_   Recorded: 10:53:33  Samples:  621
 /_//_/// /_\ / //_// / //_'/ //     Duration: 53.711    CPU time: 4.438
/   _/                      v4.7.3
Program: basic_azure_example.py
53.694 <module> basic_azure_example.py:1
├─ 49.973 MicrosoftTTS.speak_streamed tts_wrapper/tts.py:218
│ ├─ 49.124 Thread.join threading.py:1064
│ │ [1 frames hidden] threading
│ │ 49.117 lock.acquire <built-in>
│ └─ 0.697 MicrosoftTTS.synth_to_bytes tts_wrapper/engines/microsoft/microsoft.py:111
│ └─ 0.658 ResultFuture.get azure/cognitiveservices/speech/speech.py:571
│ [2 frames hidden] azure
├─ 1.301 <module> tts_wrapper/__init__.py:1
│ ├─ 0.668 <module> tts_wrapper/engines/__init__.py:1
│ └─ 0.629 <module> tts_wrapper/tts.py:1
├─ 1.030 MicrosoftTTS.synth_to_file tts_wrapper/tts.py:181
│ └─ 0.943 MicrosoftTTS.synth_to_bytes tts_wrapper/engines/microsoft/microsoft.py:111
│ └─ 0.938 ResultFuture.get azure/cognitiveservices/speech/speech.py:571
│ [2 frames hidden] azure
├─ 0.767 MicrosoftTTS.synth_to_bytestream tts_wrapper/tts.py:155
│ └─ 0.765 MicrosoftTTS.synth_to_bytes tts_wrapper/engines/microsoft/microsoft.py:111
│ └─ 0.726 ResultFuture.get azure/cognitiveservices/speech/speech.py:571
│ [2 frames hidden] azure
└─ 0.600 MicrosoftTTS.speak tts_wrapper/engines/microsoft/microsoft.py:51
└─ 0.600 ResultFuture.get azure/cognitiveservices/speech/speech.py:571
      [2 frames hidden]  azure
```
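Reading the profile: almost all of `speak_streamed`'s ~50 seconds is spent in `Thread.join`, i.e. the call blocks until playback of the roughly 50-second clip has finished, while the actual synthesis (`synth_to_bytes`) takes about 0.7 seconds. If the caller only needs control back while the audio keeps playing, one option is to push the blocking call onto a worker thread. A minimal sketch (the thread wrapper is mine, not part of tts-wrapper):

```python
import threading
import time

# Run the blocking call on a worker thread so the caller gets control back
# immediately while audio keeps playing in the background.
start = time.time()
player = threading.Thread(target=tts.speak_streamed, args=(ssml_text,), daemon=True)
player.start()
print(f"control returned after {time.time() - start:.2f} seconds")

# ... do other work here ...

player.join()  # optionally wait for playback to finish before exiting
print(f"playback finished after {time.time() - start:.2f} seconds")
```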
It turns out the SDK is slow, but the REST API is not; see https://github.com/willwade/tts-wrapper/commit/797d61795ee6376a8435108337e0c120ebdc7021.
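For a like-for-like comparison against the SDK numbers above, here is a minimal sketch of timing the service's REST endpoint directly with `requests`. The endpoint, headers and voice name follow Azure's documented text-to-speech REST API rather than anything in tts-wrapper, and it assumes `MICROSOFT_TOKEN` holds the subscription key (as the credentials tuple in the script suggests), so verify the details against the commit linked above:

```python
import os
import time

import requests

# Direct REST call to the Azure TTS endpoint (bypassing the Speech SDK).
# Endpoint and headers follow Azure's documented text-to-speech REST API;
# the voice name is an arbitrary example.
region = os.getenv("MICROSOFT_REGION")
key = os.getenv("MICROSOFT_TOKEN")
url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
headers = {
    "Ocp-Apim-Subscription-Key": key,
    "Content-Type": "application/ssml+xml",
    "X-Microsoft-OutputFormat": "audio-16khz-32kbitrate-mono-mp3",
}
ssml = (
    "<speak version='1.0' xml:lang='en-US'>"
    "<voice name='en-US-JennyNeural'>Hello, this is a REST timing test.</voice>"
    "</speak>"
)

start = time.time()
resp = requests.post(url, headers=headers, data=ssml.encode("utf-8"))
resp.raise_for_status()
print(f"REST synthesis took {time.time() - start:.2f} seconds ({len(resp.content)} bytes)")
```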
Azure is slow... or is it?
Running this code with:
`poetry run python -m pyinstrument basic_azure_test.py`
NB: So what's longer? I don't think there is much in it.
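If it helps to profile a single call rather than the whole script, pyinstrument's in-process API can be used as well; a minimal sketch, reusing the `tts` and `ssml_text` objects from the example above:

```python
from pyinstrument import Profiler

# Profile only the speak_streamed call instead of the whole script.
profiler = Profiler()
profiler.start()
tts.speak_streamed(ssml_text)
profiler.stop()
print(profiler.output_text(unicode=True))
```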