willwade / tts-wrapper

TTS-Wrapper makes it easier to use text-to-speech APIs by providing a unified and easy-to-use interface.
MIT License

Azure problem #18

Closed willwade closed 1 month ago

willwade commented 2 months ago

Azure is slow... or is it?

Running this code:

from tts_wrapper import MicrosoftTTS, MicrosoftClient
import time
import os
from load_credentials import load_credentials
import logging

# Load credentials
load_credentials("credentials.json")
client = MicrosoftClient(
    credentials=(os.getenv("MICROSOFT_TOKEN"), os.getenv("MICROSOFT_REGION"))
)
tts = MicrosoftTTS(client)

# Set up logging
logging.basicConfig(level=logging.DEBUG)

# Global variable to capture first word time
first_word_time = None

# Define a callback for when speech starts
def on_start():
    global first_word_time
    first_word_time = time.time()
    logging.info(f"First word spoken after {first_word_time - start_time:.2f} seconds")

# Connect the on_start event to the speak_streamed method
tts.connect("onStart", on_start)

# Set TTS properties
tts.set_property("volume", "100")
tts.ssml.clear_ssml()
tts.set_property("rate", "medium")

# Prepare SSML text
text_read = "Hello, this is a streamed test"
text_with_prosody = tts.construct_prosody_tag(text_read)
ssml_text = tts.ssml.add(text_with_prosody)
print("ssml_test: ", ssml_text)

# Dictionary to store method timings
method_timings = {}

# Measure speak_streamed time
start_time = time.time()
tts.speak_streamed(ssml_text)
end_time = time.time()
speakstream_time = end_time - start_time
logging.info(f"speak_streamed method took {speakstream_time:.2f} seconds")
method_timings["speak_streamed"] = speakstream_time

# Calculate first word spoken time
if first_word_time is not None:
    first_word_spoken_time = first_word_time - start_time
    method_timings["first_word_spoken"] = first_word_spoken_time
    logging.info(f"First word spoken after {first_word_spoken_time:.2f} seconds")

# Measure synth_to_file time
start_time = time.time()
tts.synth_to_file(ssml_text, "test-microsoft.mp3", "mp3")
end_time = time.time()
synthfile_time = end_time - start_time
logging.info(f"synth_to_file method took {synthfile_time:.2f} seconds")
method_timings["synth_to_file"] = synthfile_time

# Measure synth_to_bytestream time
start_time = time.time()
bytestream = tts.synth_to_bytestream(ssml_text)
with open("output_testazure.mp3", "wb") as f:
    f.write(bytestream.read())
end_time = time.time()
bytestream_time = end_time - start_time
logging.info(f"synth_to_bytestream method took {bytestream_time:.2f} seconds")
method_timings["synth_to_bytestream"] = bytestream_time

# Measure speak time
start_time = time.time()
tts.speak(ssml_text)
end_time = time.time()
speak_time = end_time - start_time
logging.info(f"speak method took {speak_time:.2f} seconds")
method_timings["speak"] = speak_time

# Find the fastest method
fastest_method = min(method_timings, key=method_timings.get)
fastest_time = method_timings[fastest_method]

# Pretty-print the results
print("\nMethod Timing Results:")
for method, timing in method_timings.items():
    print(f"{method}: {timing:.2f} seconds")

print(
    f"\nThe fastest method was '{fastest_method}' with a time of {fastest_time:.2f} seconds."
)

if "first_word_spoken" in method_timings:
    print(
        f"\nThe first word was spoken in the speak_streamed method after {first_word_spoken_time:.2f} seconds."
    )

Run it with `poetry run python -m pyinstrument basic_azure_test.py`:

MicrosoftClient initialized with region: uksouth
ssml_test:  <speak version="1.0" xml:lang="en-US" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts"><voice name="en-US-JennyNeural"><prosody rate="medium" volume="100">Hello, this is a streamed test</prosody></voice></speak>
DEBUG:root:Word: Hello, Start: 0.050s, Duration: 0.525s
DEBUG:root:Word: ,, Start: 0.675s, Duration: 0.100s
DEBUG:root:Word: this, Start: 0.775s, Duration: 0.188s
DEBUG:root:Word: is, Start: 0.975s, Duration: 0.100s
DEBUG:root:Word: a, Start: 1.087s, Duration: 0.062s
DEBUG:root:Word: streamed, Start: 1.163s, Duration: 0.362s
DEBUG:root:Word: test, Start: 1.538s, Duration: 0.588s
INFO:root:Captured 7 word timings
INFO:root:First word spoken after 0.57 seconds
INFO:root:speak_streamed method took 4.11 seconds
INFO:root:First word spoken after 0.57 seconds
Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6.
DEBUG:root:Word: Hello, Start: 0.050s, Duration: 0.525s
DEBUG:root:Word: ,, Start: 0.675s, Duration: 0.100s
DEBUG:root:Word: this, Start: 0.775s, Duration: 0.188s
DEBUG:root:Word: is, Start: 0.975s, Duration: 0.100s
DEBUG:root:Word: a, Start: 1.087s, Duration: 0.062s
DEBUG:root:Word: streamed, Start: 1.163s, Duration: 0.362s
DEBUG:root:Word: test, Start: 1.538s, Duration: 0.588s
INFO:root:Captured 7 word timings
INFO:root:synth_to_file method took 0.48 seconds
Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6.
DEBUG:root:Word: Hello, Start: 0.050s, Duration: 0.525s
DEBUG:root:Word: ,, Start: 0.675s, Duration: 0.100s
DEBUG:root:Word: this, Start: 0.775s, Duration: 0.188s
DEBUG:root:Word: is, Start: 0.975s, Duration: 0.100s
DEBUG:root:Word: a, Start: 1.087s, Duration: 0.062s
DEBUG:root:Word: streamed, Start: 1.163s, Duration: 0.362s
DEBUG:root:Word: test, Start: 1.538s, Duration: 0.588s
INFO:root:Captured 7 word timings
INFO:root:synth_to_bytestream method took 0.32 seconds
DEBUG:root:Word: Hello, Start: 0.050s, Duration: 0.525s
DEBUG:root:Word: ,, Start: 0.675s, Duration: 0.100s
DEBUG:root:Word: this, Start: 0.775s, Duration: 0.188s
DEBUG:root:Word: is, Start: 0.975s, Duration: 0.100s
DEBUG:root:Word: a, Start: 1.087s, Duration: 0.062s
DEBUG:root:Word: streamed, Start: 1.163s, Duration: 0.362s
DEBUG:root:Word: test, Start: 1.538s, Duration: 0.588s
INFO:root:Speech synthesized for text [<speak version="1.0" xml:lang="en-US" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts"><voice name="en-US-JennyNeural"><prosody rate="medium" volume="100">Hello, this is a streamed test</prosody></voice></speak>]
INFO:root:speak method took 0.09 seconds

Method Timing Results:
speak_streamed: 4.11 seconds
first_word_spoken: 0.57 seconds
synth_to_file: 0.48 seconds
synth_to_bytestream: 0.32 seconds
speak: 0.09 seconds

The fastest method was 'speak' with a time of 0.09 seconds.

The first word was spoken in the speak_streamed method after 0.57 seconds.

  _     ._   __/__   _ _  _  _ _/_   Recorded: 10:28:54  Samples:  557
 /_//_/// /_\ / //_// / //_'/ //     Duration: 6.892     CPU time: 2.316
/   _/                      v4.7.3

Program: basic_azure_example.py

6.892 <module>  basic_azure_example.py:1
├─ 4.108 MicrosoftTTS.speak_streamed  tts_wrapper/tts.py:218
│  ├─ 3.415 Thread.join  threading.py:1064
│  │     [1 frames hidden]  threading
│  │        3.415 lock.acquire  <built-in>
│  ├─ 0.575 MicrosoftTTS.synth_to_bytes  tts_wrapper/engines/microsoft/microsoft.py:111
│  │  └─ 0.569 ResultFuture.get  azure/cognitiveservices/speech/speech.py:571
│  │        [2 frames hidden]  azure
│  └─ 0.117 MicrosoftTTS.setup_stream  tts_wrapper/tts.py:270
│     └─ 0.078 OutputStream.start  sounddevice.py:1111
├─ 1.876 <module>  tts_wrapper/__init__.py:1
│  ├─ 1.329 <module>  tts_wrapper/engines/__init__.py:1
│  │  └─ 1.235 <module>  tts_wrapper/engines/google/__init__.py:1
│  │     └─ 1.230 <module>  tts_wrapper/engines/google/client.py:1
│  │        └─ 1.227 <module>  google/cloud/texttospeech_v1beta1/__init__.py:1
│  │              [20 frames hidden]  google, pyasn1, <built-in>, requests,...
│  └─ 0.543 <module>  tts_wrapper/tts.py:1
│     ├─ 0.302 <module>  sounddevice.py:1
│     │     [1 frames hidden]  sounddevice
│     └─ 0.231 <module>  numpy/__init__.py:1
│           [3 frames hidden]  numpy
├─ 0.484 MicrosoftTTS.synth_to_file  tts_wrapper/tts.py:181
│  └─ 0.476 MicrosoftTTS.synth_to_bytes  tts_wrapper/engines/microsoft/microsoft.py:111
│     └─ 0.439 ResultFuture.get  azure/cognitiveservices/speech/speech.py:571
│           [2 frames hidden]  azure
├─ 0.322 MicrosoftTTS.synth_to_bytestream  tts_wrapper/tts.py:155
│  └─ 0.321 MicrosoftTTS.synth_to_bytes  tts_wrapper/engines/microsoft/microsoft.py:111
│     └─ 0.284 ResultFuture.get  azure/cognitiveservices/speech/speech.py:571
│           [2 frames hidden]  azure
└─ 0.093 MicrosoftTTS.speak  tts_wrapper/engines/microsoft/microsoft.py:51
   └─ 0.093 ResultFuture.get  azure/cognitiveservices/speech/speech.py:571
         [2 frames hidden]  azure

To view this report with different options, run:
    pyinstrument --load-prev 2024-09-17T10-28-54 [options]

Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6.

NB: So what's slower? I don't think there's much in it.
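As a sanity check on the numbers above: the word timings from the log say the synthesized audio itself is only about 2.1 s long, so most of the 4.11 s spent in `speak_streamed` is one-off setup plus waiting on the playback thread, not synthesis. A quick back-of-envelope using the logged values:

```python
# Word timings copied from the log above: (start, duration) in seconds
word_timings = [
    (0.050, 0.525), (0.675, 0.100), (0.775, 0.188), (0.975, 0.100),
    (1.087, 0.062), (1.163, 0.362), (1.538, 0.588),
]

# The audio ends when the last word finishes speaking
audio_length = max(start + dur for start, dur in word_timings)
speak_streamed_wall = 4.11  # measured wall time from the log

# Whatever is left over is setup + playback-thread overhead
overhead = speak_streamed_wall - audio_length
print(f"audio ≈ {audio_length:.2f}s, overhead ≈ {overhead:.2f}s")
# → audio ≈ 2.13s, overhead ≈ 1.98s
```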

willwade commented 2 months ago

What is interesting is that the bytestream method calls synth_to_bytes (i.e. not streaming), and it's faster. I imagine for long passages of text this won't be the case.
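A toy model (pure simulation, not the Azure SDK) of why true chunked streaming should matter more as text grows: time-to-first-audio stays roughly constant per chunk, while a full-buffer synth can return nothing until every chunk is done. The chunk count and per-chunk cost here are made up for illustration:

```python
import time

def synth_full(n_chunks, per_chunk=0.01):
    """Simulate non-streaming synthesis: nothing is available until all chunks are done."""
    time.sleep(n_chunks * per_chunk)
    return [b"audio"] * n_chunks

def synth_chunks(n_chunks, per_chunk=0.01):
    """Simulate streaming synthesis: yield each chunk as soon as it is ready."""
    for _ in range(n_chunks):
        time.sleep(per_chunk)
        yield b"audio"

for n in (5, 50):
    t0 = time.time()
    synth_full(n)
    full_first = time.time() - t0  # first audio only after everything is synthesized

    t0 = time.time()
    next(synth_chunks(n))
    stream_first = time.time() - t0  # first audio after just one chunk

    print(f"{n} chunks: full={full_first:.3f}s, streamed first chunk={stream_first:.3f}s")
```

For a short utterance the two are close; at 50 chunks the streamed path delivers its first audio an order of magnitude sooner.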

willwade commented 2 months ago

If we change the text to a longer passage we get:

text_read = """In the age of digital transformation, technologies like artificial intelligence and machine learning 
are reshaping industries at a rapid pace. The automation of processes, once considered science fiction, is now a 
reality that businesses across sectors are embracing. From healthcare to finance, the impact of AI is profound, 
offering solutions that improve efficiency and decision-making. However, as with any powerful technology, AI also 
comes with its own set of challenges, particularly in areas of ethics, privacy, and employment. It is crucial that 
we navigate these challenges thoughtfully to ensure a balanced and fair future."""
speak_streamed: 50.06 seconds
first_word_spoken: 1.00 seconds
synth_to_file: 0.92 seconds
synth_to_bytestream: 0.91 seconds
speak: 0.63 seconds

The fastest method was 'speak' with a time of 0.63 seconds.

willwade commented 2 months ago

Ah! It's the overhead in setting up the first call!

So if we run this the other way round:

from tts_wrapper import MicrosoftTTS, MicrosoftClient
import time
import os
from load_credentials import load_credentials
import logging

# Load credentials
load_credentials("credentials.json")
client = MicrosoftClient(
    credentials=(os.getenv("MICROSOFT_TOKEN"), os.getenv("MICROSOFT_REGION"))
)
tts = MicrosoftTTS(client)

# Set up logging
logging.basicConfig(level=logging.DEBUG)

# Global variable to capture first word time
first_word_time = None

# Define a callback for when speech starts
def on_start():
    global first_word_time
    first_word_time = time.time()
    logging.info(f"First word spoken after {first_word_time - start_time:.2f} seconds")

# Connect the on_start event to the speak_streamed method
tts.connect("onStart", on_start)

# Set TTS properties
tts.set_property("volume", "100")
tts.ssml.clear_ssml()
tts.set_property("rate", "medium")

# Prepare SSML text
text_read = """In the age of digital transformation, technologies like artificial intelligence and machine learning 
are reshaping industries at a rapid pace. The automation of processes, once considered science fiction, is now a 
reality that businesses across sectors are embracing. From healthcare to finance, the impact of AI is profound, 
offering solutions that improve efficiency and decision-making. However, as with any powerful technology, AI also 
comes with its own set of challenges, particularly in areas of ethics, privacy, and employment. It is crucial that 
we navigate these challenges thoughtfully to ensure a balanced and fair future."""
text_with_prosody = tts.construct_prosody_tag(text_read)
ssml_text = tts.ssml.add(text_with_prosody)
print("ssml_test: ", ssml_text)

# Dictionary to store method timings
method_timings = {}

# Measure synth_to_file time
start_time = time.time()
tts.synth_to_file(ssml_text, "test-microsoft.mp3", "mp3")
end_time = time.time()
synthfile_time = end_time - start_time
logging.info(f"synth_to_file method took {synthfile_time:.2f} seconds")
method_timings["synth_to_file"] = synthfile_time

# Measure synth_to_bytestream time
start_time = time.time()
bytestream = tts.synth_to_bytestream(ssml_text)
with open("output_testazure.mp3", "wb") as f:
    f.write(bytestream.read())
end_time = time.time()
bytestream_time = end_time - start_time
logging.info(f"synth_to_bytestream method took {bytestream_time:.2f} seconds")
method_timings["synth_to_bytestream"] = bytestream_time

# Measure speak time
start_time = time.time()
tts.speak(ssml_text)
end_time = time.time()
speak_time = end_time - start_time
logging.info(f"speak method took {speak_time:.2f} seconds")
method_timings["speak"] = speak_time

# Measure speak_streamed time
start_time = time.time()
tts.speak_streamed(ssml_text)
end_time = time.time()
speakstream_time = end_time - start_time
logging.info(f"speak_streamed method took {speakstream_time:.2f} seconds")
method_timings["speak_streamed"] = speakstream_time

# Calculate first word spoken time
if first_word_time is not None:
    first_word_spoken_time = first_word_time - start_time
    method_timings["first_word_spoken"] = first_word_spoken_time
    logging.info(f"First word spoken after {first_word_spoken_time:.2f} seconds")

# Find the fastest method
fastest_method = min(method_timings, key=method_timings.get)
fastest_time = method_timings[fastest_method]

# Pretty-print the results
print("\nMethod Timing Results:")
for method, timing in method_timings.items():
    print(f"{method}: {timing:.2f} seconds")

print(
    f"\nThe fastest method was '{fastest_method}' with a time of {fastest_time:.2f} seconds."
)

if "first_word_spoken" in method_timings:
    print(
        f"\nThe first word was spoken in the speak_streamed method after {first_word_spoken_time:.2f} seconds."
    )

we get:

Method Timing Results:
synth_to_file: 1.03 seconds
synth_to_bytestream: 0.77 seconds
speak: 0.60 seconds
speak_streamed: 49.97 seconds
first_word_spoken: 0.70 seconds

The fastest method was 'speak' with a time of 0.60 seconds.

The first word was spoken in the speak_streamed method after 0.70 seconds.

  _     ._   __/__   _ _  _  _ _/_   Recorded: 10:53:33  Samples:  621
 /_//_/// /_\ / //_// / //_'/ //     Duration: 53.711    CPU time: 4.438
/   _/                      v4.7.3

Program: basic_azure_example.py

53.694 <module>  basic_azure_example.py:1
├─ 49.973 MicrosoftTTS.speak_streamed  tts_wrapper/tts.py:218
│  ├─ 49.124 Thread.join  threading.py:1064
│  │     [1 frames hidden]  threading
│  │        49.117 lock.acquire  <built-in>
│  └─ 0.697 MicrosoftTTS.synth_to_bytes  tts_wrapper/engines/microsoft/microsoft.py:111
│     └─ 0.658 ResultFuture.get  azure/cognitiveservices/speech/speech.py:571
│           [2 frames hidden]  azure
├─ 1.301 <module>  tts_wrapper/__init__.py:1
│  ├─ 0.668 <module>  tts_wrapper/engines/__init__.py:1
│  └─ 0.629 <module>  tts_wrapper/tts.py:1
├─ 1.030 MicrosoftTTS.synth_to_file  tts_wrapper/tts.py:181
│  └─ 0.943 MicrosoftTTS.synth_to_bytes  tts_wrapper/engines/microsoft/microsoft.py:111
│     └─ 0.938 ResultFuture.get  azure/cognitiveservices/speech/speech.py:571
│           [2 frames hidden]  azure
├─ 0.767 MicrosoftTTS.synth_to_bytestream  tts_wrapper/tts.py:155
│  └─ 0.765 MicrosoftTTS.synth_to_bytes  tts_wrapper/engines/microsoft/microsoft.py:111
│     └─ 0.726 ResultFuture.get  azure/cognitiveservices/speech/speech.py:571
│           [2 frames hidden]  azure
└─ 0.600 MicrosoftTTS.speak  tts_wrapper/engines/microsoft/microsoft.py:51
   └─ 0.600 ResultFuture.get  azure/cognitiveservices/speech/speech.py:571
         [2 frames hidden]  azure
willwade commented 1 month ago

Turns out the SDK is slow - but not the REST API - see https://github.com/willwade/tts-wrapper/commit/797d61795ee6376a8435108337e0c120ebdc7021
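For anyone following along, the REST path is the documented Azure TTS v1 endpoint: a POST of SSML to the regional host with a subscription-key header. A minimal sketch of the request (only built here, not sent; the key, region, and output format are placeholders):

```python
import os

def build_tts_request(region: str, key: str, ssml: str):
    """Build the Azure TTS REST call: POST SSML to the regional v1 endpoint."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "audio-16khz-128kbitrate-mono-mp3",
        "User-Agent": "tts-wrapper-test",  # Azure requires a User-Agent on this endpoint
    }
    return url, headers, ssml.encode("utf-8")

url, headers, body = build_tts_request(
    "uksouth", os.getenv("MICROSOFT_TOKEN", "key"), "<speak>hi</speak>"
)
print(url)
# Send with e.g.: requests.post(url, headers=headers, data=body)
```

Unlike the SDK's websocket connection, this is a single stateless HTTP request, which is presumably why it avoids the setup cost profiled above.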