Closed enla51 closed 1 year ago
I think you can get this information from the alignment matrix pred_aln_trg in the inference notebook: https://github.com/yl4579/StyleTTS/blob/main/Demo/Inference_LibriTTS.ipynb
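To make that suggestion concrete, here is a minimal sketch of reading per-token frame spans out of a hard alignment matrix. It assumes the matrix has one row per input token and one column per predicted frame, with a single contiguous run of ones in each row; the helper name token_frame_spans and the toy matrix are made up for illustration and are not part of the notebook. With the notebook's tensor you would presumably pass something like pred_aln_trg.cpu().tolist().

```python
def token_frame_spans(alignment):
    """Return (start_frame, n_frames) per token for a 0/1 alignment matrix.

    `alignment` is a tokens x frames nested list. Each row is assumed to
    hold one contiguous run of ones (a monotonic hard alignment).
    A token with no frames gets a start of None.
    """
    spans = []
    for row in alignment:
        n_frames = int(sum(row))
        # Index of the first non-zero entry in the row, if any.
        start = next((i for i, v in enumerate(row) if v), None)
        spans.append((start, n_frames))
    return spans


# Toy alignment: 3 tokens spread over 6 frames (illustrative data only).
toy_alignment = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
spans = token_frame_spans(toy_alignment)
print(spans)
```

The spans are in frames, not seconds. Converting them to seconds requires knowing how much audio one frame covers, which is a separate question.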
Thank you very much!
Hi @yl4579 ,
I hope you're doing well. I'm currently working on implementing word duration and start time calculations for a speech synthesis task using your model. However, I'm encountering some difficulties in ensuring the accuracy of these calculations.
Here's a brief overview of what I'm doing:
Despite my efforts, the calculated durations and start times seem too short compared to the expected values.
Here is a snippet of my current approach:
hop_size = 300
samplerate = 24000

# Calculate word durations
word_durations = []
current_word_duration = 0
frame_counts = pred_aln_trg.sum(dim=1).cpu().numpy()
for token_index, token in enumerate(tokens[0]):
    current_word_duration += frame_counts[token_index]
    if token == 16:  # Token 16 represents a space
        word_durations.append(current_word_duration * hop_size / samplerate)
        current_word_duration = 0
if current_word_duration > 0:
    word_durations.append(current_word_duration * hop_size / samplerate)

# Calculate start times
start_times = [0]
for i in range(1, len(word_durations)):
    start_times.append(start_times[-1] + word_durations[i - 1])

print(f"Word durations (seconds): {word_durations}")
print(f"Start times (seconds): {start_times}")
Despite these steps, the computed durations and start times are consistently too short. Could you provide any insights or suggestions on where I might be going wrong or how to improve the accuracy of these calculations?
I appreciate your help and look forward to your response.
Best regards, Alessandro
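One possible way to debug the "too short" result, offered as a sketch rather than a confirmed fix: instead of assuming that one alignment frame equals hop_size / samplerate seconds, derive the frame duration from the length of the synthesized waveform. If the duration predictor runs at a different frame rate than the mel spectrogram, the hard-coded value would be off by a constant factor, and this calibration would expose it. The function name word_timings, the parameter names, and the example numbers below are hypothetical. The sketch also assumes the waveform was not trimmed after synthesis and that token 16 is the space token, as in the snippet above.

```python
def word_timings(frame_counts, token_ids, seconds_per_frame, space_id=16):
    """Group per-token frame counts into (start_seconds, duration_seconds) per word.

    Frames assigned to space tokens advance the clock but are not counted
    as part of any word's duration.
    """
    timings = []
    elapsed_frames = 0  # frames consumed so far
    word_start = 0      # frame index where the current word began
    word_frames = 0     # frames accumulated for the current word
    for count, token in zip(frame_counts, token_ids):
        if token == space_id:
            if word_frames > 0:
                timings.append((word_start * seconds_per_frame,
                                word_frames * seconds_per_frame))
            elapsed_frames += count
            word_start = elapsed_frames
            word_frames = 0
        else:
            word_frames += count
            elapsed_frames += count
    if word_frames > 0:  # flush the last word if the text does not end in a space
        timings.append((word_start * seconds_per_frame,
                        word_frames * seconds_per_frame))
    return timings


# Illustrative numbers only: 4 tokens, the third one is a space.
frame_counts = [2, 3, 1, 4]
token_ids = [5, 6, 16, 7]
num_samples = 6000   # length of the synthesized waveform (made-up value)
sample_rate = 24000

# Calibrate the frame duration from the audio itself.
seconds_per_frame = num_samples / sample_rate / sum(frame_counts)
timings = word_timings(frame_counts, token_ids, seconds_per_frame)
print(seconds_per_frame, timings)
```

Comparing the calibrated seconds_per_frame with hop_size / samplerate (0.0125 s for the values in the snippet) should show whether the mismatch is a constant scale factor or something else.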
@alessandropettenuzzo96 Any luck on this implementation? Would highly appreciate it if you could share some insights.
@sinhprous Is there any implementation to fetch word level timestamps of generated audio? It would be really helpful.
Hi,
Is it possible to output the exact time at which each token is pronounced in the sound file? For example, if the input sentence is "How are you?", does the model output contain information such as: the second token, 'are', starts being pronounced at second 3?
Thank you very much for this amazing project!