yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

ETA of training code publication #1

Closed: spolezhaev closed this issue 9 months ago

spolezhaev commented 1 year ago

Thank you for your work! Is there any ETA on when the training and inference code will become available?

danielmsu commented 11 months ago

@yl4579 Thank you, I will check it out. By the way, I generated some samples using the pretrained model, and I think they sound a bit different from the demo samples. Do you know what might be the reason? For example, a wav from the demo, where the voice is very clear: https://voca.ro/1cJK3PM67Flj. The same text generated with the inference notebook has noticeable high-pitched noise in the background: https://voca.ro/1elSCh8VuT79

yl4579 commented 11 months ago

@danielmsu I haven't encountered this problem. I generated audio with the same text, and it works totally fine for me. Have you run the entire demo, and does every single audio you generated sound like this? If so, it sounds like some dependencies might be messed up. It is totally normal for there to be some small variation because the model is stochastic in nature, but the quality difference shouldn't be this big.

yl4579 commented 11 months ago

@danielmsu I've created a Colab notebook that you can try here: https://colab.research.google.com/drive/1k5OqSp8a-x-27xlaWr2kZh_9F9aBh39K. I have tested it and it works totally fine.

teopapad92 commented 11 months ago

hey @yl4579 @danielmsu, I ran into the same issue with some high-pitched noise in the background.

After checking dependencies, fresh installs, etc., it turns out it is related to the type of GPU you use. I had a fairly old GPU (Quadro P5000), but when I switched to a newer machine it was fine.

Also, on the old machine, if I use device='cpu' it works fine.

Not sure why this happens; maybe it is related to the way older GPUs perform some operations or handle floating-point representations?

However, I found another difference. For the example:

"Maltby and Co. would issue warrants on them deliverable to the importer, and the goods were then passed to be stored in neighboring warehouses."

in your audio samples (https://styletts2.github.io/), "Co." is pronounced "company", while in the notebook (and on my local machine) it is pronounced "co".

Example from the notebook: https://vocaroo.com/1iCB2q0HOqLh

yl4579 commented 11 months ago

@teopapad92 You actually need this text cleaner (https://github.com/jaywalnut310/vits/blob/main/text/cleaners.py) to have "Co." pronounced as "company". I used it for the paper and demo-page samples but didn't include it in the inference notebook, though I may add it later to be consistent with the demo-page audio. However, it is just a matter of phonemization, so people can do whatever they want. The training data was already phonemized this way because I took it directly from VITS.
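For reference, here is a minimal sketch of the kind of abbreviation expansion the VITS cleaners apply before phonemization. The abbreviation table below is a small illustrative subset, not the exact list from cleaners.py:

```python
import re

# Illustrative subset of the abbreviation table used by the VITS text cleaners.
_abbreviations = [
    (re.compile(r'\b%s\.' % abbr, re.IGNORECASE), expansion)
    for abbr, expansion in [
        ('co', 'company'),
        ('mr', 'mister'),
        ('mrs', 'misess'),
        ('dr', 'doctor'),
    ]
]

def expand_abbreviations(text: str) -> str:
    """Replace common abbreviations so the phonemizer sees the full word."""
    for regex, replacement in _abbreviations:
        text = re.sub(regex, replacement, text)
    return text

print(expand_abbreviations("Maltby and Co. would issue warrants on them."))
# -> "Maltby and company would issue warrants on them."
```

Running the input text through a cleaner like this before phonemization should make the notebook output match the demo-page pronunciation.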

As for the high-pitched distortion, can you make sure it is a GPU problem rather than a dependency problem? Have you tested with the exact same environment, and does the distortion disappear after changing only the GPU? I have tried it on GPUs as old as an NVIDIA 780 and it still works. I think it's a great idea to open a new issue for this problem.

danielmsu commented 11 months ago

@yl4579 Thank you for the answer and the Colab notebook. I switched to CPU as suggested by @teopapad92 and can confirm that the issue is gone.

yl4579 commented 11 months ago

@danielmsu This is so weird, can you open a new issue so other people can have a reference?

astricks commented 11 months ago

Might be some library or framework that’s different, or maybe how different precisions are being handled by the GPU?

danielmsu commented 11 months ago

Tested one more time inside a Docker container (with the pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel image) to get a clean install; the results are the same. I have a card with the Pascal architecture, which I believe doesn't have proper fp16 support, and I remember having some issues with LLMs because of that. I will experiment a bit more and see if I can solve it with a driver reinstall or something like that, and I will open a new issue if nothing helps.
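For anyone else hitting this, here is a small sketch of the workaround discussed in this thread: check the CUDA compute capability and fall back to CPU on older cards (Pascal is 6.x). The 7.0 threshold is an assumption based on the reports here, not an official fix:

```python
import torch

def pick_device(min_capability: float = 7.0) -> str:
    """Return 'cuda' only if the GPU is newer than the architectures reported as problematic."""
    if not torch.cuda.is_available():
        return 'cpu'
    major, minor = torch.cuda.get_device_capability()
    if major + minor / 10 < min_capability:
        # Older architectures (e.g. Pascal): run inference on CPU as suggested above.
        return 'cpu'
    return 'cuda'

device = pick_device()
print(f"Running StyleTTS 2 inference on: {device}")
```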

yl4579 commented 9 months ago

I have tested the current code and was able to reproduce models with quality similar to those used in the paper and demo, so I think this issue is now complete. Please open new issues if there are more problems with the current code.

yl4579 commented 9 months ago

@nivibilla I've pushed the fine-tuning script, and I tried it myself with one hour of LJSpeech data using the pre-trained LibriTTS model. It sounds better than TortoiseTTS for sure (both quality and speaker similarity), but it is still worse than models trained from scratch on the full data (24 hours of audio). The quality is still better than VITS and JETS and close to NaturalSpeech, so I think it is good enough for one hour of data.
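For anyone who wants to try it, the fine-tuning run is launched with the new script and its config (script name and flag as in the repo at the time of writing; adjust the data paths and pretrained checkpoint inside the config to your own setup):

```bash
python train_finetune.py --config_path ./Configs/config_ft.yml
```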

nivibilla commented 9 months ago

@yl4579 thanks so much. Will try it out when I get the time.

ghost commented 9 months ago

@yl4579 Can you share the minimum amount of audio required for a new speaker, and how long fine-tuning takes?

yl4579 commented 9 months ago

@primepake It depends on the quality you want to achieve. The more data the better, but I tried to fine-tune it with 10 minutes of audio and it still works: the speaker similarity is much better, though I wouldn't say the naturalness is.

danielmsu commented 9 months ago

@yl4579 Sorry for hijacking the conversation, but do I understand correctly that after fine-tuning we still need to provide a reference audio, but the results are much more similar to the target speaker than with zero-shot voice cloning?

yl4579 commented 9 months ago

@danielmsu Yes, but if it's a single-speaker dataset you are fine-tuning on, the reference can be arbitrary and doesn't affect the synthesized speech. You can also set the multispeaker flag to false and skip loading the diffusion model when fine-tuning, if you know your new dataset has only one speaker.
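To make the single-speaker case concrete, here is a rough sketch of toggling that flag in the fine-tuning config. The config file name and the model_params/multispeaker keys follow the repo's Configs/*.yml layout as I understand it; treat the exact names as an assumption and check your own config:

```python
import yaml

# Load the fine-tuning config (path assumed; adjust to your setup).
with open("Configs/config_ft.yml") as f:
    config = yaml.safe_load(f)

# Single-speaker fine-tuning: disable the multispeaker path so the
# speaker-conditioned components are not needed.
config["model_params"]["multispeaker"] = False

with open("Configs/config_ft_single_speaker.yml", "w") as f:
    yaml.safe_dump(config, f)
```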