For your first question, you will need to change the SOS index in model.py, but as you said, it doesn't matter if the semicolon never appears in your text.
For your second question, token 0 in StyleTTS was used as SOS and EOS instead of blank. Blank is index 16. You can see there is no SOS and EOS in StyleTTS. Also, for the text aligner, we removed the SOS token at the beginning of the attention, so it doesn't matter in the end.
@shaun95 Were you able to get reasonable results by training everything from scratch? I also retrained AuxiliaryASR and PitchExtractor, but when training StyleTTS I am not able to achieve good results on the multi-speaker dataset I am using.
@lexkoro I was able to train all the necessary models from scratch. I had to make some guesses about parts of the AuxiliaryASR code, since if you compare the models there are some slight differences from those provided under the StyleTTS repo. While the resulting StyleTTS does speak, there are some quality issues, like short pauses inserted between words (especially on long utterances) and some repeated phonemes, which reminded me of the old generation of Tacotrons. I am guessing it has something to do with the TMA code I trained from scratch with my adaptations. It's a pity, as it takes a long time to train all of these separate models and put them together. Currently the quality I get is not at the level of the reference implementation; the emotion transfer is definitely working, but the overall TTS quality is still not on par with the demo samples. Incidentally, I am using a different female-speaker dataset, and I am not training on multiple speakers.
@lexkoro Do you mean your ASR and F0 models perform worse than those I released on the same dataset?
@shaun95 The training code should be the same for the ASR models regardless of how StyleTTS uses them. I only modified the inference code to incorporate the CE loss for alignment, which turned out to be unimportant in practice. I have recipes for the official HiFi-GAN and BigVGAN at 22.05 kHz if you need them. It works as well as the demo on LJSpeech, though I don't have time to test on multi-speaker data as I'm working on E2E training now. It will be named StyleTTS 2; you won't need to retrain any components for E2E training, though you will need more RAM. The quality is also much better with E2E training, so stay tuned for that.
@yl4579 Thank you for your response. The audio samples you provided for StyleTTS are excellent. My goal was to replicate the training steps in StyleTTS so that I could test a new voice on a model trained on a single speaker, without relying on the provided checkpoints for Pitch or ASR. To achieve this, I trained a new AuxiliaryASR and PitchExtractor using the original training repositories, as indicated in the StyleTTS readme. However, I ran into some issues once each had finished training and I tried to load them into StyleTTS, so I decided to make some changes to the code in each training repository (and retrain). For instance, I assumed that the dropouts applied in the Pitch training repository should be the same as those in StyleTTS/Utils/JDC. [Edit: I missed the remark on E2E.] Once again, thank you for this repository; I appreciate it greatly. I look forward to StyleTTS 2.
@shaun95 Thanks for the reply! I experience the same, and I know of at least one other person who was also not able to reproduce the quality of the provided examples.
@yl4579 Hey, no, the newly trained ASR and F0 models seemed fine judging by the losses. Training the ASR was a bit challenging, since the loss would sometimes go to NaN after the first few steps. I trained on a highly expressive dataset from video games in German, English, Polish, Russian, and Japanese.
Now I have tried training StyleTTS on a small subset of the data used for training F0 and ASR, but I am not able to achieve satisfying results. As @shaun95 mentioned, the style transfer works quite well, but the quality doesn't seem to improve, or rather, it is nowhere near the released examples.
Training E2E will be a welcome change, since training all those components is quite cumbersome! :) Thanks for open-sourcing your code! Even if I won't be able to reproduce the results with my own data, I really appreciate you sharing the project! 👍
@shaun95 You don't need to modify anything. You can just train it with the exact same code I pushed and load that checkpoint for StyleTTS. There's no dropout in StyleTTS because during inference we do not need those things. They were only used for training.
@lexkoro That is weird because all the models I released were trained with exactly the same code I pushed to the repo. I also trained the ASR and F0 models with HifiGAN recipes in 22.05 kHz and was able to reproduce the results on LJSpeech, so I believe the code should be good. I'm wondering if it could be a problem with the dataset. Can you maybe test it with the pre-trained model directly on your dataset, without changing anything?
@yl4579 OK, I am going to re-examine my steps. What about the use of g2p_en versus phonemizer in the text aligner (StyleTTS vs. AuxiliaryASR)? Don't I need to account for that and make both use the same one, or does it not matter? I appreciate your feedback!
@shaun95 The ASR repo uses G2P, but StyleTTS uses phonemizer, which is the major difference. You can easily change the ASR repo to use phonemizer by using the exact same config file and tokenizer as StyleTTS. You don't need to change any code except meldataset.py.
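For what it's worth, the swap in meldataset.py can be as small as replacing the G2P call with a phonemizer call. Below is a rough sketch, not the exact StyleTTS code; the espeak backend and the flags are assumptions, so copy the actual settings from the StyleTTS config and meldataset.py:

```python
# Rough sketch: replace the G2P call in AuxiliaryASR's meldataset.py with
# phonemizer so the token inventory matches StyleTTS. The backend/flags below
# are assumptions; mirror whatever StyleTTS itself uses.
from phonemizer import phonemize

def text_to_phonemes(text: str) -> str:
    return phonemize(
        text,
        language="en-us",
        backend="espeak",
        strip=True,
        preserve_punctuation=True,
        with_stress=True,
    )
```

The resulting phoneme string should then go through the same tokenizer/symbol dictionary as StyleTTS so both models index the identical token set.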
@yl4579 Testing with the pre-trained model, the synthesized quality is much better. Of course, the speaker similarity is sometimes quite off, which makes sense since it's zero-shot in this case.
So now I am wondering where the problem in my training might stem from. I have modified meldataset.py in each module to use the mel processing from BigVGAN. Additionally, I tried increasing style_dim to 256 and dim_in to 128, since I have high variation in the speech, though I'm not sure that even makes sense. The problem also persisted with the default settings.
Personally, it feels like the pitch of the synthesized speech is somewhat off in my case, making the voices either too high- or too low-pitched, and the output sounds wobbly/scratchy.
How long did you train the LibriTTS model? Judging by the config it did converge quite fast.
My changes can be found here https://github.com/lexkoro/StyleTTS
@lexkoro, based on my experience, I didn't encounter any issues with the intonation being too high or low during my experiments. The short utterances sounded good, but I noticed a decline in quality for medium to long utterances. Specifically, I observed unnatural pacing of words with pauses between them, and sometimes mumbling occurred. However, the emotion style transfer worked well overall.
During my training, I trained all models from scratch three times consecutively, once with TMA_CE loss set on and twice with it off. Although I didn't achieve satisfactory alignment during training with StyleTTS when using TMA_CE set to True, it was still better than when not using it. Additionally, I trained a new HiFiGAN model on the speaker dataset, and the synthesizer worked well without any signal issues. When judging the vocoder alone, the speech output sounded fine.
However, I do believe that the pacing of words and the mumbling could indicate an alignment issue related to my training of AuxiliaryASR or StyleTTS. One possible contributing factor is that I attempted to use and share similar code in model.py across all repositories. I did this to avoid issues when loading the newly trained AuxiliaryASR or Pitch models, ensuring that when they are loaded in StyleTTS they follow the same code path as when they were originally trained.
@shaun95 Can you look at the alignment from the Tensorboard log to see if it looks monotonic?
@lexkoro For LibriTTS, I trained it for only around 50 epochs for the first stage and 30 epochs for the second stage.
@yl4579 In the two instances where I set TMA_CE to False, the alignment was consistently monotonic. For my responses in this forum, I utilized the output from these training sessions to report my findings. Both attempts had a similar quality. However, when TMA_CE was set to True, the alignments were not accurate and the speech output was unintelligible. I did not investigate the reason for this, as I proceeded with experiments using TMA_CE=False, particularly after noticing that the default setting in this repository was set to False not too long ago. Your feedback is appreciated.
@yl4579 - I've been reading through this and have a couple of questions regarding the AuxiliaryASR and PitchExtractor changes you've made in the StyleTTS code:
AuxiliaryASR - this line: https://github.com/yl4579/AuxiliaryASR/blob/734869027f89beedce598e10bc9df76688d677a3/models.py#L148 In the StyleTTS repo you pass in the "alignment" whereas in the original AuxiliaryASR they pass in the "attention_weights" - why this change? (this is of course also part of the attention layer output change - but still, why?)
PitchExtractor - You've modified the forward pass quite a bit, simply to use it as an inference step - why not just add an inference method and keep the forward pass from training as is? In the same vein, why change the dropout values? You also seem to assume a different shape of the input - how come?
Thanks a lot for some great work!
@shaun95 Sorry for the late reply. I didn't see this until now. Hope your problem was resolved, but if not, here is my two cents for your problem: It would be a good idea to check the performance after just the first stage of training. How was the quality of the model after the first stage, like did the reconstruction work well?
@RasmusD The change was to make the alignment over the phoneme axis instead of over the mel axis. In AuxiliaryASR, the model was trained to align mel-spectrograms with texts (i.e., the input is mel-spectrograms and the output is the text) because it is an ASR model, whereas in TTS we want the input to be texts and the output to be the mel-spectrogram. The latter is the reverse problem, but you cannot simply transpose the attention matrix, because the attention in AuxiliaryASR was normalized across the mel-spectrogram frames instead of the phoneme tokens. I changed it specifically to renormalize the attention along the correct axis.
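To make the axis change concrete, here is an illustration only (not the code from the repo); the tensor layout is an assumption:

```python
import torch

def asr_attention_to_tts_alignment(attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Illustrative sketch of the renormalization described above.

    attn:    [batch, text_len, mel_len]; each text step's weights sum to 1
             across mel frames (the ASR direction).
    returns: [batch, mel_len, text_len]; each mel frame's weights sum to 1
             across phoneme tokens (the TTS direction).
    """
    align = attn.transpose(1, 2)
    # A plain transpose leaves rows that no longer sum to 1,
    # so renormalize along the phoneme axis.
    return align / (align.sum(dim=-1, keepdim=True) + eps)
```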
As for the pitch extractor, it was my fault that I didn't add an inference method but just changed the model directly. It was just for simplicity, as there should be no dropout during inference. The shape of input during training of the F0 model was fixed to be 192, while the shape of input for TTS should be variable, so I removed that constraint.
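If you prefer not to touch the forward pass, a hedged alternative (just a sketch, not what the repo does) is a thin inference helper: calling .eval() already disables dropout, and variable-length input works provided the original forward pass can handle it (if it cannot, the 192-frame constraint still has to be removed as described above):

```python
import torch

@torch.no_grad()
def extract_f0(model: torch.nn.Module, mel: torch.Tensor):
    """Hypothetical inference wrapper for the pitch extractor.

    mel: mel-spectrogram batch with an arbitrary number of frames,
         instead of the fixed 192-frame segments used during training.
    """
    model.eval()       # eval mode turns off dropout without editing the model
    return model(mel)  # returns whatever the F0 model's forward produces
```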
That's great - thanks a lot for the clarifications!
I'm currently trying to replicate the results from your pretrained model, but I'm not getting the expected results when training StyleTTS using an AuxiliaryASR model that I trained myself. When using the pretrained ASR model provided in the repo, I get diagonal alignments after training StyleTTS for only one epoch:
However, when training AuxiliaryASR myself, the alignments are incorrect after even 20 epochs:
For training the AuxiliaryASR model, I used the config from this repo provided in Utils/ASR/config.yml, and I trained it on the files from train_list.txt in the AuxiliaryASR repo (i.e., the subset of LJSpeech, resampled to 24 kHz). The strange thing is that the ASR model seems to have trained correctly, judging by the eval alignments logged to TensorBoard during training:
Any ideas @yl4579 what might have gone wrong here? Should I have trained AuxiliaryASR on a different dataset, did I not use the correct config file for AuxiliaryASR, or is it something else?
In any case, thanks for providing the code and models for this project!
@glimperg did you make sure to update the dataloader to conform to the StyleTTS dataloader?
@RasmusD Thanks! I was in fact not using the StyleTTS dataloader for training the AuxiliaryASR model. After changing it, I seem to get similar results compared to using the pretrained AuxiliaryASR model.
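(For anyone landing here with the same problem: the point is that the AuxiliaryASR dataloader has to produce exactly the same mel features and token IDs as the StyleTTS one. A rough sketch of the kind of shared transform involved is below; the parameter values are assumptions, so verify them against StyleTTS's meldataset.py before reusing.)

```python
import torch
import torchaudio

# Assumed to mirror the StyleTTS mel settings for 24 kHz audio;
# check StyleTTS/meldataset.py for the authoritative values.
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000, n_mels=80, n_fft=2048, win_length=1200, hop_length=300
)
mean, std = -4, 4  # log-mel normalization constants (assumption)

def wav_to_logmel(wav: torch.Tensor) -> torch.Tensor:
    """wav: [1, samples] float tensor at 24 kHz -> normalized log-mel."""
    mel = to_mel(wav)
    return (torch.log(1e-5 + mel) - mean) / std
```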
Great work, thanks.