yanggeng1995 / vae_tacotron


My results are not as good as the examples #5

Open AndroXD opened 5 years ago

AndroXD commented 5 years ago

I trained my model on LJSpeech up to 63K steps with the latest repo changes, then compared my results with http://home.ustc.edu.cn/~zyj008/ICASSP2019/ (Style transfer > 1. Parallel transfer > 3. "I s'pose it's because I am Ojo the Unlucky that everyone who tries to help me gets into trouble.").

But my results sound completely different from that: https://drive.google.com/drive/folders/1HJ5pCmQshx5zj0Jkfpeu0GO0o_bu031F?usp=sharing

Sharing my pretrained model in case anyone is interested: https://drive.google.com/file/d/1vpwyj_BaWIxIMw8y7ZsPpBPw2nWCckZU/view?usp=sharing

So I went back to the old repo with the pretrained model trained up to 88K steps shared by ZohaibAhmed: https://github.com/yanggeng1995/vae_tacotron/issues/4#issuecomment-457888310

And it still sounds pretty much the same as mine. Is there any way to improve the results? I can't get any emotion to come through in my synthesized voices. Thanks for sharing your work and support so far.

rishikksh20 commented 5 years ago

Same here. I trained a model up to 200K steps, and when I pass a reference audio it distorts everything and generates a different voice. The pretrained models and alignment plots are here: https://drive.google.com/drive/folders/1wv74385TucxUmNm9MkB30QviPO12eZMN?usp=sharing

Audio samples:

200K model trained on LJSpeech, with an LJSpeech utterance as the reference audio (reference audio and generated samples): https://soundcloud.com/rishikesh-kumar-1/sets/vae_tacotron

Same 200K LJSpeech model, with a different voice as the reference audio:
reference audio: https://soundcloud.com/rishikesh-kumar-1/reference_audio?in=rishikesh-kumar-1/sets/vae_tacotron_with_different_reference_voice
generated samples: https://soundcloud.com/rishikesh-kumar-1/eval-200000-0?in=rishikesh-kumar-1/sets/vae_tacotron_with_different_reference_voice

It might be that it only works when the reference audio and the training audio come from the same speaker, as the authors themselves note in the conclusion: "In addition, the scope of style transfer research will further extend to multi-speakers, instead of single speaker."

rishikksh20 commented 5 years ago

@AndroXD @yanggeng1995 One observation: going through the samples uploaded by the paper's authors here: http://home.ustc.edu.cn/~zyj008/ICASSP2019/ , I noticed that VAE-Tacotron distorts and changes the original speaker's voice, which is not the case for GST-Tacotron. One thing that might lead to this is the addition of z to the encoder_output, because it directly changes the encoder output, whereas GST-Tacotron concatenates the style embedding so it does not interfere with the original encoder output (see the sketch below). It may also simply require a lot more data and training to produce good results; the authors mention that they used 105 hours of speech from a single speaker.
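To make the addition-vs-concatenation point concrete, here is a minimal NumPy sketch of the two conditioning strategies. The shapes and the projection matrix are purely illustrative and are not taken from this repository's code:

```python
import numpy as np

# Toy shapes: batch of 2 utterances, 50 encoder timesteps, 256-dim encoder
# outputs, 16-dim VAE latent z (all dimensions are illustrative).
encoder_outputs = np.random.randn(2, 50, 256).astype(np.float32)
z = np.random.randn(2, 16).astype(np.float32)

# Additive conditioning: project z to the encoder dimension, broadcast over
# time, and add it, which perturbs every encoder frame directly.
proj = np.random.randn(16, 256).astype(np.float32)  # stand-in for a learned linear layer
z_proj = z @ proj                                    # (2, 256)
added = encoder_outputs + z_proj[:, None, :]         # (2, 50, 256)

# Concatenative conditioning (GST-Tacotron style): tile the style embedding
# across time and append it as extra channels, leaving the original encoder
# features untouched.
z_tiled = np.repeat(z[:, None, :], encoder_outputs.shape[1], axis=1)  # (2, 50, 16)
concatenated = np.concatenate([encoder_outputs, z_tiled], axis=-1)    # (2, 50, 272)

print(added.shape, concatenated.shape)  # (2, 50, 256) (2, 50, 272)
```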

AndroXD commented 5 years ago

I have come to the point where I'm almost sure that no amount of effort on my part is ever going to make this sound as good as the demos, at least not without changes to the code.

Before the repo update, I trained on LJSpeech up to 543,000 steps and always got the same emotionless voice: no matter what reference audio I used, it sounded the same. Here's the pretrained model if anyone wants to give it a try. I used commit fb1878810b24e2762d7ccbe53a13a335a650b45d for training, so it will likely not work with the latest repo update: https://drive.google.com/file/d/12dfdRT9iHgUQG3VEvIZlXuhHBUCZZ_l9/view?usp=sharing

I thought the LJSpeech dataset was to blame, since the speaker is emotionless almost all the time, so I trained with the latest update on Blizzard, where I know the speaker puts emotion into the voice. I still couldn't see any improvement, so I decided to cut to the chase: I opened the log directory and copied some of the "step-XXX-audio.wav" files generated during training that did have emotion in them. The idea was to use those files as reference audio along with the exact text lines used in training. If the model can't reproduce results it has already learned, how could it possibly speak new content with emotion?

And guess what, there is still NO emotion, even though the files it was trained on definitely HAD it. Check this out: the [train] files have emotion but the [eval] files don't. Why? https://drive.google.com/drive/folders/176IxXWpBM5TytBTRWujFGUxIlZBiJQMs?usp=sharing

yanggeng1995 commented 5 years ago

Sorry for the late reply. VAE models are not easy to train, and the parameters need to be tuned experimentally, especially the vae_weight. I have not gotten good results yet; I will update the repo when there is any new progress.
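For anyone experimenting with the vae_weight, one common trick from the VAE-TTS literature (not necessarily what this repo does) is to anneal the KL weight up from zero so the latent does not collapse early in training. A minimal sketch, with all constants purely illustrative rather than the repo's defaults:

```python
def vae_weight(global_step, warmup_steps=10000, ramp_steps=40000, max_weight=1.0):
    """Illustrative KL-weight schedule: keep the KL term off during warm-up,
    then ramp it linearly up to max_weight. All constants are made-up examples."""
    if global_step < warmup_steps:
        return 0.0
    progress = (global_step - warmup_steps) / float(ramp_steps)
    return min(max_weight, progress * max_weight)

# total_loss = reconstruction_loss + vae_weight(step) * kl_loss
```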

G-Wang commented 5 years ago

Note also that the paper's dataset is the Blizzard dataset, which has a lot of prosody variation in the reading, since the reader performs different voices and styles. This is why that dataset is commonly used in prosody-control TTS papers.

The LJSpeech dataset is largely monotone: the reader uses the same style across all the books. So I doubt the VAE can learn a sufficiently varied style from LJSpeech.