ylacombe / finetune-hf-vits

Finetune VITS and MMS using HuggingFace's tools
MIT License

Speaker_id during inference #5

Open Srija616 opened 6 months ago

Srija616 commented 6 months ago

Hi @ylacombe! I have multi-speaker data with which I have fine-tuned the Hindi checkpoint. I want to generate a particular speaker's voice during inference. Is there any way to do that with the inference code given in the README?

Here is how my current code looks:

```python
import scipy
import time

from transformers import pipeline

model_id = "./vits_finetuned_hindi"
synthesiser = pipeline("text-to-speech", model_id, device=0)  # add device=0 if you want to use a GPU

speech = synthesiser("वहीं पंजाब सरकार ने सरबत खालसा के आयोजन के लिए, पंजाब के भठिंडा ज़िले में, तलवंडी साबो में, जगह देने से मना कर दिया है।")

scipy.io.wavfile.write("hindi_1.wav", rate=speech["sampling_rate"], data=speech["audio"][0])
```

ylacombe commented 6 months ago

Hey @Srija616, you can use the kwarg speaker_id like this:

```python
forward_params = {"speaker_id": XXXX}
text = "वहीं पंजाब सरकार ने सरबत खालसा के आयोजन के लिए, पंजाब के भठिंडा ज़िले में, तलवंडी साबो में, जगह देने से मना कर दिया है।"

speech = synthesiser(text, forward_params=forward_params)
```
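Put together with the script above, a minimal end-to-end sketch would look like the following. The checkpoint path is the local one from your script, and the speaker ID of 1 is only an assumed example; use one of the IDs from your fine-tuning run.

```python
import scipy.io.wavfile
from transformers import pipeline

model_id = "./vits_finetuned_hindi"  # assumed local path to the fine-tuned checkpoint
synthesiser = pipeline("text-to-speech", model_id, device=0)

text = "वहीं पंजाब सरकार ने सरबत खालसा के आयोजन के लिए, पंजाब के भठिंडा ज़िले में, तलवंडी साबो में, जगह देने से मना कर दिया है।"

# speaker_id is forwarded to the model's forward pass; 1 is just an assumed
# example and should be one of the IDs used during fine-tuning.
speech = synthesiser(text, forward_params={"speaker_id": 1})

scipy.io.wavfile.write("hindi_1.wav", rate=speech["sampling_rate"], data=speech["audio"][0])
```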

Did you fine-tune using the multi-speaker feature of the training code? Also, I'm quite curious about your impression of the model's quality, so don't hesitate to let me know. Best

Srija616 commented 6 months ago

@ylacombe Yes, we have two speakers for Hindi (male and female), and these are the two params we tweaked to enable multi-speaker training (a rough stand-in for the screenshot is sketched below). Just wondering if there are other params that need to be defined for multi-speaker training.

[image: training config screenshot]
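For readers without the screenshot: at the transformers level, multi-speaker VITS is governed by two fields on `VitsConfig`, `num_speakers` and `speaker_embedding_size`. The checkpoint name and values below are assumptions for illustration, not the actual config from the screenshot.

```python
from transformers import VitsConfig

# Hypothetical illustration only; the real values live in the training config
# shown in the (missing) screenshot above.
config = VitsConfig.from_pretrained("ylacombe/mms-tts-eng-train")
config.num_speakers = 2              # e.g. one male and one female speaker
config.speaker_embedding_size = 256  # assumed width; 0 disables speaker embeddings
```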

We are also facing two issues:

  1. During fine-tuning, train_loss_kl and val_loss_kl both go to infinity. We tested English fine-tuning with the ylacombe/mms-tts-eng-train model and hit the same problem there. train_loss_disc has NaN values, and the mel loss for both train and validation is not converging. The synthesized samples nevertheless sound good in terms of pronunciation and naturalness for English; for Hindi, we have pronunciation and naturalness issues.

  2. The voice generated by the model does not resemble the speaker in the dataset, even though I passed speaker_id as you suggested in the previous comment (see the sanity-check sketch after this list).
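A minimal way to sanity-check point 2, assuming the fine-tuned checkpoint is at ./vits_finetuned_hindi: load the model directly, confirm that more than one speaker embedding was actually saved, and render the same sentence once per speaker_id. The path and output file names are assumptions.

```python
import scipy.io.wavfile
import torch
from transformers import VitsModel, VitsTokenizer

model_id = "./vits_finetuned_hindi"  # assumed local checkpoint path
model = VitsModel.from_pretrained(model_id)
tokenizer = VitsTokenizer.from_pretrained(model_id)

# If num_speakers is still 1, the checkpoint was not saved with multi-speaker
# embeddings, and speaker_id is ignored at inference time.
print("num_speakers:", model.config.num_speakers)
print("speaker_embedding_size:", model.config.speaker_embedding_size)

text = "वहीं पंजाब सरकार ने सरबत खालसा के आयोजन के लिए, पंजाब के भठिंडा ज़िले में, तलवंडी साबो में, जगह देने से मना कर दिया है।"
inputs = tokenizer(text, return_tensors="pt")

# Generate one file per speaker so the voices can be compared by ear.
for speaker_id in range(model.config.num_speakers):
    with torch.no_grad():
        waveform = model(**inputs, speaker_id=speaker_id).waveform[0]
    scipy.io.wavfile.write(
        f"hindi_speaker_{speaker_id}.wav",
        rate=model.config.sampling_rate,
        data=waveform.numpy(),
    )
```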

Adding the wandb charts for our Hindi and English runs:

  1. Hindi [wandb loss charts]

  2. English [wandb loss charts]

@ylacombe I was wondering if you have any thoughts on why these losses go to infinity or NaN. It's possible we're missing something trivial.

I can share the generated samples over email if you'd like to hear them.

ylacombe commented 5 months ago

Hey @Srija616, sorry for the late response! Nice project! If you have two speakers, I'd recommend fine-tuning two separate models, since the original model only had one speaker and the speaker embeddings therefore have to be learned from scratch (see the dataset-splitting sketch below).
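One way to set that up, sketched with the datasets library under the assumption that the data has a speaker_id column with values 0 and 1 (the dataset name, column name, and values are hypothetical):

```python
from datasets import load_dataset

# Hypothetical dataset name and column; replace with the actual Hindi dataset
# and whatever column marks the speaker.
dataset = load_dataset("your-org/hindi-tts-dataset", split="train")

speaker_0_only = dataset.filter(lambda example: example["speaker_id"] == 0)
speaker_1_only = dataset.filter(lambda example: example["speaker_id"] == 1)

# Each subset can then be saved (or pushed to the Hub) and used as the
# dataset for its own single-speaker fine-tuning run.
speaker_0_only.save_to_disk("hindi_speaker_0")
speaker_1_only.save_to_disk("hindi_speaker_1")
```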

Can you send me your training config? I've gotten some great results from single-speaker fine-tuning.

gsyllas commented 1 week ago

Hello @ylacombe, as I am currently fine-tuning mms_tts_ell on a single-speaker dataset, would it be possible for you to assist me with the training configuration? My dataset consists of ~4 hours.