ylacombe / finetune-hf-vits

Finetune VITS and MMS using HuggingFace's tools
MIT License

Inference error #3

Open vijishmadhavan opened 10 months ago

vijishmadhavan commented 10 months ago

```python
from transformers import pipeline
import scipy

model_id = "Vijish/vits_mongolian_monospeaker"
synthesiser = pipeline("text-to-speech", model_id)  # add device=0 if you want to use a GPU

speech = synthesiser("Монголын бурхан шашинтны төв Гандантэгчэнлин хийдийн Тэргүүн их хамба Д.")

scipy.io.wavfile.write("finetuned_output.wav", rate=speech["sampling_rate"], data=speech["audio"])
```

```
/usr/local/lib/python3.10/dist-packages/scipy/io/wavfile.py in write(filename, rate, data)
    795     block_align = channels * (bit_depth // 8)
    796
--> 797     fmt_chunk_data = struct.pack('<HHIIHH', format_tag, channels, fs,
    798                                  bytes_per_second, block_align, bit_depth)
    799     if not (dkind == 'i' or dkind == 'u'):

error: ushort format requires 0 <= number <= (0x7fff * 2 + 1)
```

vijishmadhavan commented 10 months ago

```
Audio Type: <class 'numpy.ndarray'>
Audio Shape: (1, 708608)
Audio Data Type: float32
Sampling Rate: 16000
```
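A likely reading of the error, assuming scipy's documented handling of 2-D input (it interprets the array as `(Nsamples, Nchannels)`): a `(1, 708608)` array is read as 708608 channels, which cannot fit in the 16-bit channel field of the WAV header. A minimal sketch of the failing pack:

```python
import struct

# scipy.io.wavfile.write treats a 2-D array as (Nsamples, Nchannels), so a
# (1, 708608) array is read as 708608 channels. The WAV header stores the
# channel count as an unsigned 16-bit integer ('<H'), hence the overflow:
struct.pack('<H', 708608)  # raises struct.error: ushort format requires 0 <= number <= 65535
```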

vijishmadhavan commented 10 months ago

I am getting the same error while trying your model too.

```python
from transformers import pipeline
import scipy

model_id = "ylacombe/mms-spa-finetuned-argentinian-monospeaker"
synthesiser = pipeline("text-to-speech", model_id)  # add device=0 if you want to use a GPU

speech = synthesiser("Hola, ¿cómo estás hoy?")

scipy.io.wavfile.write("finetuned_output.wav", rate=speech["sampling_rate"], data=speech["audio"])
```

ylacombe commented 10 months ago

Hey @vijishmadhavan, thanks for opening an issue. I've reproduced your error and fixed it with:

```python
scipy.io.wavfile.write("finetuned_output.wav", rate=speech["sampling_rate"], data=speech["audio"][0])
```

I'll change the README accordingly! Thanks for flagging.
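For completeness, a full corrected version of the first snippet in this thread would then look like this (same model id and input text as above):

```python
from transformers import pipeline
import scipy

model_id = "Vijish/vits_mongolian_monospeaker"
synthesiser = pipeline("text-to-speech", model_id)  # add device=0 if you want to use a GPU

speech = synthesiser("Монголын бурхан шашинтны төв Гандантэгчэнлин хийдийн Тэргүүн их хамба Д.")

# speech["audio"] has shape (1, num_samples); take the first row so scipy
# writes a mono file instead of reading the samples axis as a channel count.
scipy.io.wavfile.write(
    "finetuned_output.wav",
    rate=speech["sampling_rate"],
    data=speech["audio"][0],
)
```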

vijishmadhavan commented 10 months ago

Hey, thank you. I trained with 2200 samples for 200 epochs, but my results sound weird. Is this an overfitting issue?

https://huggingface.co/datasets/Vijish/mozilla_mongolian2
https://huggingface.co/Vijish/mms
https://huggingface.co/Vijish/vits_mongolian_monospeaker

vijishmadhavan commented 10 months ago

https://github.com/ylacombe/finetune-hf-vits/assets/53169213/9cb8c2ed-2c13-495f-a88e-62be950d5b81

ylacombe commented 10 months ago

Hi @vijishmadhavan, this is a really interesting use-case, thanks for sharing!

A few questions:

  1. Could you send the hyper-parameters that you used?
  2. Did you filter your dataset down to a single speaker? From what I hear, there are a few different speakers in yours, so you'll want to keep only one (see the sketch after this list).
  3. I don't see the discriminator here: https://huggingface.co/Vijish/mms. Did you follow step 2?
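
A minimal sketch of that filtering step, assuming the dataset exposes a speaker column; the name `speaker_id` is a placeholder, adjust it to whatever your dataset actually uses (e.g. `client_id` for Common Voice exports):

```python
from datasets import load_dataset

# Placeholder column name: replace "speaker_id" with your dataset's speaker column.
dataset = load_dataset("Vijish/mozilla_mongolian2", split="train")

# Keep only the most frequent speaker so the fine-tuned voice stays consistent.
counts = {}
for speaker in dataset["speaker_id"]:
    counts[speaker] = counts.get(speaker, 0) + 1
main_speaker = max(counts, key=counts.get)

single_speaker_dataset = dataset.filter(lambda example: example["speaker_id"] == main_speaker)
print(f"Kept {len(single_speaker_dataset)} of {len(dataset)} samples for speaker {main_speaker}")
```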

A few recommendations now:

  1. You can use wandb to keep track of the logs of your experiment, both to better understand where it isn't learning and to share it here. You have to add this line to your config (see the sketch after this list).
  2. Keep in mind that you don't really need 2200 samples; far fewer will do.
  3. Also keep in mind that the quality of your input dataset will be reflected in the generated audio.
  4. If you lower the number of samples (let's say to 200-ish), you can experiment with the hyperparameters. My guess is that you only need to play with the learning rate first. Let me know if that helps!
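
A sketch of that config change, not verified against this repo's exact config schema: transformers' `TrainingArguments` does accept a `report_to` argument, and assuming the JSON training config forwards its entries to those arguments, the line would look like this (all other fields elided):

```json
{
  "report_to": ["wandb"]
}
```
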
vijishmadhavan commented 10 months ago

Hey, thank you so much, it worked! If the base model doesn't have emotion in its voice, can fine-tuning with a nice dataset help bring in emotion and make it sound less robotic?

ylacombe commented 10 months ago

You're welcome! How's the quality of the model you trained?

> If the base model doesn't have emotion in its voice, can fine-tuning with a nice dataset help bring in emotion and make it sound less robotic?

Intuitively, I'd say that a single VITS checkpoint is suited to one speaker and one emotion. If you have a dataset with a consistent emotion and a consistent speaker, and assuming you find the right hyper-parameters, my guess is that you'd end up with a good model.

BTW, I've recently fine-tuned an automatic speech recognition model on Mongolian that will soon be integrated into transformers, if that's of any interest to you!