vijishmadhavan opened this issue 10 months ago
Audio Type: <class 'numpy.ndarray'>
Audio Shape: (1, 708608)
Audio Data Type: float32
Sampling Rate: 16000
I am getting the same error while trying your model too.
from transformers import pipeline
import scipy

model_id = "ylacombe/mms-spa-finetuned-argentinian-monospeaker"
synthesiser = pipeline("text-to-speech", model_id)  # add device=0 if you want to use a GPU

speech = synthesiser("Hola, ¿cómo estás hoy?")

scipy.io.wavfile.write("finetuned_output.wav", rate=speech["sampling_rate"], data=speech["audio"])
Hey @vijishmadhavan, thanks for opening an issue! I've reproduced your error and fixed it with:
scipy.io.wavfile.write("finetuned_output.wav", rate=speech["sampling_rate"], data=speech["audio"][0])
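For reference, speech["audio"] comes back with shape (1, num_samples), so indexing the first row gives scipy.io.wavfile.write the 1-D mono array it expects. The full corrected snippet would look roughly like this:

from transformers import pipeline
import scipy.io.wavfile  # import the wavfile submodule explicitly

model_id = "ylacombe/mms-spa-finetuned-argentinian-monospeaker"
synthesiser = pipeline("text-to-speech", model_id)  # add device=0 if you want to use a GPU

speech = synthesiser("Hola, ¿cómo estás hoy?")

# speech["audio"] has shape (1, num_samples); take the first row to get a 1-D mono array
scipy.io.wavfile.write(
    "finetuned_output.wav",
    rate=speech["sampling_rate"],
    data=speech["audio"][0],
)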
I'll change the README accordingly! Thanks for flagging.
Hey, thank you! I trained on 2,200 samples for 200 epochs, but the results sound weird. Is this an overfitting issue?
https://huggingface.co/datasets/Vijish/mozilla_mongolian2
https://huggingface.co/Vijish/mms
https://huggingface.co/Vijish/vits_mongolian_monospeaker
Hi @vijishmadhavan, this is a really interesting use case, thanks for sharing!
A few questions:
A few recommendations now:
Hey, thank you so much, it worked! If the base model does not have emotion in its voice, can fine-tuning on a good dataset bring in emotion and make it sound less robotic?
You're welcome! How's the quality of the model you trained?
> If the base model does not have emotion in its voice, can fine-tuning on a good dataset bring in emotion and make it sound less robotic?
Intuitively, I'd say a single VITS checkpoint is suited to one speaker and one emotion. If you have a dataset with a consistent emotion and a consistent speaker, and assuming you find the right hyper-parameters, my guess is that you'd get a good model.
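For example, if your dataset carries speaker and emotion metadata, you could filter it down to one consistent subset before fine-tuning. Here's a minimal sketch with the datasets library, assuming hypothetical column names speaker_id and emotion (adapt them to whatever fields your dataset actually has):

from datasets import load_dataset

# The "train" split name and the "speaker_id"/"emotion" columns are assumptions
# for illustration; replace them with whatever your dataset actually provides.
dataset = load_dataset("Vijish/mozilla_mongolian2", split="train")

# Keep a single speaker and a single emotion so the VITS checkpoint only has to
# model one consistent voice.
subset = dataset.filter(
    lambda example: example["speaker_id"] == "speaker_0"
    and example["emotion"] == "neutral"
)

print(f"Kept {len(subset)} of {len(dataset)} examples")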
BTW, I've recently fine-tuned an automatic speech recognition model on Mongolian that will soon be integrated into transformers, if that's of any interest to you!
from transformers import pipeline
import scipy

model_id = "Vijish/vits_mongolian_monospeaker"
synthesiser = pipeline("text-to-speech", model_id)  # add device=0 if you want to use a GPU

speech = synthesiser("Монголын бурхан шашинтны төв Гандантэгчэнлин хийдийн Тэргүүн их хамба Д.")

scipy.io.wavfile.write("finetuned_output.wav", rate=speech["sampling_rate"], data=speech["audio"])
/usr/local/lib/python3.10/dist-packages/scipy/io/wavfile.py in write(filename, rate, data)
    795         block_align = channels * (bit_depth // 8)
    796
--> 797         fmt_chunk_data = struct.pack('<HHIIHH', format_tag, channels, fs,
    798                                      bytes_per_second, block_align, bit_depth)
    799         if not (dkind == 'i' or dkind == 'u'):
error: ushort format requires 0 <= number <= (0x7fff * 2 + 1)
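This is the same shape issue as above: speech["audio"] is 2-D with shape (1, num_samples), and scipy.io.wavfile.write reads the second dimension as the channel count, so 708608 overflows the unsigned 16-bit channels field in the WAV header and struct.pack raises. A small repro of the failure and of the speech["audio"][0] fix, independent of the model:

import numpy as np
import scipy.io.wavfile

# Stand-in for speech["audio"]: a 2-D float32 array of shape (1, num_samples).
audio = np.zeros((1, 708608), dtype=np.float32)

try:
    # Interpreted as 708608 channels, which overflows the 16-bit channel count
    # in the WAV header and raises the struct error shown above.
    scipy.io.wavfile.write("broken.wav", rate=16000, data=audio)
except Exception as err:
    print(type(err).__name__, err)

# Indexing the first row gives a 1-D mono signal, which writes correctly.
scipy.io.wavfile.write("mono_output.wav", rate=16000, data=audio[0])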