How to add unsupported language? (`nob`)

ylacombe / finetune-hf-vits

Finetune VITS and MMS using HuggingFace's tools

MIT License


How to add unsupported language? (`nob`) #18

Open thomasht86 opened 7 months ago

thomasht86 commented 7 months ago

Strangely enough, I can see from MMS coverage that nob (Norwegian) is not supported for TTS.

What must be done in order to support it?

ylacombe commented 7 months ago

Hi, unfortunately I don't think the authors are planning to release other MMS models. On our side, to release such a model, we'd need a good Norwegian TTS dataset. Do you have one in mind? Would you also be interested in training such a model from scratch? If so, let me know and I can give you some pointers and some help. Best

thomasht86 commented 7 months ago

Hi! Thanks for the reply! There are at least two large datasets available in Norwegian:

I have tried running your scripts and converting a checkpoint from https://huggingface.co/facebook/mms-tts-swe, on the premise that Swedish and Norwegian are quite similar. I had to modify vocab.json manually, adding two characters [æ,å] mapped to the same token_id as their Swedish "counterparts".
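The vocab.json change described above can be sketched as follows. This is a minimal illustration, not the script that was actually used: the helper names `alias_characters` and `patch_vocab_file` are hypothetical, and the character pairs in the usage example (æ to ä, ø to ö) are an assumption about which Swedish characters count as counterparts.

```python
import json
from pathlib import Path


def alias_characters(vocab: dict, aliases: dict) -> dict:
    """Return a copy of `vocab` where each new character reuses an existing id.

    `aliases` maps a new character to an existing character whose
    token id it should share (hypothetical helper, for illustration).
    """
    patched = dict(vocab)
    for new_char, existing_char in aliases.items():
        if existing_char not in vocab:
            raise KeyError(f"{existing_char!r} is not in the vocabulary")
        patched[new_char] = vocab[existing_char]
    return patched


def patch_vocab_file(path: str, aliases: dict) -> None:
    """Apply `alias_characters` to a vocab.json file in place."""
    vocab_path = Path(path)
    vocab = json.loads(vocab_path.read_text(encoding="utf-8"))
    patched = alias_characters(vocab, aliases)
    # ensure_ascii=False keeps characters such as "æ" readable in the file
    vocab_path.write_text(
        json.dumps(patched, ensure_ascii=False, indent=2), encoding="utf-8"
    )


# Toy vocabulary; the pairs below are assumed, adjust them to your language.
toy_vocab = {"a": 0, "ä": 1, "ö": 2}
patched_vocab = alias_characters(toy_vocab, {"æ": "ä", "ø": "ö"})
```

Because two characters then share one id, encoding works but decoding an id back to text is ambiguous, and the model cannot learn a distinct pronunciation for the aliased character.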

I have played around with different learning rates and other parameters, but I consistently get an infinite KL loss, followed by a NaN loss after 100 steps or so.

If you could give pointers for training a model from scratch, I could give it a shot. 😊
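To narrow down where a run like the one described above diverges, one option is to check every loss term at each step and stop at the first non-finite value. The sketch below is plain Python with hypothetical names (`first_non_finite`, the loss keys); in a real training loop the values would come from calling `.item()` on the loss tensors.

```python
import math
from typing import Optional


def first_non_finite(losses: dict) -> Optional[str]:
    """Return the name of the first loss that is inf or NaN, else None."""
    for name, value in losses.items():
        if not math.isfinite(value):
            return name
    return None


# Toy per-step loss values; the key names are illustrative only.
history = [
    {"mel": 41.2, "kl": 3.1, "duration": 1.9},
    {"mel": 39.8, "kl": float("inf"), "duration": 1.8},
    {"mel": float("nan"), "kl": float("nan"), "duration": float("nan")},
]

diverged_at = None
for step, losses in enumerate(history):
    bad = first_non_finite(losses)
    if bad is not None:
        diverged_at = (step, bad)
        break
```

Knowing which term goes non-finite first (here the KL term, one step before everything turns into NaN) helps decide whether to look at the loss weights, the learning rate, or the initialization.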

ylacombe commented 7 months ago

How did you initialize the model? This might play an important role. [EDIT]: looking at this model, it seems okay. Did you initialize it from scratch?

Also, which hyper-parameters did you use? I'd recommend using the default ones from the original VITS training.

thomasht86 commented 7 months ago

No, I generated that one from the Swedish model, with convert_original_discriminator_checkpoint.py. But starting from a model initialized for another language probably requires some tricks to finetune..?

I used the hyperparameters you provided in https://github.com/ylacombe/finetune-hf-vits/tree/main/training_config_examples as a basis, but did a "random manual search" from there.

Where can I find the default ones from the original training?

ylacombe commented 7 months ago

In that case, here is a snippet that you can modify to initialize from scratch:

```python
from utils.configuration_vits import VitsConfig
from utils.modeling_vits_training import VitsModelForPreTraining
from utils.feature_extraction_vits import VitsFeatureExtractor
from transformers import AutoTokenizer

# Set this to the Hub repo id you want to push to
NEW_REPO_ID = ...

# Reuse the existing config, but build a model with freshly initialized weights
config = VitsConfig.from_pretrained("thomasht86/mms-tts-nob")
VitsModelForPreTraining(config).push_to_hub(NEW_REPO_ID)

# Copy the feature extractor and tokenizer over unchanged
VitsFeatureExtractor.from_pretrained("thomasht86/mms-tts-nob").push_to_hub(NEW_REPO_ID)
AutoTokenizer.from_pretrained("thomasht86/mms-tts-nob").push_to_hub(NEW_REPO_ID)
```

In terms of training, I'd advise:

- focusing on a single speaker per model, as it will make training easier and give better quality
- following the original hyper-parameters (learning rate and loss weights): here
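For orientation, the values below are recalled from the `ljs_base.json` config in the original VITS repository and should be checked against the source linked above, which remains authoritative. The dictionary name and the `lr_at_epoch` helper are hypothetical; the helper only illustrates the per-epoch exponential decay that `lr_decay` implies, and the key names are those of the original config, not necessarily those used by this repository's training configs.

```python
# Recalled from the original VITS ljs_base.json; verify before relying on them.
ORIGINAL_VITS_TRAIN = {
    "learning_rate": 2e-4,  # initial learning rate for AdamW
    "betas": (0.8, 0.99),   # AdamW betas
    "eps": 1e-9,            # AdamW epsilon
    "lr_decay": 0.999875,   # multiplicative decay applied once per epoch
    "batch_size": 64,
    "segment_size": 8192,   # audio samples per training segment
    "c_mel": 45,            # weight of the mel reconstruction loss
    "c_kl": 1.0,            # weight of the KL loss
}


def lr_at_epoch(epoch: int, cfg: dict = ORIGINAL_VITS_TRAIN) -> float:
    """Learning rate after `epoch` epochs of exponential decay."""
    return cfg["learning_rate"] * cfg["lr_decay"] ** epoch


lr_start = lr_at_epoch(0)
lr_later = lr_at_epoch(1000)
```

The decay is slow by design: the learning rate stays close to its initial value for the first few hundred epochs.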

JackismyShephard commented 7 months ago

I am in more or less the same situation, but looking to finetune MMS for Danish (which is very similar to Norwegian).

I am having trouble understanding where the above code snippet fits into the training pipeline. Should it be executed after converting a checkpoint using the convert_original_discriminator_checkpoint.py script?