thomasht86 opened 7 months ago
Hi, unfortunately I don't think the authors are planning to release other MMS models. On our side, to release such a model we'd need a good Norwegian TTS dataset; do you have such a dataset in mind? Would you also be interested in training such a model from scratch? If so, let me know and I can give you some pointers and some help. Best
Hi! Thanks for the reply! There are at least two large datasets available in Norwegian:
I have tried running your scripts and converting a checkpoint from https://huggingface.co/facebook/mms-tts-swe, on the premise that Swedish and Norwegian are quite similar.
I had to modify vocab.json manually by adding two characters (æ, å) mapped to the same token_id as their Swedish "counterparts".
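For anyone attempting the same, a minimal sketch of that vocab.json edit. The counterpart mapping below is a hypothetical example, not the one actually used; check the real vocab of facebook/mms-tts-swe before applying it:

```python
import json

# Hypothetical mapping from new Norwegian characters to the Swedish
# characters whose token ids they should reuse (an assumption; adjust
# to the actual contents of the Swedish vocab.json).
COUNTERPARTS = {"æ": "ä", "å": "a"}

def add_characters(vocab_path: str) -> None:
    """Add each new character to vocab.json, reusing the token_id of
    its designated counterpart."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)
    for new_char, swe_char in COUNTERPARTS.items():
        # Map the new character to the same token_id as its counterpart.
        vocab[new_char] = vocab[swe_char]
    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False, indent=2)
```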
I have played around with different learning rates and parameters, but I consistently get infinity for the KL loss and a NaN total loss after 100 steps or so.
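One generic mitigation for this kind of divergence (a sketch, not something from the finetune-hf-vits repo) is to guard the training step and skip the update whenever a loss term is non-finite, so one bad batch doesn't poison the weights:

```python
import math

def safe_step(losses):
    """Return True if all loss terms are finite and the optimizer step
    should run; False means skip this batch.

    `losses` is a dict of scalar loss values, e.g. {"kl": 1.3, "mel": 0.7}.
    In a real PyTorch loop you would check torch.isfinite(loss).all()
    instead of math.isfinite.
    """
    bad = [name for name, value in losses.items() if not math.isfinite(value)]
    if bad:
        # Log which terms blew up; repeated hits here usually mean the
        # learning rate is too high or the init is wrong.
        print(f"skipping step, non-finite losses: {bad}")
        return False
    return True
```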
If you could give pointers for training a model from scratch, I could give it a shot. 😊
How did you initialize the model? This might play an important role. [EDIT:] Looking at this model, it seems okay; did you initialize it from scratch?
Also, which hyperparameters did you use? I'd recommend using the defaults from the original VITS training.
No, I generated that one from the Swedish model with convert_original_discriminator_checkpoint.py.
But starting from a model initialized on another language would probably require some tricks to fine-tune?
I used the hyperparameters you provided in https://github.com/ylacombe/finetune-hf-vits/tree/main/training_config_examples as a basis, but did a "random manual search" from there.
Where can I find the default ones from original training?
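For reference, the upstream VITS repo's LJSpeech base config uses roughly the following training defaults. These values are quoted from memory, so double-check them against `configs/ljs_base.json` in the original VITS repository:

```json
{
  "train": {
    "learning_rate": 2e-4,
    "betas": [0.8, 0.99],
    "eps": 1e-9,
    "batch_size": 64,
    "lr_decay": 0.999875,
    "segment_size": 8192,
    "c_mel": 45,
    "c_kl": 1.0
  }
}
```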
In that case, here is a snippet that you can modify to initialize from scratch:

```python
from utils.configuration_vits import VitsConfig
from utils.modeling_vits_training import VitsModelForPreTraining
from utils.feature_extraction_vits import VitsFeatureExtractor
from transformers import AutoTokenizer

NEW_REPO_ID = ...

# Reuse the existing config, feature extractor and tokenizer,
# but push a freshly initialized (random-weight) model.
config = VitsConfig.from_pretrained("thomasht86/mms-tts-nob")
VitsModelForPreTraining(config).push_to_hub(NEW_REPO_ID)
VitsFeatureExtractor.from_pretrained("thomasht86/mms-tts-nob").push_to_hub(NEW_REPO_ID)
AutoTokenizer.from_pretrained("thomasht86/mms-tts-nob").push_to_hub(NEW_REPO_ID)
```
In terms of training, I'd advise:
I am in sort of the same situation, but looking to fine-tune MMS for Danish (which is very similar to Norwegian).
I am having trouble understanding where the above code snippet fits into the training pipeline. Should it be executed after converting a checkpoint using the convert_original_discriminator_checkpoint.py script?
Strangely enough, I can see from the MMS coverage that `nob` (Norwegian) is not supported for TTS. What must be done in order to support it?