yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

Awesome in english but no support for other languages - please add an example for another language (german, italian, french etc) #41

Open cmp-nct opened 10 months ago

cmp-nct commented 10 months ago

The readme makes it sound very simple: "Replace bert with xphonebert". Looking a bit closer, though, it seems quite a feat to make StyleTTS2 speak non-English languages (https://github.com/yl4579/StyleTTS2/issues/28).

StyleTTS2 looks like the best approach we have right now, but English-only support is a deal-breaker for many, since it means any app built on it will be limited to English with no prospect for other languages in sight.

Some help to get this going in foreign languages would be awesome.

It appears we need to change the inference code and re-train the text and phonetics components. Any demo/guide would be great.

ardha27 commented 10 months ago

Yes, I want this too.

yl4579 commented 10 months ago

@ardha27 I think it was already included in the processed dataset, and the espeak IPA results are good enough.

ardha27 commented 10 months ago

Is it already pushed to the current branch? Sorry, but how can I use it?

yl4579 commented 10 months ago

@ardha27 No, it is included in the training data for the multilingual PL-BERT model. The training hasn't started yet; I'm still waiting for the 8-GPU machine from @hobodrifterdavid.

dsplog commented 9 months ago

> For example, for the sentence "This is a test sentence", we get 6 tokens [this, is, a, test, sen#, #tence] and their corresponding graphemes. In particular, the two tokens [sen#, #tence] correspond to ˈsɛnʔn̩ts. The goal is to map each grapheme representation in ˈsɛnʔn̩ts to the average contextualized BERT embedding of [sen#, #tence]. This requires running the teacher BERT model, but we can extract the contextualized BERT embeddings online (during training) and maximize the cosine similarity between the predicted embeddings of these words and those of the teacher model (multilingual BERT).

@yl4579 : are the changes for the subword tokenizations available?

yl4579 commented 9 months ago

@dsplog I haven't implemented them yet. I'm done with most of the data preprocessing and just need people to fix the following languages. If there is no response for these languages before I come back from NeurIPS (Dec 18), I will proceed to training the multilingual PL-BERT. I will have to remove Thai and use the phonemizer results as-is for the following languages:

bn: Bengali (phonemizer seems less accurate than charsiuG2P)
cs: Czech (same as above)
ru: Russian (phonemizer is inaccurate for some phonemes, like tʃ/ʒ should be t͡ɕ/ʐ)
th: Thai (phonemizer totally broken)
GayatriVadaparty commented 9 months ago

> I think the GPUs provided by @hobodrifterdavid would be a great start for multilingual PL-BERT training. Before proceeding though, I need some people who speak as many languages as possible (hopefully also with some knowledge of IPA) to help with the data preparation. I only speak English, Chinese and Japanese, so I can only help with these 3 languages.
>
> My plan is to use this multilingual BERT tokenizer: https://huggingface.co/bert-base-multilingual-cased, tokenize the text, get the corresponding tokens, use phonemizer to get the corresponding phonemes, and align the phonemes with the tokens. Since this tokenizer works on subwords, we cannot predict the subword grapheme tokens. So my idea is, instead of predicting the grapheme tokens (which are not full graphemes anyway, and we cannot really align half of a grapheme to some of its phonemes; e.g. in English "phonemes" can be tokenized into phone#, #me#, #s, but its actual phonemes are /ˈfəʊniːmz/, which cannot be aligned perfectly with phone#, #me# or #s), we predict the contextualized embeddings from a pre-trained BERT model.
>
> For example, for the sentence "This is a test sentence", we get 6 tokens [this, is, a, test, sen#, #tence] and their corresponding graphemes. In particular, the two tokens [sen#, #tence] correspond to ˈsɛnʔn̩ts. The goal is to map each grapheme representation in ˈsɛnʔn̩ts to the average contextualized BERT embedding of [sen#, #tence]. This requires running the teacher BERT model, but we can extract the contextualized BERT embeddings online (during training) and maximize the cosine similarity between the predicted embeddings of these words and those of the teacher model (multilingual BERT).
>
> Now the biggest challenge is aligning the tokenizer output with the graphemes, which may require some expertise in the specific languages. There could be potential quirks, inaccuracies or traps for certain languages. For example, phonemizer doesn't work with Japanese and Chinese directly; you first have to romanize the graphemes and then use phonemizer. The characters in these languages do not always have the same pronunciation, depending on the context, so expertise in these languages is needed when doing NLP with them. To make sure the data preprocessing goes as smoothly and accurately as possible, any help from those who speak any language on this list (or know some linguistics about these languages) is greatly appreciated.
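To make the averaging-and-cosine-similarity idea concrete, here is a minimal sketch with random stand-in tensors (the token lists, shapes and variable names are hypothetical; the real pipeline would use the bert-base-multilingual-cased tokenizer and actual teacher/student models):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 768  # hidden size of bert-base-multilingual-cased

# Hypothetical tokenizer output for "This is a test sentence":
# one group of subword tokens per whole word.
subword_groups = [["this"], ["is"], ["a"], ["test"], ["sen#", "#tence"]]

# Stand-in for the teacher BERT's contextualized embeddings,
# one vector per subword token (6 subwords here).
n_subwords = sum(len(g) for g in subword_groups)
teacher_emb = rng.normal(size=(n_subwords, dim))

# Average the teacher embeddings over each word's subword tokens,
# e.g. [sen#, #tence] -> one target vector for the whole word.
targets, i = [], 0
for group in subword_groups:
    targets.append(teacher_emb[i:i + len(group)].mean(axis=0))
    i += len(group)
targets = np.stack(targets)  # shape: (n_words, dim)

# Every phoneme of a word (e.g. each symbol of ˈsɛnʔn̩ts) shares that
# word's target; the student (PL-BERT) predicts embeddings and is
# trained to maximize cosine similarity with the targets.
def cosine(a, b):
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

student_pred = rng.normal(size=(len(subword_groups), dim))  # stand-in predictions
loss = -cosine(student_pred, targets).mean()  # minimize negative similarity
```

Maximizing cosine similarity (rather than, say, MSE) only constrains the direction of the predicted embeddings, which matches the description above.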

Hey, I would love to work on this. I really liked the model you've created. I'm using it in my work, comparing voice-overs across different TTS models. I've just learned that StyleTTS needs multilingual support. I can help with Telugu training, and I know people who speak Hindi as well. I'm from India.

somerandomguyontheweb commented 9 months ago

Hi @yl4579, thank you for this awesome project. Just wanted to clarify if there are any plans to add support for Belarusian, my native tongue. Apparently espeak-ng supports it, but when I attempted to process Belarusian Wikipedia with preprocess.ipynb, I saw that the phonemization quality is rather poor: in particular, word stress is often wrong, and numbers are not expanded properly into numerals, even though the numerals are listed in be_list. Could you please let me know if there is anything I could help with, in order to add Belarusian to multilingual PL-BERT? (E.g. providing a dictionary of stress patterns for espeak-ng, improving numeral conversion rules, etc.)

iamjamilkhan commented 9 months ago

Please add Hindi support as well.

yl4579 commented 9 months ago

@somerandomguyontheweb You can join the Slack channel and build the dataset yourself if you believe the espeak output is bad. I will upload all the datasets I have soon.

yl4579 commented 9 months ago

@iamjamilkhan @GayatriVadaparty Hindi and Telugu are already added in multilingual PL-BERT training. I will upload the dataset soon. You can check the quality and let me know if something needs to be fixed.

GayatriVadaparty commented 9 months ago

@yl4579 Sure, I’ll do that.

yl4579 commented 9 months ago

I have uploaded most of the data I have: https://huggingface.co/datasets/styletts2-community/multilingual-pl-bert Please check if there's anything missing or not ideal. To check whether the IPA is phonemized correctly for your language, you will need to decode the tokens using the https://huggingface.co/bert-base-multilingual-cased tokenizer. If something is wrong, please let me know. I will probably start multilingual PL-BERT training early next month (Jan 2024). The list of language codes can be found here: https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md
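For anyone unsure what "decode the tokens" involves: the real check would load the tokenizer with `transformers.AutoTokenizer.from_pretrained("bert-base-multilingual-cased")` and call `convert_ids_to_tokens()` / `decode()` on the stored IDs. As a self-contained illustration of the WordPiece convention it uses (continuation pieces start with `##` and glue onto the previous piece):

```python
def wordpiece_detokenize(tokens):
    """Join WordPiece tokens back into words: '##'-prefixed pieces
    are continuations of the previous piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)

# Hypothetical token sequence for one dataset entry.
print(wordpiece_detokenize(["This", "is", "a", "test", "sen", "##tence"]))
# -> This is a test sentence
```

You can then compare the recovered text side by side with the stored phoneme string to spot misaligned or badly phonemized entries.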

SanketDhuri commented 9 months ago

Please add Marathi support as well

yl4579 commented 9 months ago

@SanketDhuri It is already included: https://huggingface.co/datasets/styletts2-community/multilingual-pl-bert/tree/main/mr You may want to check the quality of this data yourself because I don't speak this language.

acalatrava commented 8 months ago

@yl4579 Did you start the training? I may be able to help with Spanish (Spain) if needed.

mkhennoussi commented 8 months ago

I am here to help with French if needed !

cmp-nct commented 8 months ago

> @yl4579 Did you start the training? I may be able to help with Spanish (Spain) if needed.

My last status: training of the multilingual PL-BERT is planned to start during January (it has not started yet). Once that is working, the model itself can be trained.

paulovasconcellos-hotmart commented 8 months ago

Hello. I'm interested in helping train a PT-BR model. I have corporate resources to do so. Let me know how I can help.

philpav commented 7 months ago

I'd love to see support for German accents like Austrian but I guess there's no dataset available.

agonzalezd commented 7 months ago

I could give linguistic support for most Iberian languages: Castilian Spanish, Basque, Catalan, Asturian and Galician. However, given how closely these languages' spelling follows their pronunciation, a BERT model based on plain text might also be enough for synthesising them.

ashaltu commented 7 months ago

Hello! I'm also interested in adding support for the Oromo (orm) language; espeak-ng has a phonemizer for it, although it could be improved.

SpanishHearts commented 7 months ago

Any chance to include Bulgarian?

rlenain commented 7 months ago

Hi everyone -- I have trained a PL-BERT model on a 14 language dataset which was crowdsourced by the author of the paper. You can find this model open-sourced here: https://huggingface.co/papercup-ai/multilingual-pl-bert

Using this PL-BERT model, you can now train multilingual StyleTTS2 models. In my experiments, I have found that you don't need to train from scratch in order to train multilingual StyleTTS2, you can just finetune. Follow the steps outlined in the link I shared above!

Best of luck, and let me know what you make with this!
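For readers who want the gist of the swap without following the link: a hedged sketch of the setup, assuming the `PLBERT_dir` key and `Utils/PLBERT/` layout from the StyleTTS2 repo (key names and paths may differ between versions; the model card above has the authoritative steps):

```yaml
# Hedged sketch, not verified against every repo version:
# 1) download the checkpoint from
#    https://huggingface.co/papercup-ai/multilingual-pl-bert
#    into Utils/PLBERT/, replacing the English-only PL-BERT files
# 2) make sure the finetuning config points at that folder:
PLBERT_dir: Utils/PLBERT/
```

Then finetune as usual; as rlenain notes, training from scratch is not required.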

m-toman commented 7 months ago

> Hi everyone -- I have trained a PL-BERT model on a 14 language dataset which was crowdsourced by the author of the paper. You can find this model open-sourced here: https://huggingface.co/papercup-ai/multilingual-pl-bert
>
> Using this PL-BERT model, you can now train multilingual StyleTTS2 models. In my experiments, I have found that you don't need to train from scratch in order to train multilingual StyleTTS2, you can just finetune. Follow the steps outlined in the link I shared above!
>
> Best of luck, and let me know what you make with this!

This is awesome. Going to try it.

Unfortunately, it seems we get no language embeddings, so we can't really train a multilingual model with cross-lingual capabilities at the moment?

rlenain commented 7 months ago

I have actually trained a model which can speak multiple languages, without the need for a language embedding. I guess the model learns implicitly, either from the phonemisation or from the references, to speak with a specific accent.

m-toman commented 7 months ago

@rlenain Interesting. Yeah, I assume this would work; it's just a little uncomfortable to rely on the model doing the right thing when you want one voice in multiple languages.

I thought maybe I could additively augment the style embedding with some language info. A bit like some early adapter models: keep English at +0 for the existing model, and for new training data in other languages add the output of a linear layer over a one-hot language encoding. Just a rough idea without much more thought yet ;)
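The adapter idea above could be sketched like this (a hypothetical module, not part of the StyleTTS2 codebase; names and dimensions are made up). A linear layer over a one-hot code is just an embedding table, and zero-initializing it keeps every language, including English, at +0 at the start:

```python
import torch
import torch.nn as nn

class LanguageAugmentedStyle(nn.Module):
    """Add a learned per-language offset to the style vector.

    Zero-initialized, so the pretrained (English) model's behavior is
    unchanged at the start of finetuning; the English row is also
    masked to stay pinned at exactly +0 throughout training.
    """

    def __init__(self, n_languages: int, style_dim: int, english_id: int = 0):
        super().__init__()
        # Linear layer over a one-hot language code == embedding lookup.
        self.lang_offset = nn.Embedding(n_languages, style_dim)
        nn.init.zeros_(self.lang_offset.weight)
        self.english_id = english_id

    def forward(self, style: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        offset = self.lang_offset(lang_id)                       # (B, style_dim)
        offset = offset * (lang_id != self.english_id).unsqueeze(-1)  # pin English
        return style + offset

# Usage sketch: 4 languages, 128-dim style vectors, batch of 2.
aug = LanguageAugmentedStyle(n_languages=4, style_dim=128)
styles = torch.randn(2, 128)
out = aug(styles, torch.tensor([0, 2]))  # English sample + language-2 sample
```

Whether the diffusion-based style sampling tolerates such an additive shift is an open question; this only shows the mechanics of the "+0 for English" scheme.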

Smithangshu commented 5 months ago

> @dsplog I haven't implemented them yet. I'm done with most of the data preprocessing and just need people to fix the following languages. If there is no response for these languages before I come back from NeurIPS (Dec 18), I will proceed to training the multilingual PL-BERT. I will have to remove Thai and use the phonemizer results as-is for the following languages:
>
> bn: Bengali (phonemizer seems less accurate than charsiuG2P)
> cs: Czech (same as above)
> ru: Russian (phonemizer is inaccurate for some phonemes, like tʃ/ʒ should be t͡ɕ/ʐ)
> th: Thai (phonemizer totally broken)

I am a native Bengali speaker from India. Please let me know what kind of help I can offer.

Dmytro-Shvetsov commented 5 months ago

@rlenain, thank you for your awesome work! Do I understand correctly that the multilingual PL-BERT is just a starting point for building StyleTTS2 models in languages other than English? Or should it work with other languages out of the box? If it's only a starting point, could you share which parts of the inference pipeline need to be modified (e.g. I assume the phonemizer for the target language, and perhaps the style audio should come from a speaker of the target language)?

rlenain commented 5 months ago

You need to further finetune, or train from scratch, with the new PL-BERT; it won't work in inference mode alone. That's because if you swap it out, the outputs of the PL-BERT module will no longer be "aligned" with the other modules that expect the PL-BERT outputs as inputs.

This is generally true of any ML model: if you change a module, you need to further train / finetune to get the model working again.

LordSyd commented 2 months ago

I tried finetuning in German using around 1h of data and the multilingual PL-BERT, but even training for 50 epochs did not yield a model that could generate coherent speech.

The only parameters I changed in config_ft.yaml were `batch_size: 2` and `max_len: 600`.

For `diff_epoch` and `joint_epoch` I tried different values, but also used the standard 10 and 30.

What I find curious is that the generated speech sounds close to the reference in tonality and inflection, but the content is just gibberish. I thought it might be the data, but maybe someone with more fine-tuning experience can tell me whether this could be an issue that isn't data-related?

Also, a general question: do I need the original LibriTTS dataset in the data folder for fine-tuning? The OOD_texts.txt points to nonexistent files, and the way the fine-tuning tutorial is written it is not clear whether we need just the OOD_texts file or also the files it points to.

Edit: After playing around some more, I decided to make my own OOD_texts file, and now at least the sentences the model generates are understandable German. Still, the generation quality is not very high, even after 50 epochs of training. I have around 1h of audio; is this still too little?

mikhail2013ru commented 1 month ago

Hello :) Let's make this easier:

  1. Could someone write detailed instructions in English, with an example, on how to prepare a dataset?
  2. How should recordings be cleaned of noise, and which VST plug-ins are needed for a balanced sound?
  3. What total duration is needed?
  4. How many epochs are needed?
  5. Is it necessary to train BERT separately for the target language?
  6. Could those who have already trained a model in another language show their results and share how they did it?
  7. Did you train on a home graphics card or on an industrial server?
  8. I want to train in Russian.
  9. Is it possible to create a dataset from similar-sounding languages? For example, 25 hours of Russian, 25 hours of Bulgarian, and so on.
  10. I currently have 25 hours of audiobook recordings of a pleasant announcer's voice; can I make a voice model from just this one voice and get high quality?

mzdk100 commented 1 month ago

As a suggestion, take a look at GPT-SoVITS. That open-source multilingual TTS implementation is really great; it is arguably the best Chinese TTS, and it also supports Japanese, English, and Korean.

juangea commented 4 days ago

@LordSyd what are the steps you took?

I have several hours of video of myself speaking Spanish, and I could prepare a dataset using Whisper to transcribe it; however, I'm not sure what I have to do.

I have a 4090, so I can try fine-tuning with much more than 1 hour of data, but some help with the steps would be welcome, I have no idea :)
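One concrete piece of the prep is turning Whisper's (wav, transcript) pairs into the filelists the finetuning script reads. A hedged sketch: the example filelists in the repo (Data/train_list.txt, val_list.txt) use one `wav_path|transcription|speaker_id` line per clip, to the best of my recollection; the `clips` list below is a made-up stand-in for Whisper output:

```python
from pathlib import Path

# Stand-in for Whisper output: (wav path, transcription) per clip.
clips = [
    ("clips/0001.wav", "Hola, esto es una prueba."),
    ("clips/0002.wav", "Segunda frase del dataset."),
]

speaker_id = 0  # single-speaker finetuning: one id for every clip

# One "wav_path|transcription|speaker_id" line per clip.
lines = [f"{wav}|{text}|{speaker_id}" for wav, text in clips]
Path("train_list.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Caveat: depending on the repo version, the text column may need to be pre-phonemized IPA rather than raw text, so compare against the bundled example filelists before training.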

ichDaheim commented 2 days ago

@LordSyd did you have any luck training a German model? I would be highly interested in the outcome.