yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

Multi-lingual training #257

Open nvadigauvce opened 5 days ago

nvadigauvce commented 5 days ago

Thanks for the wonderful work, which gives good, expressive TTS for English speakers. I am planning an Indian multi-lingual TTS. For this purpose, I have a few questions.

  1. Do we need to change only the data and the PL-BERT model, or are other changes required?
  2. Can we use this ASR model (ASR_path: "Utils/ASR/epoch_00080.pth") for languages other than English?
  3. For multiple languages, do we need to add data in each language to OOD_data: "Data/OOD_texts.txt"?
  4. Do we need to add a language ID during data preparation, similar to the speaker ID in train_list.txt/val_list.txt?
SandyPanda-MLDL commented 5 days ago

You have to train the PL-BERT model on a dataset of the particular language you want. A text dataset larger than 30 MB is sufficient, though you can use a larger one. Then use that trained PL-BERT model in StyleTTS2. Since you want to work with multilingual data, you of course need a phonemizer and tokenizer that support each specific language. And you have to train StyleTTS2 (stage 1 and stage 2) with the dataset of that language (the train, validation, and OOD text files).
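Concretely, the retrained components are wired in through the training config. A sketch of the relevant fields, following the layout of the repo's Configs/config.yml (the retrained-checkpoint paths here are hypothetical):

```yaml
# Point StyleTTS2 at the language-specific components.
ASR_config: "Utils/ASR/config.yml"
ASR_path: "Utils/ASR/epoch_00080.pth"   # or a retrained multilingual ASR checkpoint
PLBERT_dir: "Utils/PLBERT/"             # directory holding the language-specific PL-BERT

data_params:
  train_data: "Data/train_list.txt"
  val_data: "Data/val_list.txt"
  OOD_data: "Data/OOD_texts.txt"
```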

nvadigauvce commented 5 days ago

@SandyPanda-MLDL Thanks for the quick reply and for answering the first question; I understand the training of the PL-BERT model with a multi-lingual dataset now.

What about the other three questions?

  1. Can we use this ASR model (ASR_path: "Utils/ASR/epoch_00080.pth") for languages other than English?
  2. For multiple languages, do we need to add data in each language to OOD_data: "Data/OOD_texts.txt"?
  3. Do we need to add a language ID during data preparation, similar to the speaker ID in train_list.txt/val_list.txt?
traderpedroso commented 3 days ago

> @SandyPanda-MLDL Thanks for the quick reply and for answering the first question; I understand the training of the PL-BERT model with a multi-lingual dataset now. What about the other three questions? 2. Can we use this ASR model (ASR_path: "Utils/ASR/epoch_00080.pth") for languages other than English? 3. For multiple languages, do we need to add data in each language to OOD_data: "Data/OOD_texts.txt"? 4. Do we need to add a language ID during data preparation, similar to the speaker ID in train_list.txt/val_list.txt?

According to the README, the ASR model performs well in other languages. I tested it, and indeed it works fine. However, when I trained my own ASR model, StyleTTS improved dramatically. After that, I decided to train all the models with my own data and achieved quality on par with what the model delivers in English.

nvadigauvce commented 3 days ago

@traderpedroso Thanks for the reply.

  1. Did you fine-tune the ASR model (https://github.com/yl4579/AuxiliaryASR) on top of the existing ASR model, or train it from scratch with multiple languages?

  2. Did you also try to train the PL-BERT model with multiple languages? If yes, can we combine multiple languages, and do we need to provide an equal amount of training data for each language?

traderpedroso commented 3 days ago

> @traderpedroso Thanks for the reply.
>
> 1. Did you fine-tune the ASR model (https://github.com/yl4579/AuxiliaryASR) on top of the existing ASR model, or train it from scratch with multiple languages?
> 2. Did you also try to train the PL-BERT model with multiple languages? If yes, can we combine multiple languages, and do we need to provide an equal amount of training data for each language?

I used the PL-BERT recommended in the multilingual repository (https://huggingface.co/papercup-ai/multilingual-pl-bert) and it worked perfectly. For the ASR, I tested fine-tuning and also training from scratch; both approaches gave me the same result. Of course, the ASR that I trained from scratch was for a single language.

From my experience, training StyleTTS 2 is only worthwhile because inference is very fast and consumes little VRAM; the training cost makes it somewhat impractical. Besides, you can only train the second stage on a single GPU. Of course, I didn't train the model from scratch, which would be even more expensive, but I can guarantee that the quality is sensational. Another advantage of StyleTTS 2 is that it doesn't hallucinate; the generated audio is extremely reliable, especially for real-time streaming applications that don't need monitoring. However, in terms of cost vs. benefit, I personally prefer Tortoise for the final outcome.

nvadigauvce commented 3 days ago

@traderpedroso Thanks, I understand the AuxiliaryASR part; I will train it from scratch if the quality is bad.

  1. My use case is multi-lingual TTS for Indian languages, but Indian languages are not covered by PL-BERT (https://huggingface.co/papercup-ai/multilingual-pl-bert), so do you think we can still use multilingual-pl-bert for unseen languages?
  2. For multiple languages, do we need to add data in each language to OOD_data: "Data/OOD_texts.txt"?
  3. Do we need to add a language ID during data preparation for the multi-lingual use case, similar to the speaker ID in train_list.txt/val_list.txt? During inference, how will the model know which language to select?
traderpedroso commented 1 day ago

> @traderpedroso Thanks, I understand the AuxiliaryASR part; I will train it from scratch if the quality is bad.
>
> 1. My use case is multi-lingual TTS for Indian languages, but Indian languages are not covered by PL-BERT (https://huggingface.co/papercup-ai/multilingual-pl-bert), so do you think we can still use multilingual-pl-bert for unseen languages?
> 2. For multiple languages, do we need to add data in each language to OOD_data: "Data/OOD_texts.txt"?
> 3. Do we need to add a language ID during data preparation for the multi-lingual use case, similar to the speaker ID in train_list.txt/val_list.txt? During inference, how will the model know which language to select?

Ensure that the speaker IDs are numbers; I personally used large numbers such as 3000, 3001, etc. You need to fine-tune multilingual-pl-bert on your language if it is not listed. You do not need to add a language ID; keep the data in the same format as the example in the Data folder.
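For reference, the training lists in the repo's Data folder use pipe-separated lines of audio path, phonemized transcript, and numeric speaker ID. A sketch with hypothetical file names and large numeric speaker IDs as suggested above:

```
audio_0001.wav|ðɪs ɪz ɐ sˈæmpəl ˈʌtəɹəns.|3000
audio_0002.wav|ɐnˈʌðɚ lˈaɪn fɹʌm ɐ dˈɪfɹənt spˈiːkɚ.|3001
```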

I added data in the same language I trained on to Data/OOD_texts.txt but, honestly, I believe it has little effect: for the first 20 epochs I trained with the original Data/OOD_texts.txt, and the model was already generating quality audio.

For inference, you need to provide a dropdown list to select the language for your G2P (in this case, the phonemizer), or use a library that detects the language and switches the lang flag in the phonemizer, for example en-us, it, fr, etc.
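A minimal sketch of the detect-and-switch idea: map a detected ISO 639-1 code to the language code the phonemizer expects, with an English fallback. The mapping table and helper name are hypothetical; in practice the detected code would come from a detection library (e.g. langdetect) and the returned code would be passed to the phonemizer's language argument.

```python
# Hypothetical mapping from detected ISO 639-1 codes to espeak-style
# language codes accepted by the phonemizer (extend as needed).
ESPEAK_LANG = {
    "en": "en-us",
    "it": "it",
    "fr": "fr-fr",
    "hi": "hi",   # Hindi, as an example Indian language
}

def pick_phonemizer_lang(iso_code: str, default: str = "en-us") -> str:
    """Return the phonemizer language code for a detected language,
    falling back to a default when the language is not mapped."""
    return ESPEAK_LANG.get(iso_code, default)
```

The same helper works whether the code comes from a UI dropdown or from automatic detection; only the source of `iso_code` changes.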

nvadigauvce commented 1 day ago

@traderpedroso Thanks for answering all my questions in detail. I will try to build a multi-lingual TTS model and will report back if it is successful.