yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License
4.95k stars 417 forks

Trained StyleTTS2 for Hindi but didn't get good results #286

Open SandyPanda-MLDL opened 1 month ago

SandyPanda-MLDL commented 1 month ago

We trained the StyleTTS2 model for Hindi. We first trained PL-BERT for Hindi using the Espeak phonemizer and the IndicBERT tokenizer, then replaced the English PL-BERT with this newly trained Hindi PL-BERT. We did not train the other components, such as the ASR model and JDCNet, separately for Hindi. We then ran the StyleTTS2 stage 1 training code for 200 epochs (we also tried two to three times as many epochs) and used that checkpoint to run the stage 2 code for 200 epochs (again, we also tried two to three times as many). Even after this, the inference samples are not very natural: they fail to reproduce the phoneme-level information, meaning the pronunciations are not right. And yes, in the inference code we replaced the English PL-BERT with the Hindi PL-BERT.

Can anyone suggest what the probable reasons might be? The LJSpeech and LibriTTS samples shown in the StyleTTS2 GitHub repo are excellent, but we did not get samples that good for Hindi. We initially trained the PL-BERT model on 30 MB of Hindi text. When we instead used 18 GB of Hindi text to train PL-BERT, the quality and naturalness of the samples generated by StyleTTS2 degraded even further than what we were getting with 30 MB.

Respaired commented 1 month ago

Are you sure your phonemization is correct? Don't trust bootphon's Phonemizer (the one with the Espeak backend). That is garbage for most languages.

If you're getting really bad results, it most likely has little to do with the PL-BERT and ASR + JDC models. STTS2 is really robust in such cases.

SandyPanda-MLDL commented 1 month ago

@SoshyHayami thanks for your reply. Yes, I used the Espeak phonemizer and the IndicBERT tokenizer. My question is this: if the Espeak phonemizer and that tokenizer don't work well for PL-BERT training (for Hindi, or for any new language other than English), can you suggest which phonemizer and tokenizer one should use for Hindi? While training PL-BERT for Hindi we also observed that the loss did not decrease much as the iterations went on. We trained it for 1,000,000 iterations. I would also add that the quality of the generated samples degraded further when we trained PL-BERT on 18 GB of Hindi data (earlier we had trained on 30 MB of Hindi data). I am quite sure that the quality of the samples degraded because of PL-BERT alone, since we kept JDCNet and the ASR model as they were. We also tried fine-tuning the StyleTTS2 model on Hindi data, but we ended up with generated samples that speak Hindi with a foreign (European/US) accent, which makes them sound even more unnatural.

patriotyk commented 1 month ago

You cannot use the IndicBERT tokenizer for PL-BERT because it is a subword tokenizer. PL-BERT supports only word tokenizers, and that is a problem for most languages other than English. I don't know about Hindi, but Ukrainian has a lot of word cases, so it requires a very large dictionary for the tokenizer. Despite this, I was able to train it, and it sounds perfect, but the PL-BERT weights are gigabytes in size.
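To make the word-tokenizer point concrete, here is a minimal sketch of a word-level vocabulary that assigns exactly one ID per whole word. It is only an illustration: `build_word_vocab`, `encode_words`, the `<unk>` entry, and the tiny corpus are hypothetical names and data, not part of PL-BERT, and splitting on whitespace is an assumption that skips any real text normalization.

```python
from collections import Counter


def build_word_vocab(sentences, min_count=1, unk="<unk>"):
    """Build a word-level vocabulary: one ID per whole word (hypothetical helper)."""
    counts = Counter(word for s in sentences for word in s.split())
    vocab = {unk: 0}
    # Most frequent words get the smallest IDs.
    for word, n in counts.most_common():
        if n >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode_words(sentence, vocab, unk="<unk>"):
    """Return exactly one ID per whitespace-separated word."""
    return [vocab.get(word, vocab[unk]) for word in sentence.split()]


# Tiny illustrative corpus; a real Hindi corpus needs a far larger vocabulary.
corpus = ["नमस्ते दुनिया", "नमस्ते भारत"]
vocab = build_word_vocab(corpus)
ids = encode_words("नमस्ते भारत", vocab)
```

With this layout the number of IDs always equals the number of words, which a subword tokenizer does not guarantee. The cost is the one described above: every inflected form needs its own entry, so the dictionary grows quickly for morphologically rich languages.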

SandyPanda-MLDL commented 1 month ago

@patriotyk Sorry for creating confusion. What I meant to say is that I used Espeak ("hi", that is, Hindi) as the global phonemizer in the PL-BERT preprocessing code, together with the IndicBERT tokenizer, as shown below. This function belongs to the preprocessing part of the PL-BERT model (as shared in the official GitHub repo).

import string

import phonemizer
from transformers import AutoTokenizer

global_phonemizer = phonemizer.backend.EspeakBackend(language='hi', preserve_punctuation=True, with_stress=True)
tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)

# remove_accents and normalize_text_Hindi are assumed to be defined elsewhere in the preprocessing code.
def phonemize_hindi(text, global_phonemizer, tokenizer):
    print(f'remove accent {remove_accents(text)}')
    text = normalize_text_Hindi(remove_accents(text))
    words = tokenizer.tokenize(text)
    print(f'Tokenized text {words} and original text is {text}')

    # Phonemize each token separately; punctuation is passed through unchanged.
    phonemes_bad = [global_phonemizer.phonemize([word], strip=True)[0] if word not in string.punctuation else word for word in words]
    input_ids = []
    phonemes = []

    for i in range(len(words)):
        word = words[i]
        phoneme = phonemes_bad[i]
        # Index 1 skips the leading special token and keeps only the first piece ID.
        print(f'tokenizer.encode(word)[1] is {tokenizer.encode(word)[1]} word is {word} type is {type(word)}')
        input_ids.append(tokenizer.encode(word)[1])
        phonemes.append(phoneme)

    assert len(input_ids) == len(phonemes)
    return {'input_ids' : input_ids, 'phonemes': phonemes}

SandyPanda-MLDL commented 1 month ago

@patriotyk, as for the PL-BERT training steps: we first obtain the shard files, then obtain the token mapper pickle file (pickle.dump(token_maps, handle)), and then train the PL-BERT model (as described in the official PL-BERT GitHub repo). I followed all the steps in the same way and used the tokenizer and global phonemizer mentioned in my earlier comment.
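For readers unfamiliar with the token mapper step, the idea can be sketched roughly as follows: collect the tokenizer IDs that actually occur in the corpus, remap them to a compact range, and pickle the result. This is a sketch under assumptions, not the PL-BERT implementation: `build_token_maps`, the `{"token": ...}` entry layout, and the sample IDs are hypothetical, and the real pickle produced by the PL-BERT repo may store different fields.

```python
import io
import pickle


def build_token_maps(token_ids_seen):
    """Map each sparse tokenizer ID seen in the corpus to a dense index.

    Hypothetical helper; the entry layout is an assumption for illustration.
    """
    unique_ids = sorted(set(token_ids_seen))
    return {tok_id: {"token": dense} for dense, tok_id in enumerate(unique_ids)}


# Illustrative IDs only; in practice these come from the preprocessed shards.
seen = [7, 15003, 7, 42, 15003]
token_maps = build_token_maps(seen)

# Round-trip through pickle, using an in-memory buffer instead of a file handle.
buffer = io.BytesIO()
pickle.dump(token_maps, buffer)
buffer.seek(0)
restored = pickle.load(buffer)
```

The dense range keeps the prediction layer as small as the set of tokens that really appear, which matters more once the vocabulary is word-level and large.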

patriotyk commented 1 month ago

tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)

The tokenizer in this case will be a SentencePiece tokenizer. You also need to train the ASR model for your language, because you are working with a different language that uses different phonemes. In addition, the same phonemizer and symbol list should be used everywhere: for PL-BERT, for ASR, and for StyleTTS2 training and inference.
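One way to check the "same symbol list everywhere" requirement is to scan the phonemizer output for characters that the shared symbol list does not contain. The sketch below is a hedged illustration: `missing_symbols`, the `symbols` list, and the two phoneme strings are made up for the example and are not the StyleTTS2 symbol set or real Espeak output.

```python
def missing_symbols(phoneme_strings, symbols):
    """Count phoneme characters that are absent from the symbol list."""
    known = set(symbols)
    missing = {}
    for phonemes in phoneme_strings:
        for ch in phonemes:
            if ch not in known and not ch.isspace():
                missing[ch] = missing.get(ch, 0) + 1
    return missing


# Hypothetical symbol list and phoneme strings, for illustration only.
symbols = list("abdeiklmnoprstuɪəː ")
phonemized = ["nəməsteː", "bʱaːrət"]
gaps = missing_symbols(phonemized, symbols)
```

Running the same check against the text used for PL-BERT, the ASR model, and StyleTTS2 would show whether any stage sees phoneme characters that the others silently drop or mishandle.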

SandyPanda-MLDL commented 1 month ago

@patriotyk Thank you very much for your insightful suggestion. In that case, training the StyleTTS2 model for Hindi looks like a very lengthy process. Training PL-BERT on the 18 GB Hindi dataset (including preprocessing) took almost one month (A1400 machine). I hope that training the ASR model for Hindi will be easier, because there are some existing pretrained ASR models for Hindi that we can fine-tune to our requirements. Anyway, thank you very much. I am now wondering whether I can use IndicBERT in StyleTTS2 instead of PL-BERT, dropping the phoneme-level information and doing speech synthesis from plain text (word-level information) instead.