yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

Possible bug in LJSpeech training data #108

Open danielmsu opened 10 months ago

danielmsu commented 10 months ago

Some sentences in the LJSpeech dataset start with a quote, and it seems the quotes are replaced with $ in such cases.

For example, check this file: LJ005-0077.wav (vocaroo link)

Text from the original dataset: "expedient to introduce such measures and arrangements as shall not only provide for the safe custody,

Text from Data/train_list.txt: LJ005-0077.wav|dˈɑːlɚɹ ɛkspˈiːdiənt tʊ ˌɪntɹədˈuːs sˈʌtʃ mˈɛʒɚz ænd ɚɹˈeɪndʒmənts æz ʃˌæl nˌɑːt ˈoʊnli pɹəvˈaɪd fɚðə sˈeɪf kˈʌstədi ,|0

There is a dˈɑːlɚɹ at the beginning, but the audio doesn't actually contain this word. The same happens for other files that start with " (approx. 83 samples).

danielmsu commented 10 months ago

I have just stumbled upon some other examples, and it seems this also happens with quotes in the middle of a text.

LJ005-0068.wav:

Text from LJSpeech metadata: sought to "rank the prisons they built among the most splendid buildings of the city or town."

Phonemized text from train_list.txt: LJ005-0068.wav|sˈɔːt tə dˈɑːlɚ ɹˈæŋk ðə pɹˈɪzənz ðeɪ bˈɪlt ɐmˌʌŋ ðə mˈoʊst splˈɛndɪd bˈɪldɪŋz ʌvðə sˈɪɾi ɔːɹ tˈaʊn.dˈɑːlɚ|0

So both quotes in this text were phonemized as dˈɑːlɚ, which sounds like the word "dollar". Is this an intentional substitution or some kind of bug?
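To get a rough count of affected samples, something like the sketch below could be run over the file list. It assumes each line has the `name|phonemes|speaker` layout seen in the lines quoted above. `find_suspect_lines` and the `metadata_text` dictionary (wav stem mapped to the original sentence, which you would build yourself from the LJSpeech metadata) are hypothetical names for illustration, not part of the repository.

```python
# Phonemized "dollar" as it appears in the train_list.txt lines quoted above.
DOLLAR_IPA = "dˈɑːlɚ"


def find_suspect_lines(filelist_lines, metadata_text):
    """Return wav names whose phonemes contain "dollar" although the
    original sentence mentions neither "dollar" nor "$".

    filelist_lines: iterable of "name|phonemes|speaker" strings (assumed layout).
    metadata_text:  dict mapping wav stem (e.g. "LJ005-0068") to the raw sentence.
    """
    suspects = []
    for line in filelist_lines:
        parts = line.rstrip("\n").split("|")
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        wav, phonemes, _speaker = parts
        if DOLLAR_IPA not in phonemes:
            continue
        original = metadata_text.get(wav.rsplit(".", 1)[0])
        if original is None:
            continue  # no reference text to compare against
        if "dollar" not in original.lower() and "$" not in original:
            suspects.append(wav)
    return suspects
```

Sentences that really do mention dollars are skipped, so only lines where the word shows up in the phonemes without a source in the text are reported.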

Kreevoz commented 10 months ago

The phonemization + tokenization code that ships with StyleTTS2 processes the original LJSpeech sentence correctly, at least. It produces:

sˈɔːt tuː `` ɹˈæŋk ðə pɹˈɪzənz ðeɪ bˈɪlt ɐmˌʌŋ ðə mˈoʊst splˈɛndɪd bˈɪldɪŋz ʌvðə sˈɪɾi ɔːɹ tˈaʊn . ''

So that code is working.

Whatever generated the example train_list.txt must have been using some other phonemizer? Or maybe it didn't handle quotes properly. 🤔 It would be hilarious if the LJSpeech model had been trained with that odd typo in the training data!

For finetuning you can use the LibriTTS checkpoint as base. If the OOD_texts.txt file is any indication (it contains part of the LibriTTS dataset), then the processing of that dataset did not introduce that error into the transcripts, so it should be ok to use that checkpoint for further things.

yl4579 commented 10 months ago

Unfortunately it seems like a bug. I took the data directly from the VITS repo (https://github.com/jaywalnut310/vits/blob/main/filelists/ljs_audio_text_test_filelist.txt.cleaned) without scrutinizing it.

@Kreevoz I guess you are correct😂. I just tested it and the model can't pronounce "dollar" because of this bug (so "dollar" was mapped to silence in this model): https://vocaroo.com/1EK8v2wnFGUH (the text was: "LJSpeech model couldn't pronounce the word, "dollar", because of a bug in preprocessing of VITS's repo.")

I did the preprocessing for OOD_texts.txt myself (which was also the training data for the LibriTTS model) and it works fine there, although I noticed that the hyphen character (-) was used as a dash, which makes the model pause unnaturally when it shouldn't.

yl4579 commented 10 months ago

Maybe I'll redo the preprocessing of the LJSpeech dataset and train a new model with the corrected data file when I get time.

yl4579 commented 10 months ago

I just checked VITS's dataset and found it doesn't have "dollar". I don't know how I ended up with dollars there, so it's not VITS's problem but mine. It has to be fixed with a new model for sure.

yl4579 commented 10 months ago

@Kreevoz I found another problem. The quote in the LibriTTS dataset was actually "content", not ``content'': https://raw.githubusercontent.com/yl4579/StyleTTS2/main/Data/OOD_texts.txt, so the inference code for sentences with quotes is also wrong.
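One possible way to bring inference in line with the "content" form is to map the tokenizer's quote tokens back to a plain double quote before the text reaches the model. This is only a sketch: it assumes the tokenized output uses `` and '' as in Kreevoz's example above, and `normalize_quotes` is a hypothetical helper, not something that exists in the repository. Spacing around the quote is left untouched.

```python
def normalize_quotes(tokenized: str) -> str:
    """Replace the two-character quote tokens `` and '' with a plain ".

    Assumption: the input is the space-joined tokenizer output, where an
    opening quote appears as `` and a closing quote as ''.
    """
    return tokenized.replace("``", '"').replace("''", '"')
```

For example, `normalize_quotes("tuː `` ɹˈæŋk . ''")` gives `tuː " ɹˈæŋk . "`.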

Kreevoz commented 10 months ago

Was the code for the tokenization changed during the development of StyleTTS2? I think the implementation you settled on in the end is pretty good though.

ziyaad30 commented 10 months ago

Could it be caused by this: https://github.com/yl4579/StyleTTS2/blob/2c427fc45291d5a046d4d46eb0c99d97b0cc1606/meldataset.py#L23