yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Confused about OOD data preparation #56

Closed: ruby11dog closed this issue 8 months ago

ruby11dog commented 8 months ago

Hello, I'm training StyleTTS 2 on Mandarin, but I'm confused about the OOD data. Should the OOD data be multi-speaker, or can it be single-speaker?

WendongGan commented 8 months ago

As I understand it, only text is needed; wav files are not needed.

devidw commented 8 months ago

When I look into the repo's Data/OOD_texts.txt file, it's formatted like this:

LibriTTS/train-clean-360/3072/155948/3072_155948_000007_000011.wav|ʌv kˈoːɹs bˈɑːksɪŋ ʃˌʊd biː ɛŋkˈɜːɹɪdʒd ɪnðɪ ˈɑːɹmi ænd nˈeɪvi.|1037

The format looks similar to what's required for train_list.txt (path|phonemes|speaker id).

Regarding the file path: in the train_list.txt and val_list.txt files it's a relative path based on the root dir configured in the config, whereas here these are absolute paths. I'm not sure if both are generally supported.

However, the project's README notes the following about that file:

OOD_data: The path for out-of-distribution texts for SLM adversarial training. The format should be text|anything

So I am also confused about how to properly build such a file.

yl4579 commented 8 months ago

As the description says, it's text|anything, so it doesn't have to be a path or speaker ID; it can be anything. What matters is just the text, not what comes after the |. The separator is there just for coding purposes (and also for convenience, because this OOD file is exactly the training file I used for the LibriTTS model).
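For anyone building a custom OOD file, here is a minimal sketch of one possible approach, assuming the phonemizer package with an espeak-ng backend (which the repo already uses for phonemization); the input file name ood_raw.txt, the language code, and the 0 placeholder after the | are illustrative choices, not anything the repo prescribes:

```python
# Hypothetical helper (not part of the StyleTTS2 repo): converts a plain-text
# corpus with one sentence per line into the "text|anything" format.
from phonemizer.backend import EspeakBackend

# Assumption: English OOD text via espeak-ng; swap the language code
# (e.g. 'cmn' for Mandarin) depending on what your espeak build supports.
backend = EspeakBackend('en-us', preserve_punctuation=True, with_stress=True)

with open('ood_raw.txt', encoding='utf-8') as fin, \
     open('Data/OOD_texts.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        text = line.strip()
        if not text:
            continue
        ipa = backend.phonemize([text])[0].strip()
        # Anything after the separator is ignored during training,
        # so a constant placeholder is enough.
        fout.write(f'{ipa}|0\n')
```

Because the first column here does not contain .wav, the loader will read it directly as the text (see below).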

devidw commented 8 months ago

Hey @yl4579

Thanks for your quick response, much appreciated

So we can literally use any text and turn it into IPA as the 2nd column?

Is there a recommended size for that dataset relative to the other two files? And does it even make sense to customize it, or could the provided sample just be reused?

Thx

yl4579 commented 8 months ago

Yes, see https://github.com/yl4579/StyleTTS2/blob/main/meldataset.py#L98: basically, if the first column (the path) contains .wav, it will use the 2nd column as the text; otherwise it will use the first column.
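In other words, the line parsing amounts to roughly this (a simplified sketch of the check described above, not the verbatim code from meldataset.py):

```python
# Simplified sketch of the column selection described above
# (not the verbatim code from meldataset.py).
def ood_text_from_line(line: str) -> str:
    cols = line.rstrip('\n').split('|')
    # "path/to/file.wav|ipa text|speaker" -> use the 2nd column;
    # "ipa text|anything"                 -> use the 1st column.
    return cols[1] if '.wav' in cols[0] else cols[0]

assert ood_text_from_line('a/b.wav|həlˈoʊ wˈɜːld|12') == 'həlˈoʊ wˈɜːld'
assert ood_text_from_line('həlˈoʊ wˈɜːld|0') == 'həlˈoʊ wˈɜːld'
```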

The OOD texts can be anything that is out of distribution. This is to improve the robustness of synthesized speech against texts that are widely different from your dataset, so depending on your needs you can put in anything you want the model to be more robust against. I just used LibriTTS because it is public domain and big enough to cover most cases of audiobook reading.

devidw commented 8 months ago

Ohh I see, gotcha, that makes sense then.

Thx, sg sg