Closed ruby11dog closed 8 months ago
As I understand it, only text is needed; wav files are not needed.
When I look into the repo's Data/OOD_texts.txt file, it's formatted like this:
LibriTTS/train-clean-360/3072/155948/3072_155948_000007_000011.wav|ʌv kˈoːɹs bˈɑːksɪŋ ʃˌʊd biː ɛŋkˈɜːɹɪdʒd ɪnðɪ ˈɑːɹmi ænd nˈeɪvi.|1037
The format looks similar to what's required for train_list.txt. For the file path, the train_list.txt and val_list.txt files use a relative path based on the root dir configured in the config, whereas here these are absolute paths; I'm not sure whether both are generally supported.
However, in the project's README the following is noted about that file:
OOD_data: The path for out-of-distribution texts for SLM adversarial training. The format should be text|anything
So I am also confused about how to properly build such a file.
As the description says, it's text|anything, so it doesn't have to be a path or speaker ID; it can be anything. What matters is just the text, not what comes after |. The separator is only there for coding purposes (and for convenience, because this OOD file is exactly the training file I used for the LibriTTS model).
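To make the text|anything format concrete, here is a minimal sketch of building such a file from plain strings. The helper name `write_ood_file` is my own, not from the repo, and it assumes your texts are already phonemized if your pipeline expects IPA:

```python
# Hypothetical helper (not part of StyleTTS2): write texts in "text|anything" form.
def write_ood_file(texts, path="OOD_texts.txt"):
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # "0" is an arbitrary filler; only the text before "|" matters.
            f.write(f"{text}|0\n")

write_ood_file(["ʌv kˈoːɹs bˈɑːksɪŋ ʃˌʊd biː ɛŋkˈɜːɹɪdʒd.", "hɛlˈoʊ wˈɜːld."])
```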
Hey @yl4579
Thanks for your quick response, much appreciated
So we can just literally use any text and turn it into ipa as the 2nd column?
Is there a recommended size for that dataset relative to the other two files? Does it even make sense to customize it, or can the provided sample just be reused?
Thx
Yes, see https://github.com/yl4579/StyleTTS2/blob/main/meldataset.py#L98. Basically, if the path contains .wav, it will use the 2nd column; otherwise it will use the first column.
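The column-selection behavior described above can be sketched as follows. This is my own simplified reconstruction, not the exact code from meldataset.py:

```python
# Simplified sketch of the selection logic: if the first field looks like a
# wav path, the text lives in the second field; otherwise in the first.
def ood_text(line: str) -> str:
    fields = line.strip().split("|")
    if ".wav" in fields[0]:   # LibriTTS-style line: path|phonemes|speaker
        return fields[1]
    return fields[0]          # plain "text|anything" line

ood_text("some/path/audio.wav|ipa text here|1037")  # → "ipa text here"
ood_text("any text you like|0")                     # → "any text you like"
```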
OOD texts are anything out of distribution. This is to improve the robustness of synthesized speech against texts that are widely different from the dataset, so depending on your needs you can put in anything you want the texts to be more robust against. I just used LibriTTS because it is in the public domain and big enough to cover most cases of audiobook reading.
Ohh I see, gotcha, that makes sense then.
Thx, sg sg
Hello, I'm training StyleTTS2 in Mandarin, but I'm confused about the OOD data: should it be multi-speaker data, or can it be single-speaker?