I find the language a little bit unclear when describing the contents of the text and audio files used for fine-tuning.
First off, what are acceptable contents of an utterance file? Total silence? Only one phoneme? Only one word? A whole sentence?
Secondly, should the comma-separated list in the text file have timestamps? Or is it just a chronological list of the phonemes in the associated wave file? Should the list repeat a phoneme if it occurs several times in the wave file? CAN a wave file even contain the same phoneme more than once?
Any answers appreciated!