Content of fine-tuning files?

I find the wording a little unclear where it describes the contents of the text and audio files used for fine-tuning.

First off, what are the acceptable contents of an utterance file? Total silence? Only one phoneme? Only one word? A whole sentence?

Secondly, should the comma-separated list in the text file have timestamps? Or is it just a chronological list of the phonemes in the associated wave file? Should the list repeat a phoneme if it occurs several times in the wave file? CAN a wave file contain the same phoneme several times?

Any answers appreciated!

xinjli / allosaurus