xinjli / allosaurus

Allosaurus is a pretrained universal phone recognizer for more than 2000 languages
GNU General Public License v3.0

Phone duration is always 0.045 #38

Open artrayd opened 3 years ago

artrayd commented 3 years ago

No matter what, the phone duration is always 0.045, which doesn't sound right, even if I say something like "Ooooooooh yeeeeees".

4.080 0.045 iː
4.320 0.045 tʲ
4.410 0.045 iː

xinjli commented 3 years ago

This issue is caused by the loss function used in the model (CTC). Unfortunately, CTC is known to have this "peaky" behavior, and it cannot be fixed.

artrayd commented 3 years ago

Hi @xinjli, thank you for your answer. Another question: is it theoretically possible to mark pauses, that is, times when there is no voice at all?

willstott101 commented 2 years ago

I don't know what your use-case is @artrayd but we've had success combining the current audio volume with allosaurus output for generating animated lipsync. It patches over any sounds allosaurus does not recognise, and helps us respond to silence correctly.

However, if your use case does not have clean enough audio for that, we have seen that allosaurus is remarkably good at just not outputting anything for periods with non-speech sounds. So simply finding gaps of a particular length in the allosaurus output may well be suitable.
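For illustration, the gap-finding idea could be sketched roughly as below. This is only a minimal sketch, not part of allosaurus: `parse_timestamps`, `find_gaps`, and the `min_gap` threshold are hypothetical names, and it assumes the timestamped output has one `<start> <duration> <phone>` row per line, as in the sample at the top of this issue. Because the reported duration is a fixed 0.045, the gaps it finds are only approximate, so the threshold probably needs tuning per use case.

```python
from typing import List, Optional, Tuple

Row = Tuple[float, float, str]  # (start, duration, phone)


def parse_timestamps(text: str) -> List[Row]:
    """Parse lines of the assumed form '<start> <duration> <phone>'."""
    rows: List[Row] = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that does not look like a phone row
        start, duration, phone = parts
        rows.append((float(start), float(duration), phone))
    return rows


def find_gaps(
    rows: List[Row],
    min_gap: float = 0.3,
    total_length: Optional[float] = None,
) -> List[Tuple[float, float]]:
    """Return (gap_start, gap_end) spans with no recognized phone.

    min_gap is a hypothetical threshold in seconds; total_length, if
    given, lets the trailing silence be reported as well.
    """
    gaps: List[Tuple[float, float]] = []
    prev_end = 0.0
    for start, duration, _phone in sorted(rows):
        if start - prev_end >= min_gap:
            gaps.append((prev_end, start))
        prev_end = max(prev_end, start + duration)
    if total_length is not None and total_length - prev_end >= min_gap:
        gaps.append((prev_end, total_length))
    return gaps


sample = """
4.080 0.045 iː
4.320 0.045 tʲ
4.410 0.045 iː
"""
rows = parse_timestamps(sample)
gaps = find_gaps(rows, min_gap=0.1)
```

With the sample rows and a 0.1 second threshold, this reports the leading silence before the first phone and the stretch between the first and second phones. Whether those spans are real pauses or just long vowels is something the fixed duration cannot tell you, which is why combining this with audio volume, as described above, may work better on clean audio.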

artrayd commented 2 years ago

@willstott101 thank you! I was thinking in the same direction.

62mkv commented 2 years ago

I suspect that, due to this issue, allosaurus "swallows" multiple phones when the speech is rather quick (in Estonian, for example, native speakers tend to produce sounds quickly because the words are so long). Might this be the case? If so, what is a CTC model, and where can I learn more about it?

62mkv commented 2 years ago

(I suspect that the -e option might be intended to compensate for that "fixed duration" thingy. Unfortunately, it does not seem to do any better for the overall outcome.)