xinjli / allosaurus

Allosaurus is a pretrained universal phone recognizer for more than 2000 languages
GNU General Public License v3.0

The timestamp of model 'interspeech21' is incorrect #62

Open owaski opened 2 years ago

owaski commented 2 years ago

I ran the following command:

python -m allosaurus.run --timestamp=True -i sample.wav -m interspeech21

and it gives me

0.040 0.025 ɑ
0.080 0.025 l
0.100 0.025 ʌ
0.120 0.025 s
0.140 0.025 o
0.170 0.025 ɹ
0.180 0.025 ə
0.200 0.025 s

These timestamps are incorrect for the sample audio. It seems the window shift is set incorrectly.

SlistInc commented 2 years ago

I am struggling with the timing as well. Is anybody aware of a library that can do forced alignment of phonemes based on the output from allosaurus? I would really appreciate any input and tips on how I can improve the output from allosaurus.

JourneyToSilius commented 2 years ago

I am also looking for something like this

xinjli commented 2 years ago

Hi guys, sorry I was a bit busy with other projects and my internship in the last few months and did not have time to look at it.

I forgot to account for the subsampling factor of the conv layer. I fixed it in the latest commit.
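To illustrate the kind of error described here, the following is a minimal sketch, not the actual allosaurus code. It assumes a 10 ms window shift and a subsampling factor of 3, values that are merely consistent with the outputs posted in this thread; `frame_to_seconds` and the frame indices are hypothetical.

```python
def frame_to_seconds(frame_index, window_shift=0.01, subsampling=1):
    """Convert an output frame index to an onset time in seconds.

    window_shift and subsampling are assumed values for illustration only.
    """
    return frame_index * window_shift * subsampling


# Hypothetical output frame indices at which phones were emitted.
frames = [4, 8, 10]

# Ignoring the conv-layer subsampling compresses every onset.
uncorrected = [round(frame_to_seconds(f), 3) for f in frames]

# Including the subsampling factor stretches onsets back to real time.
corrected = [round(frame_to_seconds(f, subsampling=3), 3) for f in frames]

print(uncorrected)
print(corrected)
```

If the model emits one output frame per three input frames, leaving out the factor makes every onset three times too early, which would match the compressed timestamps in the original report.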

kzgajos commented 1 year ago

A very useful library -- thank you for creating it. I also have a timing issue. The onset of the phonemes seems to be reported correctly, but the duration of each shows as 0.045 regardless of how long each phoneme actually is. I need to detect pauses so accurate durations would be very helpful. Here's the output I get:

0.840 0.045 ʔ
0.870 0.045 a
0.900 0.045 l̪
0.960 0.045 t̪
0.990 0.045 ɒ
1.080 0.045 k͡p̚
1.140 0.045 a
1.260 0.045 t̪
1.320 0.045 ɒ
1.380 0.045 t̪
1.440 0.045 ɒ
1.470 0.045 k
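Since the reported duration is constant, one possible workaround is to look at the gaps between consecutive onsets instead. The sketch below is only a heuristic under stated assumptions: it expects one `onset duration phone` triple per line, and `parse_timestamps`, `find_pauses`, and the `min_gap` threshold are hypothetical names and values, not part of allosaurus. A large gap may also be a long phone rather than silence.

```python
def parse_timestamps(text):
    """Parse lines of 'onset duration phone' into (onset, duration, phone) tuples."""
    rows = []
    for line in text.strip().splitlines():
        onset, duration, phone = line.split(maxsplit=2)
        rows.append((float(onset), float(duration), phone))
    return rows


def find_pauses(rows, min_gap=0.1):
    """Return (onset, next_onset) pairs whose onset-to-onset gap is at least min_gap seconds.

    min_gap is an assumed threshold; tune it for your audio.
    """
    pauses = []
    for (onset, _, _), (next_onset, _, _) in zip(rows, rows[1:]):
        if next_onset - onset >= min_gap:
            pauses.append((onset, next_onset))
    return pauses


sample = """
1.080 0.045 k͡p̚
1.140 0.045 a
1.260 0.045 t̪
1.320 0.045 ɒ
"""

rows = parse_timestamps(sample)
print(find_pauses(rows))
```

The candidate spans run from one onset to the next, so they include the phone itself. Subtracting a typical phone length would give a rough estimate of the silent part.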