How was the training data processed?

Pita commented 3 years ago

Hello,

We're trying to evaluate allosaurus for a pronunciation trainer, but currently the results fluctuate too much to be reliable. Are there any tips for getting more consistent results? How was the training data recorded, and was it processed in some way (compressor, noise reduction, etc.)? With this information we could adjust our input data and might get better results.

Peter

xinjli commented 3 years ago

Hi,

Thanks for the question! What do you mean by fluctuation? Are you using the current model for your evaluation, or are you trying to fine-tune it?

Pita commented 3 years ago

I mean that a lot of phonemes are detected incorrectly, especially consonants. I was trying to detect the English th sound, and it's almost impossible. I'm using the built-in model.

Pita commented 3 years ago

Attached is an example file, generated by a Google synthetic voice. The model recognizes only one th sound (θ), but there should be three. With my own recordings I could never reproduce a θ sound.

google-th.wav.zip

xinjli commented 3 years ago

The model was trained by mixing many languages and many recording environments, so I would not be surprised if it fails to recognize a particular sound in a particular language.

We will release a couple of new models trained specifically on each of the major languages, including English (hopefully next month), so you can try that model once it is released. It should significantly increase accuracy on English.

For the current model, expanding the top-k candidates as described in the README might surface some of the phones you expect:

`$ python -m allosaurus.run --lang=eng -i google-th.wav --topk=5`

`t (0.339) θ (0.197) b (0.138) ð (0.074) s (0.058) | a (0.917) (0.054) ɑ (0.012) ɒ (0.009) e (0.002) | ɪ (0.915) (0.067) j (0.007) l (0.003) ɒ (0.002) | b (0.836) v (0.114) f (0.026) p (0.016) ð (0.002) | a (0.933) ɑ (0.026) ɔ (0.021) ʌ (0.007) ɛ (0.006) | ə (0.700) (0.290) a (0.006) ɔ (0.001) ɒ (0.001) | f (0.425) x (0.241) t (0.124) ɡ (0.087) k (0.051) | t (0.868) tʰ (0.107) k (0.010) (0.007) d (0.002) | a (0.893) ɑ (0.027) ɔ (0.019) ɛ (0.016) ɒ (0.011) | b (0.508) (0.388) p (0.072) t (0.010) ð (0.005) | ɹ (0.637) r (0.202) ɔ (0.046) uː (0.041) (0.023) | ɛ (0.396) a (0.139) æ (0.129) ə (0.103) e (0.096) | s (0.628) (0.089) t (0.088) x (0.081) θ (0.028)`
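For a pronunciation trainer, it may be easier to scan the top-k output programmatically than by eye. The following is a minimal sketch that parses text in the format shown above (frames separated by `|`, each candidate written as `phone (probability)`) and lists the frames where a target phone such as θ appears among the candidates. The helper names `parse_topk` and `frames_with` are hypothetical and not part of allosaurus, and the sample string is just three frames copied from the output above. A probability with no phone in front of it, as seen in several frames, is stored with an empty phone string.

```python
import re

# Matches a probability token such as "(0.339)".
PROB = re.compile(r"^\((\d+(?:\.\d+)?)\)$")


def parse_topk(output):
    """Parse top-k text into a list of frames, each a list of (phone, prob)."""
    frames = []
    for chunk in output.split("|"):
        candidates, phone = [], ""
        for token in chunk.split():
            match = PROB.match(token)
            if match:
                # A probability closes the current candidate; the phone is ""
                # when the text shows a probability with no symbol before it.
                candidates.append((phone, float(match.group(1))))
                phone = ""
            else:
                phone = token
        if candidates:
            frames.append(candidates)
    return frames


def frames_with(frames, target, min_prob=0.0):
    """Return (frame_index, prob) for every frame listing `target`."""
    return [
        (index, prob)
        for index, candidates in enumerate(frames)
        for phone, prob in candidates
        if phone == target and prob >= min_prob
    ]


# Three frames copied from the output in this thread (first, second, last).
SAMPLE = (
    "t (0.339) θ (0.197) b (0.138) ð (0.074) s (0.058) | "
    "a (0.917) (0.054) ɑ (0.012) ɒ (0.009) e (0.002) | "
    "s (0.628) (0.089) t (0.088) x (0.081) θ (0.028)"
)

frames = parse_topk(SAMPLE)
print(frames_with(frames, "θ"))
print(frames_with(frames, "θ", min_prob=0.1))
```

A threshold like `min_prob` is one possible way to decide whether a non-top candidate should count as a detection; a suitable value would have to be tuned on real recordings.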

Pita commented 3 years ago

Hello @xinjli, did you already release a new model?

xinjli commented 3 years ago

Not yet; we hope to release the model this month.

xinjli commented 3 years ago

The new model has been released. Hope it will be helpful :)

xinjli / allosaurus

How was the training data processed? #19