xinjli / allosaurus

Allosaurus is a pretrained universal phone recognizer for more than 2000 languages
GNU General Public License v3.0
550 stars 86 forks

Does allosaurus handle mixed speech and non-speech data? #11

Closed dinkar--s closed 3 years ago

dinkar--s commented 3 years ago

Hi, thanks again for a great program.

I tried to run allosaurus on an approximately 15-minute TED talk and got the error below. When I extracted a 5-second speech excerpt from the same talk, allosaurus seemed to work. Did allosaurus crash because the TED talk starts with about 12 seconds of music? Here's the error message:

python -m allosaurus.run -i ~/datasets/tedlium3-wav/NaliniNadkarni_2009.wav
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/allosaurus/run.py", line 59, in <module>
    phones = recognizer.recognize(args.input, args.lang)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/allosaurus/app.py", line 56, in recognize
    tensor_batch_lprobs = self.am(tensor_batch_feat, tensor_batch_feat_len)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/allosaurus/am/allosaurus_torch.py", line 88, in forward
    hidden_pack_sequence, _ = self.blstm_layer(pack_sequence)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 580, in forward
    self.num_layers, self.dropout, self.training, self.bidirectional)
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm

dinkar--s commented 3 years ago

Here's the output from the 5-second clip:

python -m allosaurus.run -i N*.wav
ð ə m o w s t m ɪ s t ɪ ɹ i ə s p ɑ ɹ tʰ ʌ v f ɔ ɹ ə s ɪ z ð ə ʌ p ɹ̩ t ɹ i k æ n ə b̥ i

xinjli commented 3 years ago

Hi,

I have not seen this error before. Did you feed the entire TED talk into the recognizer? If so, the input audio might be too long.

dinkar--s commented 3 years ago

Thanks - will check that out - what is the maximum length of the audio?

dinkar--s commented 3 years ago

Thanks - that might be it - I created a 25-second clip that includes the music at the beginning, and it seemed to work

xinjli commented 3 years ago

I have not tried very long audio myself; I think the limit depends on your PC's memory. Clips shorter than 30 seconds should usually work. You can split your talk into smaller pieces with a tool such as a voice activity detector.
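
For anyone landing here later, the splitting suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions: `split_wav` is a hypothetical helper (not part of allosaurus), it assumes an uncompressed PCM WAV input, and it cuts at fixed 30-second boundaries rather than using voice activity detection, so a cut may land in the middle of a word.

```python
import wave
from pathlib import Path


def split_wav(src, out_dir, chunk_seconds=30.0):
    """Split a PCM WAV file into consecutive chunks of at most chunk_seconds.

    Hypothetical helper for illustration; not part of allosaurus.
    Returns the list of chunk paths in playback order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []

    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = int(chunk_seconds * reader.getframerate())
        if frames_per_chunk < 1:
            raise ValueError("chunk_seconds is too small for this sample rate")

        index = 0
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break  # reached the end of the input file
            chunk_path = out_dir / f"{Path(src).stem}_{index:04d}.wav"
            with wave.open(str(chunk_path), "wb") as writer:
                # Copy channels, sample width, and rate from the source;
                # the frame count in the header is corrected on close.
                writer.setparams(params)
                writer.writeframes(frames)
            chunk_paths.append(chunk_path)
            index += 1

    return chunk_paths


# Usage sketch (requires allosaurus to be installed; file names are placeholders):
# from allosaurus.app import read_recognizer
# model = read_recognizer()
# for path in split_wav("talk.wav", "chunks"):
#     print(model.recognize(str(path)))
```

Fixed-length cuts are the simplest option; splitting at silences found by a voice activity detector would avoid cutting through a phone, at the cost of an extra dependency.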