xinjli / allosaurus

Allosaurus is a pretrained universal phone recognizer for more than 2000 languages
GNU General Public License v3.0

Not able to transcribe simple word what in English #60

Closed FilBot3 closed 2 years ago

FilBot3 commented 2 years ago

The issue

I am currently trying to use Allosaurus to help a Speech Language Pathologist perform transcriptions, but I am having trouble getting the application to recognize the word "what", let alone longer WAV files with more complex sentences in them. The WAV file is attached. The output I get from Allosaurus is empty:

~/Downloads❯ python -m allosaurus.run -i what.wav --model eng2102 --lang eng

~/Downloads❯ 

I even installed the eng2102 model.

~/Downloads❯ python -m allosaurus.bin.list_model
Available Models
- uni2005 (default)
~/Downloads❯ python -m allosaurus.bin.download_model -m eng2102
downloading model  eng2102
from:  https://github.com/xinjli/allosaurus/releases/download/v1.0/eng2102.tar.gz
to:    /home/filbot/.local/lib/python3.9/site-packages/allosaurus/pretrained
please wait...
~/Downloads❯ python -m allosaurus.bin.list_model               
Available Models
- uni2005 (default)
- eng2102

The file was recorded on a Tascam DR-40X as 32-bit WAV, then transferred over to a Pop!_OS Linux system.

Python Version

~/Downloads❯ python -V
Python 3.9.7

Pop!_OS Version

~/Downloads❯ neofetch 
             /////////////                filbot@pop-os 
         /////////////////////            ------------- 
      ///////*767////////////////         OS: Pop!_OS 21.10 x86_64 
    //////7676767676*//////////////       Host: Oryx Pro oryp6 
   /////76767//7676767//////////////      Kernel: 5.15.23-76051523-generic 
  /////767676///*76767///////////////     Uptime: 1 hour, 40 mins 
 ///////767676///76767.///7676*///////    Packages: 2857 (dpkg), 90 (flatpak) 
/////////767676//76767///767676////////   Shell: zsh 5.8 
//////////76767676767////76767/////////   Resolution: 1920x1080 
///////////76767676//////7676//////////   DE: GNOME 40.5 
////////////,7676,///////767///////////   WM: Mutter 
/////////////*7676///////76////////////   WM Theme: Pop 
///////////////7676////////////////////   Theme: Pop-dark [GTK2/3] 
 ///////////////7676///767////////////    Icons: Pop [GTK2/3] 
  //////////////////////'////////////     Terminal: gnome-terminal 
   //////.7676767676767676767,//////      CPU: Intel i7-10875H (16) @ 5.100GHz 
    /////767676767676767676767/////       GPU: Intel CometLake-H GT2 [UHD Graphics] 
      ///////////////////////////         Memory: 3052MiB / 31977MiB 
         /////////////////////
             /////////////               

Attachment (what.wav file): what.wav.zip

The question

I feel like I'm not doing something correctly. Do I need to train Allosaurus to listen for English sounds as well? I expect to see something similar to wʌt.

xinjli commented 2 years ago

Hi,

It looks like your file uses 32-bit samples, while the original model assumes 16-bit samples. So you first need to convert it to 16-bit with a command such as: sox what.wav -b 16 what_out.wav

After that, it should give you good results:

$ python -m allosaurus.run -i what_out.wav 
w ɒ tʲ
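
Before converting, it can help to confirm what a recording actually contains. The following is a minimal sketch using only Python's standard `wave` module. The `wav_format` helper and the `WavFormat` tuple are hypothetical names introduced here, not part of Allosaurus, and the sketch assumes the file holds integer PCM data (as what.wav does), since `wave` rejects other encodings.

```python
import wave
from collections import namedtuple

# Hypothetical container for the header fields relevant to this thread.
WavFormat = namedtuple("WavFormat", "channels bits_per_sample sample_rate duration_s")


def wav_format(path):
    """Read the header of an integer-PCM WAV file and summarize its format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return WavFormat(
            channels=wav.getnchannels(),
            # getsampwidth() reports bytes per sample, so multiply by 8 for bits.
            bits_per_sample=wav.getsampwidth() * 8,
            sample_rate=rate,
            duration_s=wav.getnframes() / rate,
        )
```

Calling `wav_format("what.wav")` on the original recording should report 32 bits per sample, which is the mismatch described above.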

FilBot3 commented 2 years ago

Thank you, I'll try this out. To nitpick, I didn't see that requirement mentioned in the README.

xinjli commented 2 years ago

Yeah, I should mention it in the README. Thanks for the reminder!

FilBot3 commented 2 years ago

Interesting, with ffmpeg, I think I did it correctly, but got a different result.

ffmpeg results

```bash
~/Downloads❯ ffmpeg -i what.wav -acodec pcm_s16le -ar 44100 -ac 1 what_16bit.wav
ffmpeg version 4.4-6ubuntu5 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-7ubuntu1)
  configuration: --prefix=/usr --extra-version=6ubuntu5 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil 56. 70.100 / 56. 70.100
  libavcodec 58.134.100 / 58.134.100
  libavformat 58. 76.100 / 58. 76.100
  libavdevice 58. 13.100 / 58. 13.100
  libavfilter 7.110.100 / 7.110.100
  libswscale 5. 9.100 / 5. 9.100
  libswresample 3. 9.100 / 3. 9.100
  libpostproc 55. 9.100 / 55. 9.100
Guessed Channel Layout for Input Stream #0.0 : stereo
Input #0, wav, from 'what.wav':
  Duration: 00:00:00.47, bitrate: 2823 kb/s
  Stream #0:0: Audio: pcm_s32le ([1][0][0][0] / 0x0001), 44100 Hz, stereo, s32, 2822 kb/s
File 'what_16bit.wav' already exists. Overwrite? [y/N] y
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s32le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to 'what_16bit.wav':
  Metadata:
    ISFT : Lavf58.76.100
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, mono, s16, 705 kb/s
    Metadata:
      encoder : Lavc58.134.100 pcm_s16le
size= 40kB time=00:00:00.46 bitrate= 710.3kbits/s speed=87.4x
video:0kB audio:40kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.189532%
~/Downloads❯ python -m allosaurus.run -i what_16bit.wav
w tʲ
```

versus my sox results.

```bash
~/Downloads❯ sudo apt install sox
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libsox-fmt-alsa libsox-fmt-base libsox3
Suggested packages:
  libsox-fmt-all
The following NEW packages will be installed:
  libsox-fmt-alsa libsox-fmt-base libsox3 sox
0 upgraded, 4 newly installed, 0 to remove and 1 not upgraded.
Need to get 369 kB of archives.
After this operation, 1,265 kB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://us.archive.ubuntu.com/ubuntu impish/universe amd64 libsox3 amd64 14.4.2+git20190427-2 [226 kB]
Get:2 http://us.archive.ubuntu.com/ubuntu impish/universe amd64 libsox-fmt-alsa amd64 14.4.2+git20190427-2 [10.5 kB]
Get:3 http://us.archive.ubuntu.com/ubuntu impish/universe amd64 libsox-fmt-base amd64 14.4.2+git20190427-2 [31.5 kB]
Get:4 http://us.archive.ubuntu.com/ubuntu impish/universe amd64 sox amd64 14.4.2+git20190427-2 [102 kB]
Fetched 369 kB in 0s (831 kB/s)
Selecting previously unselected package libsox3:amd64.
(Reading database ... 468084 files and directories currently installed.)
Preparing to unpack .../libsox3_14.4.2+git20190427-2_amd64.deb ...
Unpacking libsox3:amd64 (14.4.2+git20190427-2) ...
Selecting previously unselected package libsox-fmt-alsa:amd64.
Preparing to unpack .../libsox-fmt-alsa_14.4.2+git20190427-2_amd64.deb ...
Unpacking libsox-fmt-alsa:amd64 (14.4.2+git20190427-2) ...
Selecting previously unselected package libsox-fmt-base:amd64.
Preparing to unpack .../libsox-fmt-base_14.4.2+git20190427-2_amd64.deb ...
Unpacking libsox-fmt-base:amd64 (14.4.2+git20190427-2) ...
Selecting previously unselected package sox.
Preparing to unpack .../sox_14.4.2+git20190427-2_amd64.deb ...
Unpacking sox (14.4.2+git20190427-2) ...
Setting up libsox3:amd64 (14.4.2+git20190427-2) ...
Setting up libsox-fmt-alsa:amd64 (14.4.2+git20190427-2) ...
Setting up libsox-fmt-base:amd64 (14.4.2+git20190427-2) ...
Setting up sox (14.4.2+git20190427-2) ...
Processing triggers for libc-bin (2.34-0ubuntu3) ...
Processing triggers for man-db (2.9.4-2) ...
Processing triggers for mailcap (3.69ubuntu1) ...
~/Downloads❯ sox what.wav -b 16 what_16bit.wav
sox WARN dither: dither clipped 2 samples; decrease volume?
~/Downloads❯ python -m allosaurus.run -i what_16bit.wav
w ɒ tʲ
```

Thank you for showing me the way.
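
Both conversions above reduce each sample from 32 to 16 bits. As a rough illustration of that step alone, here is a pure-Python sketch that keeps the top 16 bits of every sample. The `to_16bit` helper is a hypothetical name, not something sox, ffmpeg, or Allosaurus provides. It is also not equivalent to either tool: it applies no dither (sox does, per its warning above), and it neither downmixes to mono nor resamples.

```python
import array
import wave


def to_16bit(src_path, dst_path):
    """Rewrite a 32-bit integer PCM WAV as 16-bit PCM by plain truncation."""
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 4:
            raise ValueError("expected 32-bit integer PCM input")
        channels = src.getnchannels()
        rate = src.getframerate()
        raw = src.readframes(src.getnframes())

    samples32 = array.array("i")  # signed int, 32 bits wide on common platforms
    if samples32.itemsize != 4:
        raise RuntimeError("array type 'i' is not 32 bits wide on this platform")
    samples32.frombytes(raw)

    # An arithmetic right shift keeps the sign and drops the low 16 bits.
    # No dither is applied, so this is the crudest possible reduction.
    samples16 = array.array("h", (s >> 16 for s in samples32))

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(channels)  # channel count is left unchanged
        dst.setsampwidth(2)         # 2 bytes per sample = 16-bit
        dst.setframerate(rate)      # sample rate is left unchanged
        dst.writeframes(samples16.tobytes())
```

For real recordings, sox or ffmpeg remain the better choice; the sketch only shows what "converting to 16-bit" means at the sample level.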

FilBot3 commented 2 years ago

It also seems like the audio needs to be downsampled to a 16000 Hz (16 kHz) sample rate:

```bash
~/Downloads❯ ffmpeg -i what.wav -acodec pcm_s16le -ac 1 -ar 16000 what_16bit.wav
ffmpeg version 4.4-6ubuntu5 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-7ubuntu1)
  configuration: --prefix=/usr --extra-version=6ubuntu5 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil 56. 70.100 / 56. 70.100
  libavcodec 58.134.100 / 58.134.100
  libavformat 58. 76.100 / 58. 76.100
  libavdevice 58. 13.100 / 58. 13.100
  libavfilter 7.110.100 / 7.110.100
  libswscale 5. 9.100 / 5. 9.100
  libswresample 3. 9.100 / 3. 9.100
  libpostproc 55. 9.100 / 55. 9.100
Guessed Channel Layout for Input Stream #0.0 : stereo
Input #0, wav, from 'what.wav':
  Duration: 00:00:00.47, bitrate: 2823 kb/s
  Stream #0:0: Audio: pcm_s32le ([1][0][0][0] / 0x0001), 44100 Hz, stereo, s32, 2822 kb/s
File 'what_16bit.wav' already exists. Overwrite? [y/N] y
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s32le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to 'what_16bit.wav':
  Metadata:
    ISFT : Lavf58.76.100
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder : Lavc58.134.100 pcm_s16le
size= 15kB time=00:00:00.46 bitrate= 257.9kbits/s speed=62.1x
video:0kB audio:15kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.522368%
~/Downloads❯ python -m allosaurus.run -i what_16bit.wav
w a tʰ
```
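
For batches of recordings, the ffmpeg invocation above could be wrapped in a small script. This is a sketch under assumptions: `ffmpeg_args` and `convert` are hypothetical helper names, ffmpeg is assumed to be on the PATH, and the `-y` flag (overwrite the output without prompting) is an addition not used in the thread. The remaining flags are the ones from the command above.

```python
import subprocess


def ffmpeg_args(src, dst, sample_rate=16000):
    """Build the ffmpeg command line for 16-bit, mono output at sample_rate Hz."""
    return [
        "ffmpeg",
        "-y",                     # assumption: overwrite dst instead of prompting
        "-i", src,
        "-acodec", "pcm_s16le",   # 16-bit signed little-endian PCM
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # output sample rate in Hz
        dst,
    ]


def convert(src, dst, sample_rate=16000):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(ffmpeg_args(src, dst, sample_rate), check=True)
```

Calling `convert("what.wav", "what_16bit.wav")` would run the same conversion as the command above, after which the output file can be passed to `python -m allosaurus.run -i what_16bit.wav`.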