yonatankarimish / YonaVox

Machine learning speech recognition for air conditioner voice activation

[Need help] I can't figure out a way to raise the "Weighted Train Accuracy" #2

Open diyism opened 1 year ago

diyism commented 1 year ago

Dear sir @yonatankarimish, I've made a recorder.ipynb in GoogleDrive/ColabData/YonaVox so that I can record my phrases 50 times in the Chrome browser on my Android phone, with 1 second of silence inserted between phrases using the "mute" button.
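Conceptually, each saved track is just the phrase clips with about 1 second of silence between them. A sketch of how such a track could be assembled (not the actual recorder.ipynb code; the file names and sample rate are hypothetical):

```python
# Sketch only: join recorded phrase clips with 1 second of silence between them.
# File names and the 16 kHz sample rate are placeholders, not the notebook's values.
import numpy as np
import soundfile as sf

SAMPLE_RATE = 16000
clips = [sf.read(f"phrase_{i}.wav")[0] for i in range(50)]   # hypothetical clip files
silence = np.zeros(SAMPLE_RATE, dtype=np.float32)            # the 1-second "mute" gap

track = np.concatenate([np.concatenate([clip, silence]) for clip in clips])
sf.write("gai4_ge1_qing3_hui2_da2.wav", track, SAMPLE_RATE)
```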

I've recorded 6 tracks in GoogleDrive/ColabData/YonaVox/ac_audio/phrases: gai4_ge1_qing3_hui2_da2.wav, gai4_ge1_qing3_da2_hui2.wav, gai4_ge1_hui2_qing3_da2.wav, ge1_gai4_hui2_da2_qing3.wav, ge1_gai4_da2_hui2_qing3.wav, and ge1_gai4_da2_qing3_hui2.wav.

I've also modified GoogleDrive/ColabData/YonaVox/Phoneme_spectrogramcreator(public_version).ipynb to extract only the first 50 phrases into hebrew_speech_train and hebrew_speech_test. I've tested my modification and it seems to work perfectly (Google Colaboratory screenshot attached, 2023-01-20).
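Conceptually, the split I made looks something like this (a simplified sketch, not the exact notebook code; the source folder name and 80/20 ratio are just illustrative):

```python
# Sketch: keep only the first 50 extracted phrase clips and divide them
# roughly 80/20 into hebrew_speech_train / hebrew_speech_test.
import shutil
from pathlib import Path

clips = sorted(Path("extracted_phrases").glob("*.wav"))[:50]   # hypothetical source folder
split = int(len(clips) * 0.8)

for i, clip in enumerate(clips):
    target = Path("hebrew_speech_train" if i < split else "hebrew_speech_test")
    target.mkdir(exist_ok=True)
    shutil.copy(clip, target / clip.name)
```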

I ran your two notebooks with your recorded phrases, and the "Weighted Train Accuracy" reached 0.93. I think Mandarin pinyin syllables are far simpler than English and Hebrew ones, because Mandarin syllables have no trailing consonants, for example: hei ba i t ka be ma z gan (Hebrew, "hey, open the air conditioner") vs. hei da kai kong tiao (Mandarin, "hey, open the air conditioner"). So I expect that Mandarin pinyin syllable recognition will be easier than English and Hebrew.

But when I try it with my own recordings, I can't figure out a way to raise my "Weighted Train Accuracy"; it always stays at 0.62:

Model metrics for epoch 1458: 
Test Loss: 0.9432518482208252
Weighted Train Accuracy: 0.6216300573204951
Weighted Test Accuracy: 0.5801774095772347
Training time was 0.8528854846954346 sec

Model metrics for epoch 1459: 
Test Loss: 0.9433050751686096
Weighted Train Accuracy: 0.6216300573204951
Weighted Test Accuracy: 0.5801774095772347
Training time was 0.8505825996398926 sec

Model metrics for epoch 1460: 
Test Loss: 0.9433295726776123
Weighted Train Accuracy: 0.6216300573204951
Weighted Test Accuracy: 0.5801774095772347
Training time was 0.8507113456726074 sec

I've shared my whole YonaVox Google Drive folder with your Gmail account with editor permissions. Do you have time to have a look?

yonatankarimish commented 1 year ago

@diyism Thank you for sharing your Jupyter notebooks with me.

First of all, I wish to apologize for not fully clarifying that the augmentations in the preprocessing notebook (background noise, background music, spectrogram masking, dynamic time warping, etc.) were not used in training the models committed to the GitHub project. While I stated this in the paper (https://arxiv.org/ftp/arxiv/papers/2103/2103.13997.pdf), I forgot to note it in the spectrogram creation notebook.

It should be possible to obtain unaugmented spectrograms by running the notebook up to where they are first saved to disk. I will properly clarify this in future versions of the notebook.
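For reference, a log-mel spectrogram of the kind the pipeline works with can be computed and saved in a few lines; an illustrative sketch (not the notebook's exact code or parameter values):

```python
# Illustrative only: compute a log-mel spectrogram for one phrase and save it to
# disk, skipping every augmentation. The parameters here are examples, not the
# notebook's actual values.
import librosa
import numpy as np

audio, sr = librosa.load("phrase.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=512, hop_length=128, n_mels=64)
log_mel = librosa.power_to_db(mel)
np.save("phrase_spectrogram.npy", log_mel)
```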


Regarding the model's failure to converge - there are a few steps I would try in your place:

1) Try to overfit the model (on purpose) to correctly classify just a single spectrogram. If this does not work, the issue is probably model-related (an overfitting sketch appears after this list).

2) Record more phrases. The original training set had ~3100 phrases (unaugmented), which was of sufficient size for the model to learn from. According to your notebook, the Mandarin training set has just 60 phrases (before augmentations). While it could be possible for the model to properly learn on a smaller training set, it seems like recording just five examples of every phrase might not be enough.

(As a side note, the augmentations themselves not only make it harder for the model to correctly predict a phoneme sequence, but might sometimes even alter the sound that the spectrogram represents. For example, reconstructing an augmented spectrogram using a method such as Griffin-Lim might result in distorted or different phrases being spoken.)

3) Ensure there is a long enough break between consecutive phrases in each recording. While in your case the preprocessing was able to extract the recorded phrases properly, the VAD (voice activity detection) algorithm used is very simple, and you might not be as lucky with recordings containing more phrases unless they are properly spaced (a toy VAD illustration appears after this list).

4) Properly annotate the tracks. In your case, the tracks were not separated into the smallest possible phonemes, which can adversely affect the model's ability to predict correct sequences.

I have annotated one of your recordings as an example (annotation screenshot attached).

5) Record your tracks without background noise. Other speakers can be heard in some of your tracks. While this might not always be easy, do understand that it affects the data quality.

6) Inspect your confusion matrix to see whether the model repeatedly mistakes a particular phoneme for another. This can be corrected by adding more examples of the mistaken and correct phonemes and/or by making sure the phonemes have distinct, discernible pronunciations in the recordings (a confusion-matrix sketch appears after this list).
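For point 1, a minimal sketch of the single-example sanity check. The tiny model and random tensors below are stand-ins; replace them with the notebook's actual model and one real (spectrogram, label) pair:

```python
# Sanity check for point 1: verify the training loop can drive the loss to ~0 on a
# single example. Everything below is a stand-in, not the YonaVox network.
import torch

spectrogram = torch.randn(1, 1, 64, 100)   # stand-in: (batch, channel, mel bins, frames)
labels = torch.randint(0, 10, (1, 5))      # stand-in: 5 phoneme ids out of 10 classes

model = torch.nn.Sequential(               # stand-in model
    torch.nn.Flatten(),
    torch.nn.Linear(64 * 100, 5 * 10),
    torch.nn.Unflatten(1, (5, 10)),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

for step in range(300):
    optimizer.zero_grad()
    logits = model(spectrogram)                            # (1, 5, 10)
    loss = criterion(logits.view(-1, 10), labels.view(-1))
    loss.backward()
    optimizer.step()

print(loss.item())   # should approach zero if the loop can memorize one example
```

If the real model cannot drive the loss toward zero on one of your own spectrograms, the problem is likely in the model or training loop rather than in the data.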
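For point 3, a toy illustration of why phrases need to be well separated. This is a generic energy-threshold VAD, not the notebook's actual algorithm; the threshold and frame length are arbitrary:

```python
# Toy energy-threshold VAD (not YonaVox's actual algorithm): frames whose RMS
# energy exceeds a threshold count as speech, and adjacent speech frames merge
# into one segment, so two phrases with too short a gap come out as one phrase.
import numpy as np
import soundfile as sf

audio, sr = sf.read("track.wav")            # assumes a mono recording
frame = int(0.02 * sr)                      # 20 ms frames
frames = audio[: len(audio) // frame * frame].reshape(-1, frame)
energy = np.sqrt((frames ** 2).mean(axis=1))
speech = energy > 0.02                      # hand-tuned threshold

segments, start = [], None
for i, is_speech in enumerate(speech):
    if is_speech and start is None:
        start = i
    elif not is_speech and start is not None:
        segments.append((start * frame / sr, i * frame / sr))
        start = None
if start is not None:
    segments.append((start * frame / sr, len(speech) * frame / sr))

print(segments)   # phrases separated by long enough silence appear as separate segments
```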
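For point 6, a generic confusion-matrix check using scikit-learn. The short lists are placeholder data; use the flattened target and predicted phoneme ids collected from your evaluation loop:

```python
# Which phoneme is most often mistaken for which? y_true / y_pred below are
# placeholders - collect the real ids from the test loop.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 2, 2, 1, 0, 2, 1]   # placeholder target phoneme ids
y_pred = [0, 1, 1, 2, 1, 0, 1, 1]   # placeholder predicted phoneme ids

cm = confusion_matrix(y_true, y_pred)
print(cm)

for true_id, row in enumerate(cm):
    off_diagonal = row.copy()
    off_diagonal[true_id] = 0
    if off_diagonal.any():
        mistaken_id = int(off_diagonal.argmax())
        print(f"phoneme {true_id} is most often confused with phoneme {mistaken_id} "
              f"({off_diagonal[mistaken_id]} times)")
```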

I hope this helps you to obtain better data, and get your model to converge :)

diyism commented 1 year ago

Indeed, I didn't use the augmentation part (just after the link https://arxiv.org/pdf/1904.08779.pdf in Phoneme_spectrogramcreator(public_version).ipynb), so my problem must lie in the six points you mentioned. I'll try to improve my recording quality and work through the other points.

Thanks for your detailed reply, best wishes to you.

diyism commented 1 year ago

After adding some single-phoneme wav files, I've successfully raised the Weighted Train Accuracy from 0.62 to 0.88 (screenshot of the phrases folder in Google Drive attached).

So I replaced YonaVox/app/src/main/assets/vox_decoder.pt with the newly generated version, and modified the populateTokenMaps() function according to the 5th step of Hebrew_AC_voiceactivation(public_version).ipynb (screenshot attached).

And I added debug logging to the transcribe() function (screenshot attached).

Then I built and ran the app on my Android phone and said "gai4" to it, but the Android Studio logcat prints:

I/System.out: Speech detected!
I/System.out: Low-pass + Downsample took 16 ms
I/System.out: Spectrogram creation took 92 ms
E/System: Ignoring attempt to set property "file.encoding" to value "UTF-8".
I/System.out: Log-mel conversion took 111 ms
I/System.out: Finished converting audio => spectrogram
W/Thread-2: type=1400 audit(0.0:578839): avc: denied { read } for name="u:object_r:vendor_default_prop:s0" dev="tmpfs" ino=26790 scontext=u:r:untrusted_app:s0:c154,c257,c512,c768 tcontext=u:object_r:vendor_default_prop:s0 tclass=file permissive=0
E/libc: Access denied finding property "ro.hardware.chipname"
W/native: [W TensorImpl.h:930] Warning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (function operator())
I/System.out: ==================predictedSyllable: noise
I/System.out: ==================predictedSyllable: end
    Phoneme prediction took 48 ms

Have I done something wrong?

Sorry for bothering you again.

I'm dreaming of a real-time syllable recognition engine that is as precise as a mechanical keyboard, so that I can send the recognized syllables to large language models like ChatGPT-4 for text analysis (ref: https://github.com/openai/whisper/discussions/318#discussioncomment-5499879).

Unfortunately, such an engine does not exist yet, so I have to come back to seek your help. I'm planning to train on all ~1,300 Mandarin syllables.

diyism commented 1 year ago

I've just found the project https://github.com/k2-fsa/sherpa-ncnn and its Android build guide https://k2-fsa.github.io/sherpa/ncnn/android/build-sherpa-ncnn.html.
Its shared object occupies only 3 MB, but its model occupies 230 MB. I should find out whether its training is as easy as YonaVox's.
