xmos / fwk_voice

Voice Framework

Test pipeline performance with AGC using VNR inference output instead of VAD output #255

Closed shuchitak closed 2 years ago

shuchitak commented 2 years ago

I have calc_vnr_pred() ported to IC now, and I've added calls to it in both the single-threaded and multi-threaded pipeline examples. Instead of the VAD output, I'm passing output_vnr_pred to AGC. All other pipeline code is unchanged. I ran the pipeline quick tests on my machine and one of the tests fails: 19 keywords are detected, which is below the pass threshold of 21. With VAD being sent to AGC, 21 keywords are detected. There are 25 keywords in this stream, spaced at roughly 5 second intervals, with the first keyword at around the 3 second mark.

I have plotted the VAD output, output_vnr_pred and input_vnr_pred for every frame. plot_vad_vnr_InHouse_XVF3510v080_v1

I've also attached the output wav files (extension changed to .txt to attach): output_vnr_16.wav.txt output_vad_16.wav.txt

Sensory output with VAD:

./spot_eval_exe/spot-eval_x86_64-apple-darwin -t model/spot-alexa-rpi-31000.snsr /Users/shuchitak/sandboxes/sw_avona_vnr/sw_avona/examples/bare-metal/pipeline_alt_arch/output_vad_16.wav

7995 8580 alexa
12645 13245 alexa
22035 22620 alexa
26790 27465 alexa
31515 32175 alexa
36240 36900 alexa
41010 41700 alexa
45765 46275 alexa
50415 50910 alexa
55080 55620 alexa
59565 60165 alexa
64140 64695 alexa
68775 69330 alexa
73425 73950 alexa
78030 78525 alexa
91845 92430 alexa
96540 97125 alexa
101220 101745 alexa
105885 106620 alexa
110670 111270 alexa
115335 115965 alexa

Sensory output with VNR:

7980 8580 alexa
12645 13245 alexa
22035 22620 alexa
31515 32175 alexa
36240 36885 alexa
41010 41700 alexa
45765 46275 alexa
50415 50910 alexa
55080 55620 alexa
59565 60165 alexa
64140 64710 alexa
68775 69330 alexa
73425 73950 alexa
78030 78525 alexa
91845 92430 alexa
96540 97125 alexa
105885 106620 alexa
110670 111270 alexa
115335 115965 alexa

With VNR, the keywords at 26790 and 101220 ms are not detected.
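For anyone repeating this comparison, here is a minimal sketch of how the two Sensory detection lists above could be diffed automatically. It assumes each detection is a `start_ms end_ms keyword` triple, as in the output above. The helper names `parse_detections` and `missed` are hypothetical and are not part of spot-eval or fwk_voice.

```python
def parse_detections(text):
    """Parse whitespace-separated 'start_ms end_ms keyword' triples."""
    tokens = text.split()
    return [(int(tokens[i]), int(tokens[i + 1]), tokens[i + 2])
            for i in range(0, len(tokens), 3)]


def missed(reference, candidate):
    """Return reference detections that no candidate detection overlaps in time."""
    out = []
    for r_start, r_end, r_kw in reference:
        hit = any(c_kw == r_kw and c_start < r_end and r_start < c_end
                  for c_start, c_end, c_kw in candidate)
        if not hit:
            out.append((r_start, r_end, r_kw))
    return out


# Detection lists copied from the Sensory output in this comment.
VAD_LOG = """7995 8580 alexa 12645 13245 alexa 22035 22620 alexa 26790 27465 alexa
31515 32175 alexa 36240 36900 alexa 41010 41700 alexa 45765 46275 alexa
50415 50910 alexa 55080 55620 alexa 59565 60165 alexa 64140 64695 alexa
68775 69330 alexa 73425 73950 alexa 78030 78525 alexa 91845 92430 alexa
96540 97125 alexa 101220 101745 alexa 105885 106620 alexa 110670 111270 alexa
115335 115965 alexa"""

VNR_LOG = """7980 8580 alexa 12645 13245 alexa 22035 22620 alexa 31515 32175 alexa
36240 36885 alexa 41010 41700 alexa 45765 46275 alexa 50415 50910 alexa
55080 55620 alexa 59565 60165 alexa 64140 64710 alexa 68775 69330 alexa
73425 73950 alexa 78030 78525 alexa 91845 92430 alexa 96540 97125 alexa
105885 106620 alexa 110670 111270 alexa 115335 115965 alexa"""

vad = parse_detections(VAD_LOG)
vnr = parse_detections(VNR_LOG)
missed_by_vnr = missed(vad, vnr)

print(len(vad), "detections with VAD,", len(vnr), "with VNR")
for start, end, kw in missed_by_vnr:
    print("missed with VNR:", start, end, kw)
```

Matching on time overlap rather than exact timestamps is deliberate, since the same keyword can be reported with slightly different boundaries in the two runs (for example 7995 vs 7980).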

Next steps:

shuchitak commented 2 years ago

I've added plots of the VAD flag and VNR flag to the generated plots. vad_flag or vnr_flag is 1 if the prediction value is greater than a threshold and 0 otherwise. I noticed that with the threshold set to 0.8, vnr_pred_flag was set to 1 only about half the time when a keyword was present. threshold_0 8_plot_vad_vnr_ 'InHouse_XVF3510v080_v1', '2_20190423_Loc1_Noise2_70dB__Take1', 'wav'

When I set threshold to 0.5, vnr_pred_flag is set to 1 during many more speech instances. threshold_0 5_plot_vad_vnr_ 'InHouse_XVF3510v080_v1', '2_20190423_Loc1_Noise2_70dB__Take1', 'wav'

VAD output for this stream on the other hand seems to be constantly all over the place.

I reran the pipeline quick tests with threshold set to 0.5 and they pass on my local setup.
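To make the effect of the threshold concrete, here is a minimal sketch of the flag computation described above: the flag is 1 when the prediction is greater than the threshold and 0 otherwise. The function name `pred_to_flag` and the per-frame prediction values are made up for illustration; they are not taken from the pipeline code or from the plotted stream.

```python
def pred_to_flag(pred, threshold):
    """Return 1 if the prediction exceeds the threshold, else 0."""
    return 1 if pred > threshold else 0


# Hypothetical per-frame VNR predictions (illustrative values only).
vnr_preds = [0.10, 0.45, 0.60, 0.85, 0.40]

flags_0_8 = [pred_to_flag(p, 0.8) for p in vnr_preds]
flags_0_5 = [pred_to_flag(p, 0.5) for p in vnr_preds]

print("threshold 0.8:", flags_0_8)
print("threshold 0.5:", flags_0_5)
```

With these made-up values, the frame at 0.60 is flagged as speech only at the lower threshold, which is the kind of difference seen between the two plots.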

Next step is to run the full pipeline tests on Jenkins depending on when Jenkins is available for this testing.

shuchitak commented 2 years ago

I tested VNR vs VAD on another test stream. This is the test stream we use in our pipeline example. For this stream, the VAD output looks better than VNR and setting speech detection threshold to 0.5 for VNR makes things even worse.

With detection threshold of 0.8 threshold_0 8_plot_vad_vnr_pipeline_example_input wav

With detection threshold 0.5 threshold_0 5_plot_vad_vnr_pipeline_example_input wav

Sensory detects 4 keywords when using VAD for AGC. It detects 1 keyword with VNR+AGC+Threshold_0.5 and 3 keywords with VNR+AGC+Threshold_0.8. Based on this, reducing the threshold doesn't look like a promising solution.

shuchitak commented 2 years ago

I ran the sw_avona full pipeline test with VNR+AGC+threshold_0.8 and VNR+AGC+threshold_0.5 and compared the results with what we get for VAD+AGC on head of develop. Attaching the comparison results. pipeline_results_agc_vnr.xlsx

AGC_VNR_THRESHOLD of 0.5 does better than 0.8, however not as well as the VAD results.

shuchitak commented 2 years ago

I picked one test case, InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav, for which VNR_threshold_0.5 (83 keywords) detects 5 fewer keywords than VAD (88 keywords). I've attached the pipeline output files that were sent to Sensory for both. VAD_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav.txt [VNR_0.5_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav.txt](https://github.com/xmos/sw_avona/files/8804268/VNR_0.5_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dBTake1.wav.txt)

On running both these files through Sensory, the keywords at the following time instances are not detected when using VNR:

186210 186765 (1.00 sv) alexa
195555 196155 (1.00 sv) alexa
209700 210240 (1.00 sv) alexa
261030 261600 (1.00 sv) alexa
429000 429645 (1.00 sv) alexa

VNR_keywords_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav.txt VAD_keywords_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav.txt

I reran the pipeline test on a trimmed version (5 minutes long) of this stream and plotted the VNR and VAD gains, as well as the AGC output gain when using VNR and VAD. plot_vad_vnr_avona_example_bare_metal_pipeline_multi_thread xe

Command to plot: python compare_vad_vnr.py --stdo pipeline_stdo_trim_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav.txt

pipeline_stdo_trim_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav.txt

compare_vad_vnr.py.zip

Looking more closely at the plots, VNR misses a keyword at around 182 seconds (output_vnr_pred only goes up to 0.4), and the AGC gain drops to around 654 at that point. VNR does detect the next 4 keywords, but the gain stays low. With VAD, the gain also falls to 654 at the 182 second mark but goes back up during the next 4 keywords. After talking to @Allan-xmos, we think the first step is to find out why the gain is not following the VNR keyword detections. AGC needs to be tuned for VNR now.

shuchitak commented 2 years ago

Helpful presentation from Dan, https://xmosjira.atlassian.net/wiki/spaces/HYD/pages/1818034280/AGC+Loss+Control+-+Design+and+Tuning

shuchitak commented 2 years ago

Since the gain had fallen and was not going back up for VNR, I experimented by changing agc_config gain_dec from 0.87 to 0.99 so that the gain goes down in smaller steps. The number of keywords detected increased from 50 to 53 (compared to 54 with VAD). plot_vad_vnr_test_plot
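As a rough illustration of why a gain_dec closer to 1 makes the gain fall more slowly, here is a minimal sketch. It assumes gain_dec is a multiplicative factor applied once per adaptation step whenever the AGC decides to reduce the gain. This is a simplification and not the actual AGC code, which has further conditions on when it adapts. The starting gain of 1000 is an illustrative value, the target of 654 is the level seen in the plots, and `steps_to_reach` is a hypothetical helper.

```python
def steps_to_reach(start_gain, target_gain, gain_dec):
    """Count multiplicative decrease steps until the gain is at or below target."""
    if not 0.0 < gain_dec < 1.0:
        raise ValueError("gain_dec must be strictly between 0 and 1")
    gain = start_gain
    steps = 0
    while gain > target_gain:
        gain *= gain_dec  # assumed update rule: one decrease step
        steps += 1
    return steps


START_GAIN = 1000.0   # illustrative starting gain (assumption)
TARGET_GAIN = 654.0   # level the gain dropped to in the plots

steps_old = steps_to_reach(START_GAIN, TARGET_GAIN, 0.87)
steps_new = steps_to_reach(START_GAIN, TARGET_GAIN, 0.99)

print("gain_dec 0.87:", steps_old, "steps")
print("gain_dec 0.99:", steps_new, "steps")
```

Under these assumptions the same drop takes 4 decrease steps at 0.87 and 43 at 0.99, so a single missed keyword pulls the gain down far less before the next detection arrives.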

I ran the full pipeline tests overnight with this change. The keyword count with VNR now matches VAD for this particular stream, but some other streams show far fewer keywords detected, particularly the InHouse_XVF3510v080_v1.2_20190423_Loc1_Clean_XMOS_DUT1_80dB_Take1.wav stream, which has 12 fewer keywords in alt arch than with VAD. pipeline_results_agc_vnr.xlsx

shuchitak commented 2 years ago

It seems that for streams with higher noise level, Sensory likes the voice level to be louder while for streams with low noise levels, Sensory likes voice level to be quieter.

For example, for the InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav stream, where there's more noise in the output, an AGC gain close to 1000 does well. pipeline_stdo_trim_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav.txt.50.txt plot_vad_vnr_trim_InHouse_XVF3510v080_v1 2_20190423_Loc1_Noise1_70dB__Take1 wav

For the InHouse_XVF3510v080_v1.2_20190423_Loc1_Clean_XMOS_DUT1_80dB_Take1.wav stream, when processed with the alt arch pipeline, the noise level is very low and VNR outputs really sharp peaks. For this stream, an AGC gain of about 600 gets the best kwd score. pipeline_stdo_InHouse_XVF3510v080_v1.2_20190423_Loc1_Clean_XMOS_DUT1_80dB_Take1.wav.98.txt compare_vad_vnr.py.zip plot_vad_vnr_trim_InHouse_XVF3510v080_v1 2_20190423_Loc1_Clean_XMOS_DUT1_80dB_Take1 wav

The AGC seems tuned for VAD output and the false positives in VAD in high noise streams help in pulling up the gain and improving Sensory kwd score. We don't see similar false positives with VNR, so the gain not being set as high during the voice portions deteriorates the kwd score. I tried tuning the AGC to improve kwd scores for the noisy streams but those changes cause kwd scores in clean streams to go down.

shuchitak commented 2 years ago

In the voice pipeline performance meeting on 01/06, it was decided that the Acoustics team will take a look at adapting AGC to work with VNR.

I'm putting this issue on hold for now.

shuchitak commented 2 years ago

Results with Amazon WWE.

AGC+VNR with voice detection threshold set to 0.5

Alt-arch agc_vnr_0_5.csv agc_vnr_0_5

Prev-arch agc_vnr_0_5_prev_arch.csv agc_vnr_0_5_prev_arch

In general, Prev-arch with VNR+AGC is better than VAD+AGC.

Alt-arch has the following streams where VNR+AGC has a lower kwd score than VAD+AGC.

| Input | develop_results_Avona_alt_arch_xcore_Amazon_WR_250k.en-US | agc_vnr_0.5_results_Avona_alt_arch_xcore_Amazon_WR_250k.en-US | Diff, agc+vnr vs agc+vad |
| -- | -- | -- | -- |
| InHouse_XVF3510v080_v1.2_20190423_Loc1_Clean_XMOS_DUT1_80dB_Take1.wav | 98 | 97 | -1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc1_Clean_XMOS_DUT1_90dB_Take1.wav | 23 | 21 | -2 |
| InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_80dB__Take1.wav | 88 | 87 | -1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise2_80dB__Take1.wav | 30 | 29 | -1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc2_Clean_XMOS_DUT1_90dB_Take1.wav | 14 | 13 | -1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc3_Clean_XMOS_DUT1_80dB_Take1.wav | 97 | 96 | -1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc3_Clean_XMOS_DUT1_90dB_Take1.wav | 18 | 15 | -3 |
| InHouse_XVF3510v080_v1.2_20190423_Loc3_Noise1_80dB__Take1.wav | 93 | 90 | -3 |
| InHouse_XVF3510v080_v1.2_20190423_Loc3_Noise2_65dB__Take1.wav | 99 | 98 | -1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc3_Noise2_70dB__Take1.wav | 93 | 92 | -1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc3_Noise2_80dB__Take1.wav | 38 | 37 | -1 |

Alt-arch is better with VNR+AGC than VAD+AGC for the following streams.

| Input | develop_results_Avona_alt_arch_xcore_Amazon_WR_250k.en-US | agc_vnr_0.5_results_Avona_alt_arch_xcore_Amazon_WR_250k.en-US | Diff, agc+vnr vs agc+vad |
| -- | -- | -- | -- |
| InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_65dB_XMOS_DUT1_80dB_Take1.wav | 89 | 90 | 1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc2_Clean_XMOS_DUT1_80dB_Take1.wav | 93 | 94 | 1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc2_Noise2_80dB__Take1.wav | 27 | 30 | 3 |
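For a quick overall picture, here is a small sketch that tallies the per-stream differences from the two alt-arch tables above. The scores are copied from those tables. The stream labels are shortened to the part of the file name that differs, and the variable names are illustrative. Only the streams listed in the two tables are included.

```python
# (stream label, VAD+AGC on develop, VNR+AGC with threshold 0.5)
RESULTS = [
    ("Loc1_Clean_XMOS_DUT1_80dB", 98, 97),
    ("Loc1_Clean_XMOS_DUT1_90dB", 23, 21),
    ("Loc1_Noise1_80dB", 88, 87),
    ("Loc1_Noise2_80dB", 30, 29),
    ("Loc2_Clean_XMOS_DUT1_90dB", 14, 13),
    ("Loc3_Clean_XMOS_DUT1_80dB", 97, 96),
    ("Loc3_Clean_XMOS_DUT1_90dB", 18, 15),
    ("Loc3_Noise1_80dB", 93, 90),
    ("Loc3_Noise2_65dB", 99, 98),
    ("Loc3_Noise2_70dB", 93, 92),
    ("Loc3_Noise2_80dB", 38, 37),
    ("Loc1_Noise1_65dB_XMOS_DUT1_80dB", 89, 90),
    ("Loc2_Clean_XMOS_DUT1_80dB", 93, 94),
    ("Loc2_Noise2_80dB", 27, 30),
]

# Positive diff means VNR+AGC scored higher than VAD+AGC on that stream.
diffs = {name: vnr - vad for name, vad, vnr in RESULTS}
worse = {name: d for name, d in diffs.items() if d < 0}
better = {name: d for name, d in diffs.items() if d > 0}
net = sum(diffs.values())

print(len(worse), "streams worse with VNR, total", sum(worse.values()))
print(len(better), "streams better with VNR, total", sum(better.values()))
print("net difference over listed streams:", net)
```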

shuchitak commented 2 years ago

Results with Amazon WWE.

AGC+VNR with voice detection threshold set to 0.8

Alt-arch agc_vnr_0_8.csv agc_vnr_0_8

Prev-arch agc_vnr_0_8_prev_arch.csv agc_vnr_0_8_prev_arch

Detection threshold of 0.5 performs better in the keywords tests than detection threshold of 0.8.

shuchitak commented 2 years ago

The pipeline examples in fwk_voice now have VNR output prediction, with voice detection threshold set to 0.5, passed to AGC. This change was merged to develop as part of the new IC PR https://github.com/xmos/fwk_voice/pull/321.