Closed shuchitak closed 2 years ago
I've added plots of the VAD flag and VNR flag to the generated plots. vad_flag or vnr_flag is 1 when the corresponding prediction value is greater than a threshold, and 0 otherwise. I noticed that with the threshold set to 0.8, vnr_pred_flag was set to 1 only about half the time when a keyword was present.
When I set threshold to 0.5, vnr_pred_flag is set to 1 during many more speech instances.
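For reference, a minimal sketch of the flag computation described above. The function name and the example prediction values are hypothetical and only illustrate how lowering the threshold from 0.8 to 0.5 turns more frames into speech frames; this is not the pipeline's actual code.

```python
def prediction_to_flag(predictions, threshold):
    """Return 1 for each frame whose prediction exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in predictions]


# Hypothetical per-frame VNR predictions (illustrative, not from the test stream).
example_pred = [0.10, 0.45, 0.62, 0.85, 0.55, 0.30]

flags_08 = prediction_to_flag(example_pred, 0.8)
flags_05 = prediction_to_flag(example_pred, 0.5)

print("threshold 0.8:", flags_08)
print("threshold 0.5:", flags_05)
```

With these made-up values, only one frame is flagged at 0.8, while three frames are flagged at 0.5.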
VAD output for this stream on the other hand seems to be constantly all over the place.
I reran the pipeline quick tests with threshold set to 0.5 and they pass on my local setup.
Next step is to run the full pipeline tests on Jenkins depending on when Jenkins is available for this testing.
I tested VNR vs VAD on another test stream. This is the test stream we use in our pipeline example. For this stream, the VAD output looks better than VNR and setting speech detection threshold to 0.5 for VNR makes things even worse.
With detection threshold of 0.8
With detection threshold 0.5
Sensory detects 4 keywords when using VAD for AGC. It detects 1 keyword with VNR+AGC+Threshold_0.5 and 3 keywords with VNR+AGC+Threshold_0.8. Based on this, reducing the threshold doesn't look like a promising solution.
I ran the sw_avona full pipeline test with VNR+AGC+threshold_0.8 and VNR+AGC+threshold_0.5 and compared the results with what we get for VAD+AGC on head of develop. Attaching the comparison results. pipeline_results_agc_vnr.xlsx
AGC_VNR_THRESHOLD of 0.5 does better than 0.8, however not as well as the VAD results.
I picked one test case, InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dBTake1.wav, for which VNR_threshold_0.5 (83 keywords) detects 5 fewer keywords than VAD (88 keywords). I've attached the pipeline output files that were sent to Sensory for both. VAD_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav.txt [VNR_0.5_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav.txt](https://github.com/xmos/sw_avona/files/8804268/VNR_0.5_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dBTake1.wav.txt)
On running both these files through Sensory, the keywords at the following time instances are not detected when using VNR.
186210 186765 (1.00 sv) alexa
195555 196155 (1.00 sv) alexa
209700 210240 (1.00 sv) alexa
261030 261600 (1.00 sv) alexa
429000 429645 (1.00 sv) alexa
VNR_keywords_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav.txt VAD_keywords_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav.txt
I reran the pipeline test on a trimmed version (5 minutes long) of this stream and plotted the VNR and VAD gains, as well as the AGC output gain when using VNR and VAD.
Command to plot: python compare_vad_vnr.py --stdo pipeline_stdo_trim_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav.txt
pipeline_stdo_trim_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav.txt
Looking more closely at the plots, VNR misses a keyword at around 182 seconds (output_vnr_pred goes only up to 0.4), and the AGC gain drops to around 654 then. However, it does detect the next 4 keywords but the gain continues to be low. For the VAD though, the gain falls to 654 at the 182 seconds mark but goes back up during the next 4 keywords. On talking to @Allan-xmos, we think, to start, we need to find out why the gain is not following the VNR keyword detections. AGC needs to be tuned for VNR now.
Helpful presentation from Dan, https://xmosjira.atlassian.net/wiki/spaces/HYD/pages/1818034280/AGC+Loss+Control+-+Design+and+Tuning
Since the gain had fallen and was not going back up for VNR, I experimented by changing agc_config gain_dec from 0.87 to 0.99 to make the gain go down in smaller steps, and the number of keywords detected increased from 50 to 53 (compared to 54 with VAD).
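To see why a gain_dec closer to 1 slows the fall, here is a rough sketch. It assumes gain_dec is applied as a per-frame multiplicative factor while the gain is decreasing; that is an assumption about the AGC's behavior, not something confirmed in this thread. The starting gain of 1000 is also an assumption; 654 is the level the gain dropped to in the plots. The helper name is hypothetical.

```python
def frames_to_reach(start_gain, target_gain, gain_dec):
    """Count frames until the gain falls to target_gain or below.

    Assumes the gain is multiplied by gain_dec once per frame while it is
    decreasing (an assumption about the AGC, for illustration only).
    """
    if not 0.0 < gain_dec < 1.0:
        raise ValueError("gain_dec must be between 0 and 1 (exclusive)")
    frames = 0
    gain = start_gain
    while gain > target_gain:
        gain *= gain_dec
        frames += 1
    return frames


# Assumed starting gain of 1000; 654 is the level seen in the plots.
frames_fast = frames_to_reach(1000.0, 654.0, 0.87)
frames_slow = frames_to_reach(1000.0, 654.0, 0.99)

print("gain_dec 0.87:", frames_fast, "frames")
print("gain_dec 0.99:", frames_slow, "frames")
```

Under these assumptions the gain reaches 654 in 4 frames with gain_dec = 0.87 and in 43 frames with gain_dec = 0.99, so a single missed keyword pulls the gain down far less before the next detection.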
I ran full pipeline tests overnight with this change. The keywords with VNR match VAD for this particular stream, but some other streams show far fewer keywords detected, particularly the InHouse_XVF3510v080_v1.2_20190423_Loc1_Clean_XMOS_DUT1_80dB_Take1.wav stream, which has 12 fewer keywords in alt arch when compared to VAD. pipeline_results_agc_vnr.xlsx
It seems that for streams with higher noise level, Sensory likes the voice level to be louder while for streams with low noise levels, Sensory likes voice level to be quieter.
Eg. For the InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav stream, where there's more noise in the output, AGC gain close to 1000 does well. pipeline_stdo_trim_InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_70dB__Take1.wav.txt.50.txt
For the InHouse_XVF3510v080_v1.2_20190423_Loc1_Clean_XMOS_DUT1_80dB_Take1.wav stream, when processed with the alt arch pipeline, there's a very low noise level and VNR outputs really sharp peaks. For this stream, an AGC gain of about 600 gets the best kwd score. pipeline_stdo_InHouse_XVF3510v080_v1.2_20190423_Loc1_Clean_XMOS_DUT1_80dB_Take1.wav.98.txt compare_vad_vnr.py.zip
The AGC seems tuned for VAD output and the false positives in VAD in high noise streams help in pulling up the gain and improving Sensory kwd score. We don't see similar false positives with VNR, so the gain not being set as high during the voice portions deteriorates the kwd score. I tried tuning the AGC to improve kwd scores for the noisy streams but those changes cause kwd scores in clean streams to go down.
In the voice pipeline performance meeting on 01/06, it was decided that the Acoustics team will take a look at adapting AGC to work with VNR.
I'm putting this issue on hold for now.
Results with Amazon WWE.
AGC+VNR with voice detection threshold set to 0.5
Alt-arch agc_vnr_0_5.csv
Prev-arch agc_vnr_0_5_prev_arch.csv
In general, Prev-arch with VNR+AGC is better than VAD+AGC.
Alt-arch has the following streams where VNR+AGC has a lower kwd score than VAD+AGC.

| Input | develop_results_Avona_alt_arch_xcore_Amazon_WR_250k.en-US | agc_vnr_0.5_results_Avona_alt_arch_xcore_Amazon_WR_250k.en-US | Diff, agc+vnr vs vad+agc |
| -- | -- | -- | -- |
| InHouse_XVF3510v080_v1.2_20190423_Loc1_Clean_XMOS_DUT1_80dB_Take1.wav | 98 | 97 | -1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc1_Clean_XMOS_DUT1_90dB_Take1.wav | 23 | 21 | -2 |
| InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise1_80dB__Take1.wav | 88 | 87 | -1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc1_Noise2_80dB__Take1.wav | 30 | 29 | -1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc2_Clean_XMOS_DUT1_90dB_Take1.wav | 14 | 13 | -1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc3_Clean_XMOS_DUT1_80dB_Take1.wav | 97 | 96 | -1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc3_Clean_XMOS_DUT1_90dB_Take1.wav | 18 | 15 | -3 |
| InHouse_XVF3510v080_v1.2_20190423_Loc3_Noise1_80dB__Take1.wav | 93 | 90 | -3 |
| InHouse_XVF3510v080_v1.2_20190423_Loc3_Noise2_65dB__Take1.wav | 99 | 98 | -1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc3_Noise2_70dB__Take1.wav | 93 | 92 | -1 |
| InHouse_XVF3510v080_v1.2_20190423_Loc3_Noise2_80dB__Take1.wav | 38 | 37 | -1 |
I have calc_vnr_pred() ported to IC now and I've added calls to it in both the single and multi threaded pipeline examples. Instead of the VAD output, I'm passing output_vnr_pred to AGC. All other pipeline code is unchanged. I ran the pipeline quick tests on my machine and one of the tests fails, with 19 keywords detected against a pass threshold of 21. With VAD being sent to AGC, 21 keywords are detected. There are 25 keywords in this stream, spaced at roughly 5 second intervals, with the first keyword at the ~3 second mark.
I have plotted the VAD output, output_vnr_pred and input_vnr_pred for every frame.
Also attached the output wav files (extension changed to .txt to attach) output_vnr_16.wav.txt output_vad_16.wav.txt
Sensory output with VAD
./spot_eval_exe/spot-eval_x86_64-apple-darwin -t model/spot-alexa-rpi-31000.snsr /Users/shuchitak/sandboxes/sw_avona_vnr/sw_avona/examples/bare-metal/pipeline_alt_arch/output_vad_16.wav
7995 8580 alexa
12645 13245 alexa
22035 22620 alexa
26790 27465 alexa
31515 32175 alexa
36240 36900 alexa
41010 41700 alexa
45765 46275 alexa
50415 50910 alexa
55080 55620 alexa
59565 60165 alexa
64140 64695 alexa
68775 69330 alexa
73425 73950 alexa
78030 78525 alexa
91845 92430 alexa
96540 97125 alexa
101220 101745 alexa
105885 106620 alexa
110670 111270 alexa
115335 115965 alexa
Sensory output with VNR
7980 8580 alexa
12645 13245 alexa
22035 22620 alexa
31515 32175 alexa
36240 36885 alexa
41010 41700 alexa
45765 46275 alexa
50415 50910 alexa
55080 55620 alexa
59565 60165 alexa
64140 64710 alexa
68775 69330 alexa
73425 73950 alexa
78030 78525 alexa
91845 92430 alexa
96540 97125 alexa
105885 106620 alexa
110670 111270 alexa
115335 115965 alexa
With VNR, the keywords at 26790 and 101220 ms are not detected.
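The comparison above can be automated with a small script. This is a sketch only: the helper names and the 100 ms matching tolerance are my own choices, and it assumes the Sensory output is a sequence of "start end keyword" entries in ms, optionally with a "(score sv)" field in between. The detection data is copied from the two outputs above.

```python
import re

# Matches "start end keyword", with an optional "(1.00 sv)"-style field.
DETECTION = re.compile(r"(\d+)\s+(\d+)\s+(?:\([^)]*\)\s+)?([A-Za-z_]+)")


def parse_detections(text):
    """Parse Sensory-style output into (start_ms, end_ms, keyword) tuples."""
    return [(int(s), int(e), w) for s, e, w in DETECTION.findall(text)]


def missed_detections(reference, candidate, tolerance_ms=100):
    """Return reference detections with no candidate start within tolerance."""
    starts = [start for start, _, _ in candidate]
    return [
        det for det in reference
        if not any(abs(det[0] - s) <= tolerance_ms for s in starts)
    ]


VAD_OUTPUT = """
7995 8580 alexa 12645 13245 alexa 22035 22620 alexa 26790 27465 alexa
31515 32175 alexa 36240 36900 alexa 41010 41700 alexa 45765 46275 alexa
50415 50910 alexa 55080 55620 alexa 59565 60165 alexa 64140 64695 alexa
68775 69330 alexa 73425 73950 alexa 78030 78525 alexa 91845 92430 alexa
96540 97125 alexa 101220 101745 alexa 105885 106620 alexa
110670 111270 alexa 115335 115965 alexa
"""

VNR_OUTPUT = """
7980 8580 alexa 12645 13245 alexa 22035 22620 alexa 31515 32175 alexa
36240 36885 alexa 41010 41700 alexa 45765 46275 alexa 50415 50910 alexa
55080 55620 alexa 59565 60165 alexa 64140 64710 alexa 68775 69330 alexa
73425 73950 alexa 78030 78525 alexa 91845 92430 alexa 96540 97125 alexa
105885 106620 alexa 110670 111270 alexa 115335 115965 alexa
"""

vad = parse_detections(VAD_OUTPUT)
vnr = parse_detections(VNR_OUTPUT)
missed = missed_detections(vad, vnr)

print(len(vad), "keywords with VAD,", len(vnr), "with VNR")
for start, end, word in missed:
    print("missed with VNR:", start, end, word)
```

Matching on start time with a tolerance, rather than on exact equality, keeps detections such as 7995 (VAD) and 7980 (VNR) paired even though their boundaries differ by a few ms.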
Next steps: