csukuangfj opened this issue 2 years ago
Thanks for your kind reminder.
The project is mainly based on https://github.com/jingyonghou/KWS_Max-pooling_RHE, which was published as "Mining Effective Negative Training Samples for Keyword Spotting". The author is jingyonghou, who holds a PhD and is an expert in this area; please see https://scholar.google.com/citations?user=vqrIi3wAAAAJ&hl=en&oi=ao. He is also the core designer and developer of the wenet-kws project.
The term TCN was introduced in "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling" and open-sourced at http://github.com/locuslab/TCN.
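For readers who have not seen that paper, here is a minimal sketch of the dilated causal convolution that TCNs stack (illustrative only; the class name and shapes are mine, not the wenet-kws code — see the repo above for the reference implementation):

```python
import torch
import torch.nn as nn

class CausalDilatedConv1d(nn.Module):
    """One dilated causal convolution layer, the basic TCN building block.

    Illustrative sketch only; see http://github.com/locuslab/TCN for the
    reference implementation.
    """

    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        # Left-pad so the output at time t depends only on inputs up to t.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); the output keeps the same time length.
        x = nn.functional.pad(x, (self.pad, 0))
        return torch.relu(self.conv(x))
```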
Max-pooling training was proposed in "Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting", and we actually refined the loss into a max-min-pooling loss in our implementation.
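A minimal sketch of the plain max-pooling loss idea (the max-min-pooling refinement and the details in the paper are omitted; the tensor shapes and function name are assumptions, not the wenet-kws code):

```python
import torch
import torch.nn.functional as F

def max_pooling_loss(logits: torch.Tensor, is_keyword: torch.Tensor) -> torch.Tensor:
    """Sketch of max-pooling loss training for KWS.

    logits:     (batch, time) per-frame keyword scores (pre-sigmoid).
    is_keyword: (batch,) 1.0 for keyword utterances, 0.0 for negatives.

    The highest-scoring frame of each utterance is picked: for positives
    it is pushed towards 1; for negatives, the hardest frame is pushed
    towards 0. Illustrative only.
    """
    peak = logits.max(dim=1).values  # (batch,)
    return F.binary_cross_entropy_with_logits(peak, is_keyword)
```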
We will add the above references later. However, we did not refer to anything from inside Mobvoi.
I did not mean that you are using the names TCN and max-pooling loss.
What I mean is that the network architecture and parameters, i.e., the values of the dilation rates, the channel sizes, the stride, and the number of layers, are all the same. My former colleague ran lots of experiments to find those values. As far as I know, this network architecture and these parameters had been in use long before jingyonghou started his internship at Mobvoi.
Is the model architecture exactly the same as yours?
I don't think so; we did a lot of experiments with different model sizes. Please refer to "Hello Edge: Keyword Spotting on Microcontrollers" and our experiments: the model size varies according to the hardware and the scenario. All the hyperparameters you mentioned above are tuned according to the model size.
And all the parameters can be easily changed in the YAML configuration file; please see https://github.com/wenet-e2e/wenet-kws/blob/master/examples/hi_xiaowen/s0/conf/mdtc.yaml for a demo.
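For example, a script can read every hyperparameter from that file rather than hard-coding it (a sketch; the key layout inside mdtc.yaml is an assumption, so check the linked file):

```python
import yaml

# Load the MDTC hyperparameters (channels, kernel size, dilations, ...)
# from the YAML config instead of hard-coding them in the model.
with open("conf/mdtc.yaml") as f:
    conf = yaml.safe_load(f)

print(conf)
```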
> The TCN model and the max-pooling loss are basically the same as the one used inside Mobvoi. Also, one of the contributors did his internship at Mobvoi. I would recommend adding acknowledgement to Mobvoi in README.md.
Thanks for your kind reminder again.
TCN (actually, TCN is something similar to TDNN) is widely used in KWS tasks.
And there is a series of works using max-pooling for KWS, including "Mining Effective Negative Training Samples for Keyword Spotting", which was done during my internship at Mobvoi and the University of Washington. If you have read the above-mentioned paper, you will find that the one implemented here is a special case of the paper.
You can find the paper here: http://lxie.nwpu-aslp.org/papers/2020ICASSP_HJY.pdf. All the parameters can be found in the paper.
The dilation rates (1,2,4,8,1,2,4,8) and channel sizes are very common settings. The stride is 1, so there is nothing to discuss there. Given the above dilation setting, it is straightforward to choose 4 or 8 layers.
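As a sanity check on why that schedule is natural: with stride 1, the receptive field of a stack of dilated convolutions is 1 + (kernel_size - 1) * sum(dilations). A quick computation (kernel size 5 is an assumed value for illustration only; the real one is in mdtc.yaml):

```python
# Receptive field of a stack of stride-1 dilated convolutions:
#   rf = 1 + (kernel_size - 1) * sum(dilations)
dilations = (1, 2, 4, 8, 1, 2, 4, 8)
kernel_size = 5  # assumed value, for illustration only

rf = 1 + (kernel_size - 1) * sum(dilations)
print(rf)  # 1 + 4 * 30 = 121 frames
```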
Also, we have thanked Cui Fan and Shen Li in the paper for their valuable suggestions on this work.