csukuangfj opened this issue 2 years ago
Thanks for your kind reminder.
The project is mainly based on https://github.com/jingyonghou/KWS_Max-pooling_RHE, which was published as "Mining Effective Negative Training Samples for Keyword Spotting". The author is jingyonghou, who holds a PhD and is an expert in this area; please see https://scholar.google.com/citations?user=vqrIi3wAAAAJ&hl=en&oi=ao. He is also the core designer and developer of the wenet-kws project.
The term TCN was introduced in "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling" and open-sourced at http://github.com/locuslab/TCN.
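For readers who have not seen that paper, here is a minimal sketch of the dilated causal convolution that TCNs stack (illustrative only; the class name and shapes are mine, not the wenet-kws code — see the repo above for the reference implementation):

```python
import torch
import torch.nn as nn

class CausalDilatedConv1d(nn.Module):
    """One dilated causal convolution layer, the basic TCN building block.

    Illustrative sketch only; see http://github.com/locuslab/TCN for the
    reference implementation.
    """

    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        # Left-pad so the output at time t depends only on inputs up to t.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); the output keeps the same time length.
        x = nn.functional.pad(x, (self.pad, 0))
        return torch.relu(self.conv(x))
```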
Max-pooling training was proposed in "Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting", and we actually refined the loss into a max-min-pooling loss in our implementation.
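A minimal sketch of the plain max-pooling loss idea (the max-min-pooling refinement and the details in the paper are omitted; the tensor shapes and function name are assumptions, not the wenet-kws code):

```python
import torch
import torch.nn.functional as F

def max_pooling_loss(logits: torch.Tensor, is_keyword: torch.Tensor) -> torch.Tensor:
    """Sketch of max-pooling loss training for KWS.

    logits:     (batch, time) per-frame keyword scores (pre-sigmoid).
    is_keyword: (batch,) 1.0 for keyword utterances, 0.0 for negatives.

    The highest-scoring frame of each utterance is picked: for positives
    it is pushed towards 1; for negatives, the hardest frame is pushed
    towards 0. Illustrative only.
    """
    peak = logits.max(dim=1).values  # (batch,)
    return F.binary_cross_entropy_with_logits(peak, is_keyword)
```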
We will add the above references later. However, we did not refer to anything from inside Mobvoi.
I did not mean that you are using the names TCN and max-pooling loss.
What I mean is that the network architecture and parameters, i.e., the values of the dilation rates, the channel sizes, the stride, and the number of layers, are all the same. My former colleague ran lots of experiments to find those values. As far as I know, this network architecture and these parameters had been in use long before jingyonghou started his internship at Mobvoi.
Is the model architecture exactly the same as yours?
I don't think so; we did a lot of experiments with different model sizes. Please refer to "Hello Edge: Keyword Spotting on Microcontrollers" and our experiments: the model size varies according to the hardware and the scenario. All the hyperparameters you mentioned above are tuned according to the model size.
And all the parameters can be easily changed in the YAML configuration file; please see https://github.com/wenet-e2e/wenet-kws/blob/master/examples/hi_xiaowen/s0/conf/mdtc.yaml for a demo.
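For example, a script can read every hyperparameter from that file rather than hard-coding it (a sketch; the key layout inside mdtc.yaml is an assumption, so check the linked file):

```python
import yaml

# Load the MDTC hyperparameters (channels, kernel size, dilations, ...)
# from the YAML config instead of hard-coding them in the model.
with open("conf/mdtc.yaml") as f:
    conf = yaml.safe_load(f)

print(conf)
```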
> The TCN model and the max-pooling loss are basically the same as the one used inside Mobvoi. Also, one of the contributors did his internship at Mobvoi. I would recommend adding acknowledgement to Mobvoi in README.md.
Thanks for your kind reminder again.
TCN (actually, TCN is something similar to TDNN) is widely used in KWS tasks.
And there is a series of works using max-pooling for KWS, including "Mining Effective Negative Training Samples for Keyword Spotting", which was done during my internship at Mobvoi and the University of Washington. If you have read the above-mentioned paper, you will find that the one implemented here is a special case of the paper.
You can find the paper here: http://lxie.nwpu-aslp.org/papers/2020ICASSP_HJY.pdf. All the parameters can be found in the paper.
The dilation rates (1,2,4,8,1,2,4,8) and channel sizes are very common settings. The stride is 1, so there is nothing to discuss there. Given the above dilation setting, it is straightforward to choose 4 or 8 layers.
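As a sanity check on why that schedule is natural: with stride 1, the receptive field of a stack of dilated convolutions is 1 + (kernel_size - 1) * sum(dilations). A quick computation (kernel size 5 is an assumed value for illustration only; the real one is in mdtc.yaml):

```python
# Receptive field of a stack of stride-1 dilated convolutions:
#   rf = 1 + (kernel_size - 1) * sum(dilations)
dilations = (1, 2, 4, 8, 1, 2, 4, 8)
kernel_size = 5  # assumed value, for illustration only

rf = 1 + (kernel_size - 1) * sum(dilations)
print(rf)  # 1 + 4 * 30 = 121 frames
```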
Also, we have thanked Cui Fan and Shen Li in the paper for their valuable suggestions on this work.