NickleDave closed this issue 2 years ago
should check ASR lit specifically
@yardencsGitHub looking more at relevant lit.
This paper from MARBL describes adaptive pooling operators for weakly-supervised sound event detection: https://arxiv.org/pdf/1804.10070.pdf. We are not in a weakly-supervised setting, but their "related work" section seems like a good place to start.
Noticing they cite another paper from Parascandolo (we cite their BiLSTM paper): https://arxiv.org/pdf/1702.06286.pdf. We should definitely cite this, and you and I should discuss the experiments in it together. Note they propose a "frequency max pooling" layer. I think we in effect have something similar, because our pooling size is (8, 1) and so is our stride, so we pool across frequency bins only.
so I think a simple experiment would be to change the pooling size so it also spans the time domain, e.g., use a pooling window of size (8, 8). I predict this would impair accuracy and thus demonstrate (in a post-hoc way :innocent: :grimacing: ) why we chose this pooling operation
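To make the shape argument concrete, here is a minimal sketch of 2-D max pooling in plain Python. It is not the network's code: the function `max_pool_2d`, the toy 16 x 24 "spectrogram", and the no-padding behavior are all assumptions chosen for illustration. It only shows that a (8, 1) window with a matching stride keeps every time bin, while (8, 8) keeps one in eight.

```python
def max_pool_2d(x, pool, stride):
    """Max-pool a 2-D list x (frequency bins x time bins) without padding.

    Hypothetical helper for illustration only; not taken from the model code.
    """
    n_freq, n_time = len(x), len(x[0])
    pool_f, pool_t = pool
    stride_f, stride_t = stride
    out = []
    for f0 in range(0, n_freq - pool_f + 1, stride_f):
        row = []
        for t0 in range(0, n_time - pool_t + 1, stride_t):
            # Collect every value inside the current pooling window.
            window = [
                x[f][t]
                for f in range(f0, f0 + pool_f)
                for t in range(t0, t0 + pool_t)
            ]
            row.append(max(window))
        out.append(row)
    return out


# Toy input: 16 frequency bins x 24 time bins (sizes are arbitrary).
spect = [[f * 24 + t for t in range(24)] for f in range(16)]

freq_only = max_pool_2d(spect, pool=(8, 1), stride=(8, 1))
freq_time = max_pool_2d(spect, pool=(8, 8), stride=(8, 8))

print(len(freq_only), len(freq_only[0]))  # 2 24: all time bins kept
print(len(freq_time), len(freq_time[0]))  # 2 3: time axis downsampled by 8
```

With per-time-bin labels, the second case would leave fewer output columns than labels, which is one reason to expect it to hurt.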
if you agree I can add "change pooling size" issues with experiment labels
The pooling step, as implemented in TensorFlow, downsamples each dimension it pools over, effectively losing resolution there. So a pooling size of 1 temporal bin was chosen to avoid losing temporal resolution. This can probably be changed by setting the stride separately from the pooling size.
This adds to the set of experiments we want to run. We should first run a pilot with BF to make sure it is merited.
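As a rough check on the stride point, the sketch below computes the pooled length along one axis. It assumes the output-size rules that TensorFlow documents for "VALID" and "SAME" padding; the helper `pooled_length` and the 24-bin example are hypothetical and not part of our code.

```python
import math


def pooled_length(n_bins, pool, stride, padding="valid"):
    """Length of one axis after pooling (hypothetical helper).

    Assumes TensorFlow-style conventions:
      "same":  ceil(n_bins / stride)
      "valid": floor((n_bins - pool) / stride) + 1
    """
    if padding == "same":
        return math.ceil(n_bins / stride)
    return (n_bins - pool) // stride + 1


n_time = 24  # arbitrary number of time bins

# Window of 8 time bins, stride equal to the window: resolution is lost.
print(pooled_length(n_time, pool=8, stride=8))                  # 3

# Same window, stride 1, no padding: only the edges are lost.
print(pooled_length(n_time, pool=8, stride=1))                  # 17

# Same window, stride 1, "same" padding: every time bin is kept.
print(pooled_length(n_time, pool=8, stride=1, padding="same"))  # 24
```

So a pooling window that spans time does not have to reduce the number of time bins, as long as the time stride is 1 and the input is padded.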
@yardencsGitHub I have added language in the "proposed method" section of the introduction as well as in the Methods section that provides details about the pooling operation and cites relevant literature
I'm leaving this open for now because I think we could modify the diagram if possible to make the pooling shape / stride explicit
Revised language about this and summed up in response letter. I still think we could revise the figure to better show this but am closing this issue for now.
check literature for maxpool v global + average