jihwan722 opened 11 months ago
I mean the conv_layers in the classifier shown in the picture below. Are those in the code you provided?
Dear @jihwan722, Before the decoder, there is a convLSTM encoder. Table II refers to the output after the encoder. Did you include it in the pipeline?
I mean that achieving a size of 64×37×59 (Conv Block 0's output size) from a 32×112×176 input is impossible with a 3x3 conv layer with stride 1 and 2x2 max pooling with stride 2. Could you please clarify this?
@jihwan722 , please use a 3x3 conv layer with stride 3 to shrink the input size from 112x176 to 37x59 first. The classifier code is not included in main_outside.py.
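For what it's worth, the stride-3 sizes can be checked with the standard output-size formula, out = floor((in + 2*pad - kernel) / stride) + 1. This is a minimal sketch, not code from the repository; note that with no padding the width comes out as 58 rather than 59, and only reaches 59 with one pixel of padding (which in turn would make the height 38), so the exact 37x59 seems to require asymmetric padding or a similar detail:

```python
# Minimal sketch (not repository code): the standard conv output-size formula,
# out = floor((in + 2*pad - kernel) / stride) + 1.
def conv_out(size, kernel=3, stride=1, pad=0):
    """Spatial output size of a conv (or pooling) layer."""
    return (size + 2 * pad - kernel) // stride + 1

# 3x3 conv with stride 3 on a 112x176 input, no padding:
print(conv_out(112, kernel=3, stride=3))         # 37
print(conv_out(176, kernel=3, stride=3))         # 58, not 59
# With one pixel of padding the width reaches 59 (but height would be 38):
print(conv_out(176, kernel=3, stride=3, pad=1))  # 59
```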
Thanks for your reply!! I'd like to ask more about fusing the two modules, i.e., the inside module and the outside module.
First, the inside images are 25 fps and the outside images are 30 fps. Also, when we build the trainloader we use training_data from dataset.py, but training_data_inside has 1 n_samples per video while training_data_outside has 10 n_samples per video.
So we want to fuse the two modules, but the numbers of samples differ.
Can we get the classifier module code? gns453@kookmin.ac.kr
How can we combine the two modules?
> @jihwan722 , please use a 3x3 conv layer with stride 3 to shrink the input size from 112x176 to 37x59 first. The classifier code is not included in main_outside.py.
After using a 3x3 conv layer with stride 3 (no pooling), we apply a 3x3 conv layer with stride 1 and 2x2 max pooling with stride 2, and get 17x28. That differs from the 12x20 reported in the paper. Please let me know the exact method. @yaorong0921
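The 17x28 result above does follow from the usual output-size formula, floor((n + 2p - k)/s) + 1, assuming no padding at each step. This is a quick sketch to verify the arithmetic, not code from the repository:

```python
# Sketch verifying the size computation: floor((n + 2*pad - k) / stride) + 1.
def out_size(n, k, s, p=0):
    return (n + 2 * p - k) // s + 1

h, w = 37, 59
# 3x3 conv, stride 1, no padding:
h, w = out_size(h, 3, 1), out_size(w, 3, 1)  # 35, 57
# 2x2 max pool, stride 2:
h, w = out_size(h, 2, 2), out_size(w, 2, 2)  # 17, 28
print(h, w)  # 17 28
```

So reproducing the paper's 12x20 from 37x59 would need a different kernel, stride, or padding choice than the one stated here.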
Dear @johook, did you get 37x59 first after the 3x3 conv layer with stride 3? Do you mean that you could not reproduce the dimensions given in Table 2?
Dear @jihwan722 , When fusing, the inside and outside videos are fed into two different networks, which may take different numbers of frames. As described in Section IV-B (outside video), the input frames are all taken from the time period before second T, sampled at an interval L=5. In Section IV-C (in-cabin video), a 16-frame clip before second T is used. The classifier can be understood as follows: each branch takes its own frames before second T and computes a feature; then the two features are fused together, as described in Table 2.
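The fusion idea above can be sketched in a few lines: each branch consumes its own number of frames but emits a fixed-size feature vector, so the frame-rate and sample-count mismatch disappears at the feature level. This is a toy illustration, not the authors' code; the random projection, the toy frame sizes, and the feature dimension of 64 are all made up for the example:

```python
# Hedged sketch (not the paper's implementation) of late fusion:
# each branch maps a clip of arbitrary length to a fixed-size feature,
# and the two features are concatenated for the classifier.
import numpy as np

def branch_feature(frames, feat_dim=64, seed=0):
    """Stand-in for a branch network: temporal average, then a (random,
    purely illustrative) linear projection to a fixed feat_dim."""
    rng = np.random.default_rng(seed)
    pooled = frames.mean(axis=0)                       # temporal average
    proj = rng.standard_normal((pooled.size, feat_dim))
    return pooled.reshape(-1) @ proj                   # shape: (feat_dim,)

# Outside branch: frames before second T sampled at interval L=5 (Sec. IV-B).
outside_clip = np.zeros((6, 8, 8))    # e.g. 6 sampled frames (toy size)
# Inside branch: a 16-frame clip before second T (Sec. IV-C).
inside_clip = np.zeros((16, 8, 8))

f_out = branch_feature(outside_clip)
f_in = branch_feature(inside_clip)
fused = np.concatenate([f_out, f_in])  # shape (128,), fed to the classifier
print(fused.shape)
```

The key point is that fusion happens on per-clip features, so the two branches never need to agree on frame rate or clip length.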
We are running main_inside.py and main_outside.py, and we want to know whether the Conv_layer-related code in main_outside.py matches the picture.
If not, achieving a size of 64×37×59 from a 3×112×176 input is impossible with a 3x3 conv layer with stride 1 and 2x2 max pooling with stride 2.
Please provide an answer. Thank you.