zhufengx / SRN_multilabel

The setting of the first sub-network f_att in the training script #15

KongMingxi opened this issue 6 years ago

KongMingxi commented 6 years ago

Hi, thank you very much for your code. I'm interested in this paper. While reading your code, I ran into two questions about the setting of the first sub-network f_att.

As described in the second paragraph of Section 3.2, the attention estimator f_att is modeled as 3 convolution layers with 512 kernels of 1×1, 512 kernels of 3×3, and C kernels of 1×1, respectively. Reading the training prototxt "step_2_resnet101_att_trainval.prototxt", I believe att.att_Conv_1, att.att_Conv_2, and att.att_Conv_3 correspond to these three convolution layers, and that the output of caffe.Eltwise_304 or att.ReLU_2 corresponds to the feature map from "res4b22_relu" described in Section 3.1. However, there are also att.Conv_3, att.Conv_6, att.Conv_9, att.Conv_13, att.Conv_16, and att.Conv_19 below (at the bottom of) att.att_Conv_1. I would like to know which part of the paper they correspond to and what their function is.
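To make sure I am reading Section 3.2 correctly, my understanding of those three layers is roughly the sketch below. The layer names are placeholders I made up, not the actual names in the released prototxt, the interleaved ReLUs are my assumption, and C is the number of labels (I use C = 80, as for MS-COCO, as an example):

```
# f_att as described in Sec. 3.2: 1x1 (512 kernels) -> 3x3 (512 kernels) -> 1x1 (C kernels)
layer {
  name: "att_estimator_conv1"    # placeholder for att.att_Conv_1
  type: "Convolution"
  bottom: "res4b22_relu"         # 14x14x1024 feature map from ResNet-101 (Sec. 3.1)
  top: "att_estimator_conv1"
  convolution_param { num_output: 512 kernel_size: 1 }
}
layer { name: "att_estimator_relu1" type: "ReLU" bottom: "att_estimator_conv1" top: "att_estimator_conv1" }
layer {
  name: "att_estimator_conv2"    # placeholder for att.att_Conv_2
  type: "Convolution"
  bottom: "att_estimator_conv1"
  top: "att_estimator_conv2"
  convolution_param { num_output: 512 kernel_size: 3 pad: 1 }
}
layer { name: "att_estimator_relu2" type: "ReLU" bottom: "att_estimator_conv2" top: "att_estimator_conv2" }
layer {
  name: "att_estimator_conv3"    # placeholder for att.att_Conv_3; outputs C attention maps
  type: "Convolution"
  bottom: "att_estimator_conv2"
  top: "att_estimator_conv3"
  convolution_param { num_output: 80 kernel_size: 1 }   # assuming C = 80 (MS-COCO)
}
```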

My other question is about "conv1" in Fig. 2. It is modeled as a convolution layer with C kernels of size 1×1, yet I think att.feat_Conv_1 and att.feat_Conv_2 together correspond to "conv1". Why are two convolution layers with 1×1 kernels used for "conv1"?
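For reference, Fig. 2 reads to me like a single 1×1 convolution, roughly as sketched below (layer and blob names are placeholders, and I again use C = 80 as an example), whereas the prototxt stacks att.feat_Conv_1 and att.feat_Conv_2:

```
# "conv1" of Fig. 2 as I read it: one 1x1 convolution producing C confidence maps
layer {
  name: "conv1_score"            # placeholder name, not from the released prototxt
  type: "Convolution"
  bottom: "att_feature"          # placeholder for the 14x14 feature map feeding conv1
  top: "conv1_score"
  convolution_param { num_output: 80 kernel_size: 1 }   # assuming C = 80 (MS-COCO)
}
```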

Thank you!

zhufengx commented 6 years ago

Hi @KongMingxi, thank you for reading our paper. Indeed, there are some small mismatches between the paper and the released code, but they do not affect the main framework; they were omitted from the paper simply to keep the writing concise. More detailed explanations:

(1) For f_att, the two additional res-blocks (att.Conv_3 through att.Conv_19) are used to learn better 14×14 feature maps, specifically to obtain a larger receptive field and higher-level feature representations. The proposed attention module relies on these improved feature maps for attention and confidence map generation.

(2) For "conv1", your understanding is absolutely right. In our early experiments we compared both designs (one vs. two convolution layers for "conv1") and did not notice any difference.
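To make (1) concrete, each of these extra blocks is simply a bottleneck residual block sitting between "res4b22_relu" and att.att_Conv_1. A rough sketch of one such block is below; the layer names and channel counts are illustrative rather than copied from the released prototxt, and BatchNorm/Scale layers are omitted for brevity:

```
# Sketch of one extra bottleneck res-block feeding the attention estimator.
# The released prototxt stacks two such blocks (att.Conv_3 ... att.Conv_19)
# before att.att_Conv_1.
layer { name: "att_blk_conv1" type: "Convolution" bottom: "res4b22_relu" top: "att_blk_conv1"
        convolution_param { num_output: 256 kernel_size: 1 } }        # channel counts are assumed
layer { name: "att_blk_relu1" type: "ReLU" bottom: "att_blk_conv1" top: "att_blk_conv1" }
layer { name: "att_blk_conv2" type: "Convolution" bottom: "att_blk_conv1" top: "att_blk_conv2"
        convolution_param { num_output: 256 kernel_size: 3 pad: 1 } } # keeps the 14x14 resolution
layer { name: "att_blk_relu2" type: "ReLU" bottom: "att_blk_conv2" top: "att_blk_conv2" }
layer { name: "att_blk_conv3" type: "Convolution" bottom: "att_blk_conv2" top: "att_blk_conv3"
        convolution_param { num_output: 1024 kernel_size: 1 } }
layer { name: "att_blk_sum" type: "Eltwise" bottom: "res4b22_relu" bottom: "att_blk_conv3"
        top: "att_blk_sum" eltwise_param { operation: SUM } }         # identity shortcut
layer { name: "att_blk_relu3" type: "ReLU" bottom: "att_blk_sum" top: "att_blk_sum" }
# "att_blk_sum" (still 14x14) then feeds att.att_Conv_1 instead of "res4b22_relu" directly.
```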

KongMingxi commented 6 years ago

Thank you very much for your prompt reply.