weiliu89 / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

default bounding boxes and predictions convolutional filters #321

Closed amoussawi closed 7 years ago

amoussawi commented 7 years ago

Hello @weiliu89, I want to make sure that I understood the paper correctly, because this part wasn't very clear (to me at least).

  1. The kernel sizes of the prediction feature layers are 256, 128, 1024, and 512, while the paper specifies that the number of filters equals k(c+4) for k default bounding boxes and c classes. For SSD300x300 with 6 default bounding boxes and 20 classes, only 6(20+4)=144 are needed. What are the remaining filters used for? Are they also used for predictions but applied to different feature maps? Or do they generate feature maps?

  2. Are you using 3x3xp convolutional filters for predictions because the default bounding boxes span a 3x3 tile on the feature maps? That's what I noticed in the 4x4 and 8x8 feature map examples.

  3. Are there default bounding boxes that cover the corners of the feature maps? Or do you apply padding to avoid missing the feature map corners?

  4. If every possible match between the default bounding boxes and some ground-truth box has a Jaccard overlap below 0.5, are all of them discarded, even the one with the highest overlap, as if the ground-truth box weren't there at all?

Thanks in advance :)

weiliu89 commented 7 years ago
  1. I am not sure I understand your question clearly. I guess you may be confused by the input and output channels. k*(c+4) is the number of output channels. The number of input channels of a kernel depends on the feature map that the kernel is applied to. Besides, we use multiple feature maps to make predictions. Please check the paper for more details.

  2. 3x3 might see more context of the underlying objects.

  3. Yes, there are. We do pad. I think the net learns to distinguish regions outside of an image.

  4. We use 0.5 to determine true positives. The rest are negatives. We do hard negative mining to select some as negative samples.
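For readers following answer 4, here is a minimal sketch of the threshold step only: computing the Jaccard overlap between default boxes and one ground-truth box, then marking overlaps above 0.5 as positives. The function names, the `(xmin, ymin, xmax, ymax)` box layout, and the sample boxes are assumptions made for illustration; this is not the repository's matching code, and hard negative mining is not shown.

```python
def jaccard(box_a, box_b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Width and height of the intersection rectangle, clamped at zero.
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_defaults(default_boxes, gt_box, threshold=0.5):
    """Return True for each default box whose overlap with gt_box exceeds threshold."""
    return [jaccard(d, gt_box) > threshold for d in default_boxes]


# Hypothetical boxes in normalized image coordinates.
gt = (0.0, 0.0, 1.0, 1.0)
defaults = [(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 0.9), (2.0, 2.0, 3.0, 3.0)]
print(label_defaults(defaults, gt))
```

Default boxes marked False here are the candidates from which negative samples would be selected.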

amoussawi commented 7 years ago
  1. No, I know the difference between the two. The thing is that I thought that, for example, "conv6_2" was doing the predictions, but after I visualized the SSD300 network on Netscope, I noticed that the output of "conv6_2" is passed to "conv6_2_mbox_loc" and "conv6_2_mbox_conf", which apparently do the predictions. Their output sizes are 24 and 126, respectively, and that makes sense now.

Thank you!
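The numbers above can be checked with a little arithmetic. The sketch below assumes 6 default boxes per location and 21 confidence scores per box (20 classes plus a background class, which is what makes 126 = 6 x 21 work out). The helper names are made up for illustration, and the 512 input channels used for "conv6_2" is an assumption, not something stated in this thread.

```python
def mbox_output_channels(num_boxes, num_classes_with_background):
    """Output channels of the localization and confidence prediction layers."""
    loc = num_boxes * 4                             # 4 box offsets per default box
    conf = num_boxes * num_classes_with_background  # one score per class per box
    return loc, conf


def conv_weight_shape(in_channels, out_channels, kernel=3):
    """Weight shape of a conv layer: (out, in, kernel_h, kernel_w)."""
    return (out_channels, in_channels, kernel, kernel)


loc, conf = mbox_output_channels(num_boxes=6, num_classes_with_background=21)
print(loc, conf, loc + conf)  # 24 126 150

# Assumed: the feature map feeding the prediction layers has 512 channels.
print(conv_weight_shape(in_channels=512, out_channels=loc))
```

So k*(c+4) describes the combined output of the two "mbox" layers, while the input channel count comes from whichever feature map they are applied to.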