Closed amoussawi closed 7 years ago
I am not sure I understand your questions clearly. I guess you may be confused about input channels and output channels. k*(c+4) is the number of output channels. The input channel count of a kernel depends on the feature map the kernel is applied to. Besides, we use multiple feature maps to make predictions. Please check the paper for more details.
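To make the channel arithmetic concrete, here is a minimal sketch using the figures from the question (k = 6 default boxes, c = 20 classes; the conv4_3 channel count below is just an illustrative example of an input feature map):

```python
# Per-location prediction channels for one feature map:
# each of the k default boxes needs c class scores and 4 box offsets,
# so the prediction kernel has k * (c + 4) OUTPUT channels.
def prediction_channels(k, c):
    return k * (c + 4)

# Figures from the question: k = 6 default boxes, c = 20 classes.
print(prediction_channels(6, 20))  # -> 144

# The kernel's INPUT channel count is not part of this formula: it is
# simply the channel count of whatever feature map the 3x3 kernel is
# applied to (e.g. 512 for a conv4_3-style layer).
```

Note that this is computed separately for each prediction feature map, since k (and the input channels) can differ per layer.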
3x3 might see more context of the underlying objects.
Yes, there are. We do pad. I think the net learns to distinguish regions outside of an image.
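The effect of padding on corner coverage can be sketched with the standard convolution output-size formula (a minimal illustration; the pad/stride values are the usual ones for a same-size 3x3 convolution, not read from any specific layer):

```python
# Output spatial size of a convolution over an n x n feature map.
def conv_output_size(n, kernel=3, pad=1, stride=1):
    return (n + 2 * pad - kernel) // stride + 1

# With pad=1, a 3x3 prediction kernel keeps the map size, so corner
# locations of an 8x8 feature map still get their own predictions.
print(conv_output_size(8))  # -> 8
```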
We use a 0.5 overlap threshold to determine true positives. The rest are negatives. We do hard negative mining to select some of them as negative samples.
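A minimal sketch of hard negative mining as described in the SSD paper: after matching, sort the unmatched default boxes by confidence loss and keep only the highest-loss ones, up to a 3:1 negative:positive ratio (the helper name and inputs here are illustrative, not the repo's actual API):

```python
# Hard negative mining with a 3:1 negative:positive ratio.
# `conf_loss` is a per-default-box confidence loss; `is_positive`
# marks boxes matched to a ground truth (jaccard overlap > 0.5).
def hard_negative_mining(conf_loss, is_positive, neg_pos_ratio=3):
    num_pos = sum(is_positive)
    # Candidate negatives, hardest (highest confidence loss) first.
    negatives = sorted(
        (i for i, pos in enumerate(is_positive) if not pos),
        key=lambda i: conf_loss[i],
        reverse=True,
    )
    # Keep only the hardest negatives, at most neg_pos_ratio per positive.
    return set(negatives[: neg_pos_ratio * num_pos])

# One positive -> at most three hardest negatives are kept.
print(hard_negative_mining([0.1, 2.0, 0.5, 3.0, 0.2],
                           [True, False, False, False, False]))  # -> {1, 2, 3}
```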
Thank you!
Hello @weiliu89, I want to make sure that I understood the paper correctly, because that part wasn't very clear (to me at least).
The kernel sizes of the prediction feature layers are 256, 128, 1024, and 512, while the paper specifies that the number of filters is k(c+4), for k default bounding boxes and c classes. For SSD300x300, only 6(20+4)=144 are needed in the case of 6 default bounding boxes and 20 classes. What are the remaining filters used for? Are they also used for predictions but applied to different feature maps, or do they generate feature maps?
Are you using 3x3xp convolutional filters for predictions because the default bounding boxes span a 3x3 tile on the feature maps? That's what I noticed in the 4x4 and 8x8 feature map examples.
Are there default bounding boxes that cover the corners of the feature maps, or do you apply padding to avoid missing the feature map corners?
If every possible match between the default bounding boxes and some ground-truth box has less than 0.5 jaccard overlap, are all of them discarded, even the one with the highest overlap, as if the ground-truth box weren't there at all?
Thanks in advance :)
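For reference on the last question, the jaccard overlap (intersection over union) between two boxes in (xmin, ymin, xmax, ymax) form can be computed as follows (a minimal sketch, not the repo's implementation):

```python
# Jaccard overlap (IoU) of two boxes in (xmin, ymin, xmax, ymax) form.
def jaccard_overlap(a, b):
    # Width and height of the intersection rectangle (0 if disjoint).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(jaccard_overlap((0, 0, 2, 2), (1, 1, 3, 3)))  # -> 1/7 ≈ 0.1429
```

Note also that the paper's matching strategy first matches each ground-truth box to the default box with the best jaccard overlap, and only then matches default boxes to any ground truth with overlap above 0.5, so every ground truth gets at least one matched default box.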