How is the number of objects per class in a single image 8732?

weiliu89 / caffe

Caffe: a fast open framework for deep learning.

http://caffe.berkeleyvision.org/

How is the number of objects per class in a single image 8732? #375

Open Walid-Ahmed opened 7 years ago

Walid-Ahmed commented 7 years ago

How does the number of objects per class in a single image equal 8732? I understand we have 4 aspect ratios in the 8 by 8 grid and 4 aspect ratios in the 4 by 4 grid.

So I calculated the number as 8x8x4 + 4x4x4 = 320.

Walid

ByeonghakYim commented 7 years ago

You need to consider two different scales for the 1x1 (square) aspect ratio. Then you get 5776 from 38x38, 2166 from 19x19, 600 from 10x10, 150 from 5x5, 36 from 3x3 and 4 from 1x1, for a total of 8732.
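The arithmetic can be checked with a short script. This is only a sketch: the grid sizes come from this comment, and the boxes-per-cell values (4, 6, 6, 6, 4, 4) are inferred from the per-layer counts quoted here, not read from the repository's model definition. The variable names are made up for illustration.

```python
# (grid side, default boxes per cell) for each prediction layer.
# Boxes-per-cell values are inferred from the counts quoted in this thread.
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

# Each cell of a side x side grid contributes `boxes` default boxes.
per_layer = [side * side * boxes for side, boxes in feature_maps]
total = sum(per_layer)

for (side, boxes), count in zip(feature_maps, per_layer):
    print(f"{side}x{side} grid, {boxes} boxes per cell: {count}")
print("total default boxes per class:", total)
```

The per-layer counts match the figures above and sum to 8732.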

Walid-Ahmed commented 7 years ago

@ByeonghakYim Thanks a lot. When you mention 38x38, is this the grid size? If so, does this mean that the 8 by 8 and 4 by 4 grids mentioned in the paper are only examples, and that the real implementation uses 38x38, 19x19, 5x5, 3x3 and 1x1 grids? I am sorry if I am missing how it really works!

Walid

weiliu89 commented 7 years ago

38x38 is the grid size. The 8x8 and 4x4 grids in Figure 1 are only for illustration purposes.

villanuevab commented 7 years ago

@ByeonghakYim, @weiliu89 thank you for the clarification. Why do the 38x38, 3x3, and 1x1 feature maps only have 4 anchor boxes per feature map cell, when the paper implies that all layers should have 6?
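For reference, here is a rough sketch of where the 4 and 6 come from, assuming the default box formula as I recall it from the SSD paper: width = s * sqrt(a) and height = s / sqrt(a) for each aspect ratio a, plus one extra square box at scale sqrt(s_k * s_{k+1}). The function name `default_box_shapes` and the scale values 0.2 and 0.37 are placeholders for illustration, not values taken from this repository.

```python
import math

def default_box_shapes(scale, next_scale, aspect_ratios):
    """Return (width, height) pairs for one feature map cell, relative to the image size."""
    # One box per aspect ratio at the layer's own scale.
    shapes = [(scale * math.sqrt(ar), scale / math.sqrt(ar)) for ar in aspect_ratios]
    # One extra square box at the geometric mean of this scale and the next one.
    extra = math.sqrt(scale * next_scale)
    shapes.append((extra, extra))
    return shapes

# Placeholder scales; aspect ratios {1, 2, 1/2} give 4 boxes per cell.
four = default_box_shapes(0.2, 0.37, [1, 2, 1 / 2])
# Adding aspect ratios 3 and 1/3 gives 6 boxes per cell.
six = default_box_shapes(0.2, 0.37, [1, 2, 3, 1 / 2, 1 / 3])
print(len(four), len(six))
```

Under these assumptions, dropping aspect ratios 3 and 1/3 is what reduces a layer from 6 boxes per cell to 4.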

wk910930 commented 7 years ago

@villanuevab there is a similar question at https://github.com/weiliu89/caffe/issues/316.

villanuevab commented 7 years ago

@wk910930 yes, the reasoning in that answer (given by @weiliu89):

conv4_3 is much larger than other layers, using 4 on conv4_3 is to avoid having too many default bboxes

makes sense for conv4_3, since it is the largest feature map used for prediction, i.e., it would have many default bboxes. But what about the 3x3 and 1x1 feature maps? Perhaps at this scale it does not make sense to have too many default bboxes either, since the features would be of such high dimension that the extra 2 aspect ratios would make minimal difference, i.e., would not add much in terms of capturing additional features.

@weiliu89 is this intuition correct?