zylo117 / Yet-Another-EfficientDet-Pytorch

The pytorch re-implement of the official efficientdet with SOTA performance in real time and pretrained weights.
GNU Lesser General Public License v3.0
5.2k stars 1.27k forks

Anchor boxes exceeding the input #314

Open mnslarcher opened 4 years ago

mnslarcher commented 4 years ago

Hi @zylo117 ,

How does this library handle the case where an anchor box exceeds the input? Keep the intersection?

For example, following the rule: anchor_boxes = (anchor_boxes_width, anchor_boxes_height), anchor_boxes[i] = anchor_scale * anchor_ratios[i] * (2 ** pyramid_level) * base_scale

with base_scale = 4, anchor_scale = 1.0 and anchor_ratios = (1.0, 1.0),

at pyramid level 3, anchor_boxes will be (32, 32).

At pyramid level 7, anchor_boxes will be (512, 512), which means that only the central anchor box is completely included in the input.

If the anchor_ratios are (1.4, 0.7), some part of the anchor will always go outside the input (with level = 7).
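As a minimal sketch of that size rule (variable names here are illustrative, not the library's own):

```python
# Sketch of the anchor-size rule discussed above; names are illustrative.
base_scale = 4.0
anchor_scales = (1.0,)          # per-location scale multipliers
anchor_ratios = ((1.0, 1.0),)   # (width_ratio, height_ratio) pairs

for pyramid_level in range(3, 8):          # P3 .. P7
    stride = 2 ** pyramid_level
    for scale in anchor_scales:
        for w_ratio, h_ratio in anchor_ratios:
            w = base_scale * stride * scale * w_ratio
            h = base_scale * stride * scale * h_ratio
            print(f"P{pyramid_level}: anchor {w:.0f} x {h:.0f}")
# P3 -> 32 x 32, ..., P7 -> 512 x 512 for a ratio of (1.0, 1.0)
```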

If the answer is that we keep the intersection, what happens when the intersection of two different anchors with the input is the same? Imagine one anchor box of (512, 512) and another of (1000, 1000): after taking the intersection with the image, they are both the same.

Best, Mario

zylo117 commented 4 years ago

Anchors don't really exceed the input, because anchors are only a concept, guides that help the bboxes regress better. So there won't be any intersection, because they don't really exist and don't crop the input into pieces.

zylo117 commented 4 years ago

And for the second question: if there are two anchors, 512 and 1000, they might cover the same area if the input size is small enough, but the regression values will be totally different. And if the anchors are too different from the gt boxes, the bbox head will have a harder time regressing. So the best way to solve this is to set appropriate anchors according to the input size and the gt box sizes.
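To illustrate why the regression values differ, here is a small sketch using the common (dx, dy, dw, dh) box encoding in center/width/height form; the function name and the exact encoding used by the repo are assumptions and may differ in detail:

```python
import math

def encode(anchor, gt):
    """(dx, dy, dw, dh) encoding: offsets of the ground truth relative to the
    anchor, normalized by the anchor size. Boxes are (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw, (gy - ay) / ah,
            math.log(gw / aw), math.log(gh / ah))

gt = (256, 256, 400, 400)             # a large ground-truth box on a 512x512 input
anchor_512 = (256, 256, 512, 512)
anchor_1000 = (256, 256, 1000, 1000)

print(encode(anchor_512, gt))         # (0.0, 0.0, -0.247, -0.247)
print(encode(anchor_1000, gt))        # (0.0, 0.0, -0.916, -0.916)
# Both anchors cover the whole image, but the regression targets differ.
```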

mnslarcher commented 4 years ago

Thanks @zylo117, my doubt arises from the fact that (correct me if I'm wrong) we assign labels to anchor boxes based on their IoU with the ground truth boxes.

For example, for RetinaNet (and maybe also EfficientDet), if an anchor has an IoU < 0.4 with every ground truth box, it's assigned to the background. For this reason some anchors will be systematically assigned to the background if their size is too big.
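As a sketch of that assignment rule (the 0.4/0.5 thresholds follow RetinaNet; the function and its defaults are illustrative, not this repo's code):

```python
import numpy as np

def assign_anchors(ious, pos_thresh=0.5, neg_thresh=0.4):
    """ious: (num_anchors, num_gt) IoU matrix.
    Returns 1 for positive anchors, 0 for background, -1 for ignored."""
    max_iou = ious.max(axis=1)                        # best GT match per anchor
    labels = np.full(len(ious), -1, dtype=np.int64)   # ignored by default
    labels[max_iou < neg_thresh] = 0                  # background
    labels[max_iou >= pos_thresh] = 1                 # foreground
    return labels
```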

But, as you said, this probably never happens if one chooses reasonable values.

The only point that still causes me some confusion is that for EfficientDet-D0 with ratio 1:1 we have three anchors that are larger than (or equal to) the input in both directions (one for every anchor scale in [2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)]). In my understanding this is redundant, given that at most a ground truth box will be as large as the input.
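For reference, the three square P7 anchor sizes for a 512 input work out as follows (a quick check, not code from the repo):

```python
# Square anchor sizes at P7 for EfficientDet-D0 (base_scale = 4, stride = 2**7 = 128).
scales = [2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)]
sizes = [4 * 128 * s for s in scales]
print([round(s) for s in sizes])   # [512, 645, 813] -- all >= the 512x512 input
```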

zylo117 commented 4 years ago

For P7, yes, the anchors are probably bigger than the input, but for P3-P6 you still need those two anchor scales to fit more targets.

mnslarcher commented 4 years ago

> For P7, yes, the anchors are probably bigger than the input, but for P3-P6 you still need those two anchor scales to fit more targets.

Thanks @zylo117, it makes sense. In my opinion it is not unreasonable to think that "anchor pruning" at certain levels could be useful, reducing the computational cost without affecting the results, but the advantage is probably too small to be worth the time to implement it (or maybe I'm still missing some important detail). This pruning should obviously depend on the ratios and sizes of the anchors.

zylo117 commented 4 years ago

That's an interesting point. It's possible to set different scales and ratios on different levels.

But then again, the number of anchors decreases exponentially with the level. Let's say you remove those two redundant scales from P7 of efficientdet d0; then the anchors of P7 drop from 4 * 4 * 3 * 3 (144) to 4 * 4 * 1 * 3 (48). However, there are 49104 anchors in total, so 96 fewer anchors shouldn't make any difference.
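Those counts can be reproduced with a quick calculation (assuming a 512 x 512 input and 3 scales x 3 ratios per location):

```python
input_size = 512
num_scales, num_ratios = 3, 3

total = 0
for level in range(3, 8):                      # P3 .. P7
    feat = input_size // (2 ** level)          # feature-map side length
    n = feat * feat * num_scales * num_ratios
    total += n
    print(f"P{level}: {feat}x{feat}x{num_scales}x{num_ratios} = {n}")
print("total:", total)                         # 49104
# Dropping 2 of the 3 scales at P7 only removes 4*4*2*3 = 96 anchors.
```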

mnslarcher commented 4 years ago

Yes, that probably doesn't have any impact. To have a real saving, one would need to understand whether there are "dead" anchors at all the levels and whether removing part of them leaves the performance basically unchanged.

The nice thing is that part of this analysis can be done without even training the model, but just by studying how many ground truth boxes have an IoU > user_threshold as the set of anchor boxes at the different levels changes.
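A rough sketch of that kind of offline check (the helper names and the 0.5 threshold here are assumptions, not code from the repo):

```python
import numpy as np

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU between two sets of [x1, y1, x2, y2] boxes."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def coverage(gt_boxes, anchors, threshold=0.5):
    """Fraction of ground-truth boxes matched by at least one anchor."""
    best_iou = iou_matrix(gt_boxes, anchors).max(axis=1)
    return (best_iou >= threshold).mean()
```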

Cli98 commented 4 years ago

> Yes, that probably doesn't have any impact. To have a real saving, one would need to understand whether there are "dead" anchors at all the levels and whether removing part of them leaves the performance basically unchanged.
>
> The nice thing is that part of this analysis can be done without even training the model, but just by studying how many ground truth boxes have an IoU > user_threshold as the set of anchor boxes at the different levels changes.

@mnslarcher Actually, you will find something more interesting. For example, anchors with negative values.

Ekta246 commented 3 years ago

Correct me if I am wrong!

So the EfficientDet-D0 model requires (512, 512) as the input image size. My dataset consists of images of size (600, 800) (H * W).

So 1) I resize the original image (600 * 800) to (512, 512) and scale the ground truth bounding boxes accordingly. While observing the bbox prediction output, I see that the bbox coordinates are higher than 512.

For example, considering the bbox in [x1, y1, x2, y2] format, I observe an output bbox like [0, 0, 700, 600]. This means the output bbox corresponds to the original image and not to the resized (512, 512) image.

2) I see that out of 100 boxes, 80 are in the way discussed above. The shocking part is that these boxes have high classification scores, above 0.77.

Am I understanding the concept correctly? Any help appreciated!

zylo117 commented 3 years ago

@Ekta246 Do you mean the boxes right after nms or the ones after invert_affine? Anyway, the predicted data will be decoded and clipped using this module. https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/8f45451c81673fecf51af39537c319b7f9d20521/efficientdet/utils.py#L38-L52

But before clipping, the coordinates may be outside of the resized image. And then, after clipping, the box coordinates in the resized image will be transformed to match the original image.
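The two steps amount to something like the following simplified sketch; this is not the exact code in efficientdet/utils.py, and the back-transform only mirrors the idea of the invert_affine step mentioned above:

```python
import torch

def clip_boxes(boxes, height, width):
    """Clamp [x1, y1, x2, y2] boxes (shape [N, 4]) to the resized input image."""
    boxes[:, 0].clamp_(min=0)            # x1
    boxes[:, 1].clamp_(min=0)            # y1
    boxes[:, 2].clamp_(max=width - 1)    # x2
    boxes[:, 3].clamp_(max=height - 1)   # y2
    return boxes

def to_original_scale(boxes, resize_ratio):
    """Map boxes from the resized image back to the original image
    (simplified; ignores padding offsets)."""
    return boxes / resize_ratio
```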

Ekta246 commented 3 years ago

> @Ekta246 Do you mean the boxes right after nms or the ones after invert_affine? Anyway, the predicted data will be decoded and clipped using this module. https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/8f45451c81673fecf51af39537c319b7f9d20521/efficientdet/utils.py#L38-L52
>
> But before clipping, the coordinates may be outside of the resized image. And then, after clipping, the box coordinates in the resized image will be transformed to match the original image.

I meant the final predictions (after nms). Also, does the input image to the EfficientDet-D0 model have to be 512 * 512? What if the original image size in my dataset is (800, 600)? Neither side is divisible by 128. Should I consider resizing it to a square image?

zylo117 commented 3 years ago

@Ekta246 No. It will work as long as both width and height are divisible by 128 and not less than 128.
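For an (800, 600) image that means padding or resizing to the nearest valid size rather than forcing a square; a quick sketch of that rounding (the helper name is illustrative):

```python
import math

def valid_size(side, stride=128):
    """Round a side length up to the nearest multiple of the stride (>= stride)."""
    return max(stride, int(math.ceil(side / stride)) * stride)

print(valid_size(800), valid_size(600))   # 896 640 -> pad an (800, 600) image to (896, 640)
```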

Ekta246 commented 3 years ago

So I resize and pad the image and scale the bboxes accordingly. But maybe I should do the resizing and padding beforehand, in my dataset itself, instead of applying resize transforms. Maybe the model still takes the original image. I will check that out and come back to you.

zylo117 commented 3 years ago

@Ekta246 You don't need to transform your images beforehand.