weiliu89 / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

questions about model's translation-variance and ground truth bbox matching rate #200

Open gnoynait opened 8 years ago

gnoynait commented 8 years ago

Hi Wei,

I have spent quite some time on your SSD code, and I have some questions about the implementation.

  1. Do the training image size and testing image size have to be the same?
  2. How do you guarantee that each ground truth matches at least one prior bbox (with an IoU greater than 0.5)?

As for question 1, I think the center of the prior bbox is not calculated correctly, which makes the model translation-variant. It may cause problems when the training and testing images have different sizes.

As for question 2, I have run some experiments. It turns out that many ground truth bboxes have a maximum IoU below 0.5 with every prior bbox (a sketch of the check is below). Training the model with random aspect ratios relieves the problem. Do you have any other method to solve it?
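A minimal sketch of this kind of check, assuming boxes in (xmin, ymin, xmax, ymax) form (the helper names are illustrative, not from the SSD code):

```python
def iou(box_a, box_b):
    # Intersection-over-union of two axis-aligned boxes.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def max_iou_per_gt(gt_boxes, prior_boxes):
    # For each ground truth box, the best IoU achieved over all prior boxes;
    # values below 0.5 are the problematic cases described above.
    return [max(iou(g, p) for p in prior_boxes) for g in gt_boxes]
```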

Thank you!

weiliu89 commented 8 years ago
  1. Yes. The training and testing image sizes have to be the same. For example, if you train on 300x300, then the maximum prior box is around 300, so it won't do well on 500x500 images (see the sketch at the end of this comment). Can you elaborate on why the center of the prior box is not calculated correctly? I haven't paid much attention to the details of the padding/pooling effect.
  2. Do you mean "with an IoU less than 0.5"? It is done here (bipartite matching).

Are the ground truth boxes that have less than 0.5 IoU with all prior boxes usually small?
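For point 1 above, a rough sketch of why the largest prior box tracks the input size, using the s_k scale formula from the SSD paper (the s_min/s_max values here are the paper's defaults, not necessarily the released config):

```python
# s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1): prior box scales grow
# linearly with the layer index, as a fraction of the input image size.
def prior_scales(img_size, num_layers, s_min=0.2, s_max=0.9):
    return [(s_min + (s_max - s_min) * k / (num_layers - 1)) * img_size
            for k in range(num_layers)]

print(prior_scales(300, 6))
# [60.0, 102.0, 144.0, 186.0, 228.0, 270.0] -- the largest prior is near the
# 300 px input, so the priors cannot cover large objects in a 500x500 image.
```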

gnoynait commented 8 years ago

Thanks, Wei,

  1. Let's take conv8_2 of SSD300 as an example. In the following picture, the white area is the input image, the blue dots are the centers of the grid cells' receptive fields, and the red dots are the anchor centers in the SSD implementation. Suppose we are going to predict the position of a dot lying at the center of a receptive field: the model has to predict a relative coordinate, shown as an arrow, with respect to the anchor center. Because the distance between two anchor centers differs from the distance between two receptive field centers, the relative coordinate changes as the target dot moves from one receptive field to another; note the arrows pointing in different directions. [figure: ssd_center] To fix this problem, you can either set the anchor center to the receptive field center, or set the anchors the way Faster R-CNN does. In Faster R-CNN, as shown in the second picture, although the anchor centers are not at the receptive field centers, the distance between two anchor centers equals the distance between two receptive field centers. As a result, the relative position stays constant as the target dot moves from one receptive field center to another, and the offset can be learned as a bias by the model (see the sketch after this list). [figure: faster_rcnn]
  2. I think the bipartite matching in the implementation also has a problem. In bipartite matching, for each prior bbox, a ground truth bbox is assigned to it if it has the largest IoU among all unmatched ground truths. Once a ground truth bbox has been matched to a prior bbox, it won't be considered by other prior bboxes. In the following picture, there are four prior bboxes (prior1 to prior4) and two ground truth bboxes (object1 and object2). Clearly we should match prior2 to object1 and prior4 to object2. However, in bipartite matching, when matching prior1 we find object1 is the best match, so we match them; when matching prior2, although object1 has a higher IoU, object1 has already been matched to prior1, so prior2 ends up matching object2. [figure: ssd_matching] A better way is to find the best prior bbox for each ground truth bbox. This is exactly the second step in PER_PREDICTION matching, as long as the IoU is above 0.5 or the IoU threshold is set to 0.0, so I just removed the bipartite matching step in PER_PREDICTION. However, in my experiments most IoUs are below 0.5. If we simply set a smaller threshold, like 0.1, a ground truth may be matched to an unsuitable prior bbox, for example one whose grid's receptive field does not cover the ground truth bbox. So it is crucial to guarantee that prior bboxes and ground truth bboxes have higher IoUs.
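To make point 1 concrete, here is a small sketch of the two tiling schemes, using the conv8_2 numbers from this thread (the half-cell offsets are illustrative; the point is the spacing between centers):

```python
# SSD divides the image evenly among the feature map cells, while an RPN-style
# scheme steps by the true feature stride, so adjacent anchor centers are
# exactly one receptive-field stride apart. Numbers: conv8_2 of SSD300
# (3x3 feature map, feature stride 128), as discussed above.
img_size, fmap_size, feat_stride = 300, 3, 128

ssd_centers = [(i + 0.5) * img_size / fmap_size for i in range(fmap_size)]
rpn_centers = [i * feat_stride + feat_stride / 2.0 for i in range(fmap_size)]

print(ssd_centers)  # [50.0, 150.0, 250.0] -> spacing 100
print(rpn_centers)  # [64.0, 192.0, 320.0] -> spacing 128, matching the stride
```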
weiliu89 commented 8 years ago

@gnoynait Thanks for the explanation. How do you calculate the center of the receptive field? It might be easy to derive for VGG, but what about an Inception-style network? I haven't checked the details of how RPN places the anchor boxes. I tiled the prior boxes in a very simple and naive way, which, as you pointed out, seems not to be optimal. Do you see any improvement if you fix the prior box placement? I am wondering if that is more of an issue for small prior boxes (e.g. those on conv4_3).

2) It seems reasonable. I guess your dataset has many small objects? One possible way to solve it is to place small prior boxes more densely. Again, after you fix the bipartite matching bug, do you see any improvements?

Thanks again for spending time explaining these in detail. I really appreciate it!

weiliu89 commented 8 years ago

@gnoynait

I looked at the code, and I don't think your point 2) is correct. From here, these two loops try to find the best match among all possible remaining <prior box, ground truth> pairs. I don't think your second plot is correct. The code might not be the most efficient bipartite matching, but it should be correct. Correct me if I am wrong.
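In sketch form, the matching described above behaves like this (Python pseudocode rather than the actual C++ implementation; iou() is a standard IoU helper like the one sketched earlier in the thread):

```python
# Repeatedly pick the globally best <prior, ground truth> pair among the
# *remaining* ones, so an earlier prior cannot "steal" a ground truth from a
# better-matching prior -- which is why the scenario in plot 2 cannot happen.
def bipartite_match(priors, gts):
    match = {}                        # prior index -> ground truth index
    free_p = set(range(len(priors)))
    free_g = set(range(len(gts)))
    while free_p and free_g:
        p, g = max(((p, g) for p in free_p for g in free_g),
                   key=lambda pg: iou(priors[pg[0]], gts[pg[1]]))
        match[p] = g
        free_p.discard(p)
        free_g.discard(g)
    return match
```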

weiliu89 commented 8 years ago

@gnoynait

I also spent some time looking in detail at how Faster R-CNN implements its anchor boxes. Specifically, generate_anchors.py generates a set of anchors and anchor_target_layer.py places those anchors at each cell of a feature map.

The only difference I see is that RPN's anchor boxes are centered at the top-left corner of each cell, while SSD's default boxes are centered at the center of each cell (see the sketch below). I don't think I agree with your plot 1.
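For reference, the tiling in anchor_target_layer.py is roughly as follows (a sketch following the py-faster-rcnn code, not a verbatim copy):

```python
import numpy as np

# A fixed set of base anchors, defined on the first (top-left) cell, is shifted
# by feat_stride to every other cell, so anchor centers are spaced exactly one
# feature stride apart.
def tile_anchors(base_anchors, feat_h, feat_w, feat_stride=16):
    shift_x = np.arange(feat_w) * feat_stride
    shift_y = np.arange(feat_h) * feat_stride
    sx, sy = np.meshgrid(shift_x, shift_y)
    shifts = np.stack([sx.ravel(), sy.ravel(), sx.ravel(), sy.ravel()], axis=1)
    # (num_cells, 1, 4) + (1, num_base, 4) -> one anchor set per cell
    return (shifts[:, None, :] + base_anchors[None, :, :]).reshape(-1, 4)
```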

gnoynait commented 8 years ago

Hi, Wei,

  1. The most important difference is the distance between two anchor centers, which you missed. In Faster R-CNN, the anchor center distance is set to the feature stride, which is given as a layer parameter. Faster R-CNN uses conv5_3, whose feature stride is 16, so the parameter is set to 16. In SSD, the distance is set to 300 / 18 = 16.6, which may not seem serious. But look at a higher layer: conv8_2's feature stride is 128, yet in SSD the anchor center distance is calculated as 300 / 3 = 100, which is far from 128.
  2. Obviously, I made a mistake when reading the BIPARTITE matching code. My concern is how to get higher IoUs when setting the prior bboxes. In my experiments, I find the IoU can be as low as 0.1 for some ground truths.

For the receptive field center calculation, you can refer to the tutorial (see the sketch below). The formula does not support dilated convolution, but that case should be easy to derive. For a multi-branch network such as Inception, the receptive field center is guaranteed to be the same in every branch, and the receptive field size is the maximum over the branches.
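In sketch form, the per-layer bookkeeping from that tutorial looks like this (a minimal version; dilated convolutions are not handled, as noted above):

```python
# For each layer (kernel k, stride s, padding p), track:
#   j     -- the jump / effective stride: input-pixel distance between the
#            receptive field centers of two adjacent features
#   r     -- the receptive field size of one feature
#   start -- the center coordinate of the first feature's receptive field
def receptive_field(layers, j=1, r=1, start=0.5):
    for k, s, p in layers:
        start += ((k - 1) / 2.0 - p) * j
        r += (k - 1) * j
        j *= s
    return j, r, start

# Example: three 3x3/stride-2/pad-1 convs -> centers every 8 px, starting at 0.5
print(receptive_field([(3, 2, 1)] * 3))  # (8, 15, 0.5)
```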

weiliu89 commented 8 years ago

Hi Yong,

Thanks for the explanation.

1) I actually noticed this. But as the layers become coarser and coarser, the stride doesn't strictly follow the 2x rule anymore, right? For example, conv7_2's feature map is 5x5 and conv8_2's is 3x3. I can use a 3x3 kernel with stride 2 and pad 1 to get conv8_2 from conv7_2, but I can also use a 3x3 kernel with stride 1 and pad 0 to get a conv8_2 of the same size. What should the stride be in the latter case? Would it be the same as conv7_2's, that is, 64? (See the sketch at the end of this comment.) I probably need to handle it more carefully; do you have any suggestions? It is not a problem for Faster R-CNN, since the feature map it uses is still relatively large. I think for large objects the difference is relatively small, and SSD seems to do well on large objects.

Besides, do you think it is problematic to offset the center of the default box to the center of a cell instead of the top-left corner?

2) What are the ground truth boxes that have low (0.1) IoU? Are they small? Theoretically, the tiling of default boxes is better than the one in RPN. Faster R-CNN has an advantage because ROI pooling can help classify objects better. But maybe with a better placement of default boxes w.r.t. the receptive field of a kernel, SSD can have the same advantage as ROI pooling.
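As a quick check of the stride question in 1) (the sketch mentioned there), using the standard convolution output-size formula:

```python
# Output size of a conv layer: floor((n + 2p - k) / s) + 1. Both configurations
# below turn a 5x5 conv7_2 into a 3x3 conv8_2, but only the stride-2 one
# doubles the effective stride ("jump") in input pixels.
def out_size(n, k, s, p):
    return (n + 2 * p - k) // s + 1

print(out_size(5, 3, 2, 1))  # 3 -- 3x3 kernel, stride 2, pad 1: jump 64 -> 128
print(out_size(5, 3, 1, 0))  # 3 -- 3x3 kernel, stride 1, pad 0: jump stays 64
```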

weiliu89 commented 8 years ago

On the other hand, Faster R-CNN can only use a high-resolution feature map; otherwise, ROI pooling will have problems (many boxes will collapse into a single bin).