ylabbe / cosypose

Code for "CosyPose: Consistent multi-view multi-object 6D pose estimation", ECCV 2020.
MIT License

Question about data augmentation & each object results #1

Closed: taeyeopl closed this issue 4 years ago

taeyeopl commented 4 years ago

Thanks for sharing the good work. I have some simple questions about your method.

[Data Augmentation] Q1. Can I ask about the effect of data augmentation on the YCB dataset (Table 1)? On T-LESS, removing it shows a significant decrease in performance (37% vs. 64%).
Have you compared the data-augmentation performance on YCB?

Q2. Data augmentation shows a much bigger difference than the other factors (e.g. loss, network, rotation). I couldn't find an explanation for why this happens in the paper. Can I ask your opinion?

Q3. (Appendix A, Experimental findings) On YCB-Video, as I understood it, the pre-training phase included data augmentation. Is that right?

[Each Object results] Q4. If possible, could you share the results for each object (e.g. chef_can, cracker)?

ylabbe commented 4 years ago

Hi @trevor-taeyeop,

Thanks for your interest in the paper.

Q1. I haven't done a proper data-augmentation ablation on YCB-Video using this version of the method. However, I remember trying it without data augmentation a while ago while developing, and the difference was far smaller on YCB-Video than on T-LESS. I think this is to be expected since we also use the real training images on YCB-Video. The YCB-Video training images are real images with multiple objects in real scenes, whereas the "real" images of T-LESS show isolated objects on a black background. The sim2real gap that needs to be bridged is bigger on T-LESS.

Q2. In general, from what I have observed, training data + data augmentation is one of the biggest factors in the performance of a 6D pose estimation method. This is what this ablation shows. Note also that we used our own synthetic images for the experiments in the paper (we used the PBR ones in the BOP challenge). I think the impact of training images + data augmentation is sometimes omitted in publications, and it was important to me to show its importance in the paper even though it makes other components look less important.

Performing iterative refinement (our method is inspired by DeepIM) is also very important and significantly improves results, as already shown by DeepIM. It is true that we don't provide results with/without refinement on T-LESS in the paper, but we probably should; it can easily be obtained from the code and the provided pre-trained models, and I may add it in a future version of the paper. It would be interesting to do an exact one-to-one comparison of the other components with DeepIM in the same training setting (data, data augmentation, etc.), but AFAIK (I may be wrong) DeepIM did not release all the code and models needed to reproduce its YCB-Video results. At this stage, it's also hard to do a proper exact comparison with my code, as my implementation of iterative refinement has many technical differences with DeepIM and is not based on its code at all; we only used the idea of the paper.
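For readers unfamiliar with the DeepIM-style render-and-compare idea mentioned above, here is a minimal conceptual sketch of an iterative refinement loop. It is not CosyPose's actual implementation; `renderer` and `update_net` are hypothetical placeholders for an object renderer and a learned pose-update network.

```python
import torch

def refine_pose(observed_rgb, T_init, renderer, update_net, n_iters=4):
    """Conceptual DeepIM-style refinement loop (not CosyPose's actual code).

    observed_rgb : (3, H, W) tensor, cropped RGB observation of the object.
    T_init       : (4, 4) tensor, initial object pose in the camera frame.
    renderer     : hypothetical callable rendering the object at a given pose,
                   returning a (3, H, W) tensor.
    update_net   : hypothetical network predicting a (4, 4) pose correction
                   from the stacked (observed, rendered) images.
    """
    T = T_init
    for _ in range(n_iters):
        rendered_rgb = renderer(T)                      # render current estimate
        x = torch.cat([observed_rgb, rendered_rgb], 0)  # (6, H, W) network input
        delta_T = update_net(x)                         # predicted correction
        T = delta_T @ T                                 # apply the update
    return T
```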

Q3. Yes, all models (paper + BOP challenge) are trained using the same data augmentation. You can check this file, which defines the configuration of all models trained in the paper. The pre-trained model trained only on our synthetic images is trained with the flag --config ycbv-refiner-syntonly (see the section on reproducing single-view results), so you can check the definition of that configuration.

Q4. Yes, I can share it and provide tables when I have more time, probably later today or later this week. If you really want it now, you can check the results in results/ycbv-n_views=1--51549711/results.pth.tar (see the section on single-view results in the README); the results are in this file, though they may be a little hard to find as the data structure is non-trivial. I will also provide results of the models trained on PBR + real images for the BOP challenge.
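For reference, a minimal sketch of how one might open that results file and inspect its structure with PyTorch (the exact keys inside are specific to this repository and are not shown here):

```python
import torch

# Load the results file mentioned above and print its top-level structure.
# The keys and nesting are specific to this repository, so this only helps
# you start exploring the (non-trivial) data structure.
results = torch.load('results/ycbv-n_views=1--51549711/results.pth.tar',
                     map_location='cpu')
print(type(results))
if isinstance(results, dict):
    for key, value in results.items():
        print(key, type(value))
```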

kirumang commented 4 years ago

Hi Yann, since great discussions are happening here, I would love to join too!

I missed your poster session since I had my session at the same time. Anyway, congratulations on your paper and all awards from the BOP challenge!

I was very impressed when I saw your augmentation examples. I have never tried that extreme a level of augmentation, since I thought it would no longer represent real images. If I remember correctly, too strong an augmentation sometimes made performance worse when I trained my method. I also think one significant difference between your method and others is that you didn't freeze any layers during training, while others froze a few layers. In my case, I froze the first five blocks of the ResNet when training Mask-RCNN for BOP. When I didn't freeze the layers (by mistake) with the same augmentation, the detection performance was really bad (w/ synthetic images). Thus, it seems a small range of augmentation is sufficient when freezing the lower layers, while a larger range of augmentation is required to train all layers. What do you think about this?

I am wondering why you decided not to freeze layers, and whether you have tried the same augmentations when training networks with frozen layers (at least for detection methods). I am curious whether the extreme augmentations are also applicable in this case (freezing the lower layers).
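As an illustration of the freezing strategy discussed above, here is a minimal sketch of freezing the stem and the first two stages of a torchvision Mask R-CNN ResNet-50 backbone. The choice of which stages to freeze is only an example, not the setting used by either method in this thread.

```python
import torchvision

# Illustrative only: freeze the stem (conv1/bn1) and the first two ResNet
# stages of a torchvision Mask R-CNN backbone. (Newer torchvision versions
# use the `weights=` argument instead of `pretrained=`.)
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
for name, param in model.backbone.body.named_parameters():
    if name.startswith(('conv1', 'bn1', 'layer1', 'layer2')):
        param.requires_grad_(False)
```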

Thank you in advance!

ylabbe commented 4 years ago

@trevor-taeyeop,

See below the per-object YCB-Video results for the refiner model trained on my synthetic data + real images (the one currently in the paper, left) and for the model trained for the BOP challenge (right). I don't have time to make proper LaTeX tables, but the information is here. It seems that the model trained on PBR + real performs slightly better (on average) than the one trained on my non-photorealistic synthetic data.

[Image: per-object YCB-Video results, synthetic + real (left) vs. PBR + real (right)]

I also tag @martinsmeyer, as I told you during the poster session I would provide these results. However, I would say it's hard to draw definitive conclusions from this experiment alone since (i) there are real training images for YCB-Video, so it's not pure sim2real, and (ii) my dataset has 1 million synthetic images while there are only 50k PBR images. I am also not sure that the procedure and parameters used to generate the two datasets are exactly the same. I also think the PBR images may be more important for the detector than for the refiner model, which mostly has to focus on the object contours, but I haven't trained detection models with exactly the same parameters on both datasets to compare.

It's interesting to see that performance is much better with PBR images for the pudding box compared to my images. If I remember correctly, for this object you need to look at precise details of the texture to disambiguate the symmetries of the rectangular box, and maybe a better rendering of these textures in the PBR images makes it easier to generalize. The data augmentation tends to modify the images quite a lot, but the augmentation is random, so there are still images with few modifications, and some images are not modified at all (with probability 20%, see here). It may seem strange, but my intuition was that when you have real images, as in YCB-Video, maybe you don't want to augment all images. Same for the parameters of the augmentation: I just looked at batches of images and checked that there were both very hard and easy images. Using algorithms like AutoAugment or newer similar methods to tune the augmentation parameters would certainly improve the performance of both the pose estimation and detection networks.
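A minimal sketch of the "leave some images untouched" idea described above. The operations and parameter ranges below are illustrative, not the actual CosyPose augmentation pipeline; only skipping augmentation with some probability comes from the discussion.

```python
import random
from PIL import ImageEnhance, ImageFilter

def maybe_augment(img, p_skip=0.2):
    """Apply random photometric augmentation to a PIL image, or skip it.

    Illustrative operations and parameter ranges; only the idea of skipping
    augmentation with probability p_skip comes from the discussion above.
    """
    if random.random() < p_skip:
        return img  # leave some training images untouched
    if random.random() < 0.5:
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 2.0)))
    if random.random() < 0.5:
        img = ImageEnhance.Contrast(img).enhance(random.uniform(0.5, 1.5))
    if random.random() < 0.5:
        img = ImageEnhance.Brightness(img).enhance(random.uniform(0.5, 1.5))
    return img
```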

@kirumang

Thanks! And thanks again for providing your ICP implementation!

I did very few experiments with Mask-RCNN (in the paper I used detections from one of your RetinaNet models on T-LESS and from PoseCNN on YCB-V), but I tried with/without data augmentation on T-LESS and YCB-Video (PBR images) and the performance was very poor without data augmentation. Besides sim2real, it could also be a pure overfitting problem, given the relatively small training set (50k images) and the ~50 epochs I train for (I usually measure in iterations instead of epochs).

Regarding freezing specific layers of Mask-RCNN, I just looked into the details of torchvision's code. Even though this comment states that layers are not frozen if pretrained=False (which is why I thought none were; I use pretrained=False in my code, since the COCO checkpoint is loaded later), conv1 and layer1 are actually frozen, so what I said about no layers being frozen is not exactly true. It would be interesting to see whether this is really necessary when using my data augmentation. The pose estimation networks have no frozen layers and are trained from scratch.
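A quick way to verify this behaviour is to instantiate the torchvision model and list which backbone parameters have requires_grad turned off (the exact default may depend on the torchvision version):

```python
import torchvision

# List which backbone parameters torchvision leaves frozen. Around the
# torchvision versions from the time of this discussion, this prints the
# conv1/bn1 and layer1 parameters even with pretrained=False; newer versions
# expose trainable_backbone_layers / weights arguments that change this.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=False)
frozen = [name for name, p in model.backbone.body.named_parameters()
          if not p.requires_grad]
print(frozen)
```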

wangg12 commented 4 years ago

By default, maskrcnn freezes the layers in the stem and stage 1. @kirumang By freezing the first 5 blocks, do you mean freezing the stem (conv1_x) and stages 1, 2, 3, 4 (conv2_x, conv3_x, conv4_x, conv5_x), or just the first 5 residual blocks? The blocks do not actually correspond to the stages: there are multiple blocks in each stage (see Table 1 in the paper).
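For clarity, a small snippet showing the stage/block distinction in a torchvision ResNet-50 (each stage layer1..layer4 is a sequence of several bottleneck blocks):

```python
import torchvision

# Stage vs. block in ResNet-50: "the first 5 blocks" and "the first 5 stages"
# are different things, since each stage contains several bottleneck blocks.
resnet = torchvision.models.resnet50()
for name in ['layer1', 'layer2', 'layer3', 'layer4']:
    print(name, 'contains', len(getattr(resnet, name)), 'blocks')
# layer1: 3 blocks, layer2: 4, layer3: 6, layer4: 3
```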

kirumang commented 4 years ago

@ylabbe Thank you, Yann. I will play around a bit to see how the detection performance changes with your augmentation while freezing more stages. My current guess is: augmentation is not crucial when we freeze more stages, but there is an upper limit on performance since the number of trainable parameters is small; the pre-trained lower stages are sufficient to encode general features from real images without dramatic augmentation. On the other hand, more augmentation is necessary when we freeze fewer stages, but then we have more flexibility to fit the network to the target objects, which I expect to give better performance while avoiding overfitting to the small training data. For sure, good augmentation is essential when training from scratch.

@wangg12 Hi Gu, thank you for the correction; I should have referred to them as stages. Let me clarify: I used Matterport's implementation of Mask-RCNN (link) and passed the argument layers="5+". This means it freezes everything up to conv4_x and fine-tunes conv5_x and the head.

wangg12 commented 4 years ago

Hi @kirumang, so you are actually freezing the stem and stages 1, 2, 3. In my experience, I usually get worse results if I freeze stage 3 or higher, and good detection performance can be achieved by freezing the stem and stage 1 (the maskrcnn default) or additionally freezing stage 2.

BTW, I am also curious about how cosypose's augmentation performs compared to other augmentation styles like those in Pix2Pose or AAE.

ylabbe commented 4 years ago

@wangg12 thanks for your input! I am considering doing more experiments to see whether other augmentations are better or not. I also think that, in general, using data augmentation is necessary when training on synthetic images, but it certainly isn't sufficient for good performance. For pose estimation, the iterative refiner is also super important.

Out of curiosity, I am wondering why you are not using DeepIM to refine your poses in the RGB-only setting of your CDPNv2 submission, and whether you tried to use it for the challenge. I also saw in your CDPNv2 method description that you are using "FCOS with BackBone of vovnet-V2-57-FP" for detection instead of Mask-RCNN. Did you observe significant improvements over Mask-RCNN for 2D detection using this one? It would be great if you could share some insight on these points, but I also understand if you don't want to or cannot.

wangg12 commented 4 years ago

Hi @ylabbe, congratulations on your BOP results! We tried DeepIM last year but were not successful; perhaps the reason is that we didn't have good synthetic data back then. That's partially why we didn't try it this year. But I think it should work with the new PBR data, based on your observation.

Regarding detection: last year we tried Mask-RCNN and found it did not improve significantly over RetinaNet. This year we tried both RetinaNet and FCOS, and FCOS turned out to be more precise and faster. But for sure there are many detection and instance segmentation methods worth trying; we just don't have that many resources.

MartinSmeyer commented 4 years ago

Thanks @ylabbe for the comparisons. I would actually be interested in how big the difference is if you do not use any real data, i.e. comparing the pre-trained models trained on the BlenderProc images vs. the rasterized synthetic data. I think that together with real data, the sim2real transfer is pretty straightforward. But you had nice results on datasets like HB as well.

ylabbe commented 4 years ago

@MartinSmeyer

Synthetic-only results:

[Image: per-object YCB-Video results, synthetic-only training]

It's better on average with PBR images: ADD(-S) 74.4 vs. 69.5 for my synthetic images. PoseCNN has ADD(-S) 61.3. I'm not sure what's happening with the pudding/gelatin box, since the conclusion is different if you add real images.
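For context, the ADD(-S) numbers above follow the standard definitions; a minimal NumPy sketch (not CosyPose's evaluation code) is:

```python
import numpy as np

# Standard ADD / ADD-S definitions (not CosyPose's evaluation code).
# pts: (N, 3) object model points; (R, t): rotation matrix and translation.
def transform(pts, R, t):
    return pts @ R.T + t

def add(pts, R_gt, t_gt, R_est, t_est):
    # Mean distance between corresponding points (non-symmetric objects).
    return np.linalg.norm(
        transform(pts, R_gt, t_gt) - transform(pts, R_est, t_est), axis=1
    ).mean()

def add_s(pts, R_gt, t_gt, R_est, t_est):
    # Mean distance to the closest estimated point (symmetric objects).
    gt = transform(pts, R_gt, t_gt)
    est = transform(pts, R_est, t_est)
    dists = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=-1)  # (N, N)
    return dists.min(axis=1).mean()
```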

MartinSmeyer commented 4 years ago

Great, thanks!