zju3dv / OnePose

Code for "OnePose: One-Shot Object Pose Estimation without CAD Models", CVPR 2022
Apache License 2.0

Using the pretrained model on real data #14

Closed aditya1709 closed 2 years ago

aditya1709 commented 2 years ago

I'm just a little confused about how the network would function. If I used the pre-trained model on, say, a video recording of a shoe that I just made, would it work out of the box and return the object pose? Or does it involve recording a video of this particular object and then also supplying it with query images for the pose estimation? The inference part of the paper is unclear to me. Any help would be appreciated.

siatheindochinese commented 2 years ago

Looking at the onepose configs and datasets, I noticed a few things:

  • recordings of the videos (in .m4v) are converted to frames (in .png) in the color_full folder
  • for each object, there are 4 datasets (e.g. colorbox-box has colorbox-1, colorbox-2, colorbox-3 and colorbox-4)
  • the 1st dataset is used to create the SfM point cloud model
  • the 4th dataset is used for inference (predicting the pose of the desired object and collecting metrics, i.e. 5cm-5deg)
  • for training the GATsSPG, the 1st, 2nd and 3rd datasets are used
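
To summarize the split in one place, a small illustration in Python is given below. The dictionary layout is hypothetical (it is not the actual OnePose config format); only the roles of the four sequences are taken from the bullets above:

```python
# Hypothetical summary of the per-object split described above. This is NOT the
# real OnePose config format; only the roles of the four sequences come from
# inspecting the released configs and data.
splits = {
    "sfm_mapping": ["colorbox-1"],                              # build the SfM point cloud
    "train":       ["colorbox-1", "colorbox-2", "colorbox-3"],  # train GATsSPG
    "test":        ["colorbox-4"],                              # inference / 5cm-5deg metrics
}
```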

DeriZSY commented 2 years ago

I'm just a little confused about how the network would function. If I used the pre-trained model on, say, a video recording of a shoe that I just made, would it work out of the box and return the object pose? Or does it involve recording a video of this particular object and then also supplying it with query images for the pose estimation? The inference part of the paper is unclear to me. Any help would be appreciated.

Hi, thanks for your interest in our work. I will try to explain the mechanism of our approach:

  1. The pre-trained model is a generic model, the GATs network, for matching 2D and 3D keypoints.
  2. For a new target object (say, a shoe), you need a video scan with 3D bounding box annotation and camera poses for object reconstruction by structure from motion (SfM).
  3. Given an input image of the shoe, 2D keypoints are extracted from the image and matched with the 3D keypoints of the model reconstructed in the previous step to form 2D-3D correspondences. The object pose can then be solved with the Perspective-n-Point (PnP) algorithm (a minimal sketch of this step is given below).
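
For concreteness, a minimal sketch of step 3 with OpenCV's PnP solver is shown below. The feature extractor and the GATs matcher are passed in as hypothetical placeholders; this is not the actual OnePose API, only an illustration of the 2D-3D matching + PnP idea:

```python
# Minimal sketch of step 3 (2D-3D matching + PnP), assuming OpenCV and NumPy.
# `extract_keypoints` and `match_2d_3d` are hypothetical placeholders standing in
# for the feature extractor and the pre-trained GATs matcher.
import cv2
import numpy as np

def estimate_pose(query_img, sfm_points_xyz, K, extract_keypoints, match_2d_3d):
    # 1) Detect 2D keypoints and descriptors on the query image.
    kpts_2d, desc_2d = extract_keypoints(query_img)
    # 2) Match 2D descriptors against the 3D keypoints of the SfM model,
    #    producing index pairs, i.e. 2D-3D correspondences.
    idx_2d, idx_3d = match_2d_3d(desc_2d, sfm_points_xyz)
    pts_2d = np.asarray(kpts_2d)[idx_2d].astype(np.float64)
    pts_3d = np.asarray(sfm_points_xyz)[idx_3d].astype(np.float64)
    # 3) Solve the object pose with PnP + RANSAC.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K, None,
                                                 reprojectionError=3.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)  # axis-angle -> rotation matrix
    return R, tvec              # object-to-camera pose
```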

You are welcome to reopen the issue if you have any further questions.

qq456cvb commented 2 years ago

Looking at the onepose configs and datasets, I noticed a few things:

  • recordings of the videos (in .m4v) are converted to frames (in .png) in the color_full folder
  • for each object, there are 4 datasets (e.g. colorbox-box has colorbox-1, colorbox-2, colorbox-3 and colorbox-4)
  • the 1st dataset is used to create the SfM point cloud model
  • the 4th dataset is used for inference (predicting the pose of the desired object and collecting metrics, i.e. 5cm-5deg)
  • for training the GATsSPG, the 1st, 2nd and 3rd datasets are used

I am confused: it seems that the inference stage uses the same object as the training phase. I wonder how this algorithm generalizes to novel/unseen objects. Does evaluating on Objectron shoes only involve training with shoes from the OnePose dataset? It is not clear in the paper. Or, more generally, can we just train on OnePose cups and generalize to Objectron shoes?

Another related question:

  1. For a new target object (say a shoe), you need a video scan with 3D bounding box annotation and camera poses for object reconstruction by structure from motion.

Does this step involve training the network parameters with 2D-3D correspondences?

DeriZSY commented 2 years ago

Looking at the onepose configs and datasets, I noticed a few things:

  • recordings of the videos (in .m4v) are converted to frames (in .png) in the color_full folder
  • for each object, there are 4 datasets (e.g. colorbox-box has colorbox-1, colorbox-2, colorbox-3 and colorbox-4)
  • the 1st dataset is used to create the SfM point cloud model
  • the 4th dataset is used for inference (predicting the pose of the desired object and collecting metrics, i.e. 5cm-5deg)
  • for training the GATsSPG, the 1st, 2nd and 3rd datasets are used

I am confused: it seems that the inference stage uses the same object as the training phase. I wonder how this algorithm generalizes to novel/unseen objects. Does evaluating on Objectron shoes only involve training with shoes from the OnePose dataset? It is not clear in the paper. Or, more generally, can we just train on OnePose cups and generalize to Objectron shoes?

Another related question:

  1. For a new target object (say a shoe), you need a video scan with 3D bounding box annotation and camera poses for object reconstruction by structure from motion.

Does this step involve training the network parameters with 2D-3D correspondences?

Hi, thanks for your interest in our work. I think there may be some misconceptions about our problem setting, and I will try to clarify it.

In the paper, we are actually working on a different setting, one-shot object pose estimation, which is quite different from the well-known problem settings used in methods like PVNet or Objectron.

Given a target object, say a shoe, existing methods either train on annotated data for that exact shoe (instance-level methods like PVNet) or train on a large dataset of all sorts of shoes (category-level methods like Objectron). The trained network is then used for pose estimation.

In our setting, instead of training or fine-tuning any network, we run an SfM algorithm on a small amount of annotated data of that exact shoe to build an object map, which is then used for pose estimation. However, we indeed train a network for 2D-3D keypoint matching, which does not require re-training or fine-tuning for novel objects.

Please refer to the introduction part of our paper for more details.

  1. Q: " it seems that the inference stage uses the same object in the training phase" A: Yes, it is the same object but there is no training phase but only map-building phase.

  2. Q: "I wonder how does this algorithm generalize to novel/unseen objects." A: No, the pose estimation cannot work on novel objects without annotated data for map building.

  3. Q: " Does evaluating on Objectron shoes only involve training with shoes from OnePose dataset? " A: No. We compare Objectron and OnePose with data in our own dataset which is clearly explained in the paper.

  4. Q: " Or more generally, can we just train on OnePose cups and generalize to Objectron shoes?" A: Refer to A2.

  5. Q: "Does this step involve training the network parameters with 2D-3D correspondences?" A: No.

qq456cvb commented 2 years ago

Thanks for the detailed reply! It is clearer to me now, but there are still some questions left.

However, we indeed train a network for 2D-3D keypoint matching, which does not require re-training or fine-tuning for novel objects.

It seems that this 2D-3D keypoint matching is able to generalize to novel objects without re-training, but I do not see experiments validating this, e.g., training the matching GATs on known objects, then building the SfM map for an unseen object and testing on that unseen object. Section 4.3 only reports experiments where training/validation/testing use the same object, with no novel objects introduced in the validation/testing phase. It also seems that Section 4.3 shows the 2D-3D keypoint matching network generalizes to novel scenes/backgrounds with the same object, rather than to novel objects. Correct me if I am missing something.

aditya1709 commented 2 years ago

@DeriZSY Perhaps my question will be clearer with an example. Technically, I have understood the flow of the OnePose algorithm; the high-level question is about its generalization ability. Suppose I collect 4 videos for training/inference of, say, a sneaker (an Air Jordan). After training, can I use this network to perform inference on, say, North Star hiking boots? Or should I go through the process of collecting 4 more videos of the North Star hiking boots and train it to work on them?

siatheindochinese commented 2 years ago

@qq456cvb You can run the inference scripts provided in the repo on the validation set (unseen objects) and check the cm-deg metrics. There are far more objects in the validation set than in the training set.

The objects in the validation set are all different from the ones in the training set; you can see a list of them in the config files.
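
For reference, the cm-deg metric mentioned above (e.g. 5cm-5deg) is typically computed as sketched below. This is the standard formulation, not the actual OnePose evaluation code, and it assumes translations are given in metres:

```python
# Generic sketch of the 5cm-5deg pose-accuracy check (standard formulation,
# not taken from the OnePose evaluation code). Translations are in metres.
import numpy as np

def pose_errors(R_pred, t_pred, R_gt, t_gt):
    t_err_cm = np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)) * 100.0
    cos_angle = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0    # rotation geodesic distance
    r_err_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return t_err_cm, r_err_deg

def within_5cm_5deg(R_pred, t_pred, R_gt, t_gt):
    t_err, r_err = pose_errors(R_pred, t_pred, R_gt, t_gt)
    return t_err < 5.0 and r_err < 5.0
```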

qq456cvb commented 2 years ago

@siatheindochinese Thanks for your reply. That makes sense, and I will try it once I've downloaded the data.

AnukritiSinghh commented 1 year ago

@DeriZSY Perhaps my question will be clearer with an example. Technically, I have understood the flow of the OnePose algorithm; the high-level question is about its generalization ability. Suppose I collect 4 videos for training/inference of, say, a sneaker (an Air Jordan). After training, can I use this network to perform inference on, say, North Star hiking boots? Or should I go through the process of collecting 4 more videos of the North Star hiking boots and train it to work on them?

Hey, were you able to figure out the answer to this?

luccachiang commented 1 year ago

Hi, thanks for your work! Could you please write a tutorial demonstrating how to use OnePose on custom real-world data? I wonder what kind of data we need and how to get the model to run inference on real data.