Question about reproducing multi-view results

I have a question about reproducing the multi-view results on YCB-Video. I tried to reproduce your results using your pretrained model.

**Q1. How can I get results similar to those in the paper?** Also, for YCB-Video, why is there no pretrained coarse model as there is for T-LESS? Is it not necessary?

1:18:06.537026 - Skipped: posecnn_init/cand_inputs (N=15435)
1:18:06.537151 - Skipped: posecnn_init/cand_matched (N=2875)
1:18:06.537197 - Skipped: posecnn_init/external_coarse (N=15435)
1:18:06.537236 - Skipped: posecnn_init/refiner/iteration=1 (N=15435)
1:18:06.537272 - Evaluation : posecnn_init/refiner/iteration=2 (N=15435)
1:42:17.024973 - Skipped: posecnn_init/scene/cameras (N=857)
1:42:17.025123 - Skipped: posecnn_init/scene/objects (N=1284)
1:42:17.060143 - --------------------------------------------------------------------------------
1:42:17.060260 - Results:
PoseCNN/AUC of ADD(-S): 0.6128217168893441
Singleview/AUC of ADD(-S): 0.5329731760227406
Singleview/AUC of ADD-S: 0.6900167570728803
Multiview (n=5)/AUC of ADD(-S): 0.5320449810191621
Multiview (n=5)/AUC of ADD-S: 0.6895387685823141

I ran this command twice and got the same results both times: python -m cosypose.scripts.run_cosypose_eval --config ycbv --nviews=5

I downloaded the pretrained model using this command:

# YCB-V Single-view refiner
python -m cosypose.scripts.download --model=ycbv-refiner-finetune--251020

Hi @trevor-taeyeop,

In the paper, we use PoseCNN predictions for coarse initialization on YCB-Video. A coarse model can of course be used on YCB-Video if one is trained. If you look at the BOP20 results section, you can see that we also reported multi-view results on YCB-Video, T-LESS and HomebrewDB (all using the same parameters) using the detector + coarse + refiner models we trained for the BOP challenge.

Your results seem very strange, as the single-view results are not correct. This is likely an installation problem. Could you share the results directory that is saved by the script? Can you also report the full output of python -m cosypose.scripts.run_cosypose_eval --config ycbv --nviews=5 --debug (including the prints at the beginning with the pybullet build time, EGL and GPU specs)?

I installed with conda using this command: conda env create -n cosypose --file environment.yaml

[My Result] https://drive.google.com/drive/folders/1ar5w_Jpu2arxTKMQrdZzQ8dygShVGRMM?usp=sharing

[My Result (debug)] https://drive.google.com/drive/folders/1QlYALRYUSjRiccpBN56o7zq0W0-zLDk2?usp=sharing

pybullet build time: Aug 14 2020 02:50:31
0:00:00.000749 - Starting ...
0:00:04.687016 - SAVE DIR: /sdata1/workspace/cosypose/local_data/results/ycbv-n_views=5--7420600231
0:00:04.687616 - Coarse: None
0:00:04.687715 - Refiner: ycbv-refiner-finetune--251020
0:00:04.938782 - Building index and loading annotations...
[Memory]5.3s, 0.1min    : Loading build_index...
Loaded EGL 1.5 after reload.
GL_VENDOR=NVIDIA Corporation
GL_RENDERER=TITAN Xp/PCIe/SSE2
GL_VERSION=4.6.0 NVIDIA 418.56
GL_SHADING_LANGUAGE_VERSION=4.60 NVIDIA
Version = 4.6.0 NVIDIA 418.56
Vendor = NVIDIA Corporation
Renderer = TITAN Xp/PCIe/SSE2
0:00:13.786362 - Backbone: efficientnet-b3
[Memory]14.7s, 0.2min   : Loading load_posecnn_results...
0:00:14.117544 - Prediction: posecnn_init
  0%|                                                                                                                                                                                         | 0/2 [00:00<?, ?it/s]0:00:14.245755 - --------------------------------------------------------------------------------
0:00:14.246253 - Scene: [48]
0:00:14.246519 - Views: [1412 1626 1879 1894 1947]
0:00:14.246709 - Group: [1]
0:00:14.246786 - Image has 25 gt detections. (not used)
ven = NVIDIA Corporation
ven = NVIDIA Corporation
0:00:16.520156 - Pose prediction on 29 detections (n_iterations=2): 0:00:02.004471
0:00:16.537340 - Num candidates: 29
0:00:16.537419 - Num views: 5
0:00:16.537896 - Estimating camera poses using RANSAC.
0:00:16.584814 - Matched candidates: 6
0:00:16.584917 - RANSAC time_models: 0:00:00.016256
0:00:16.584958 - RANSAC time_score: 0:00:00.018119
0:00:16.584991 - RANSAC time_misc: 0:00:00.012186
0:00:16.990041 - BA time_init: 0:00:00.004868
0:00:16.990213 - BA time_opt: 0:00:00.310305
0:00:16.990285 - BA time_misc: 0:00:00.019971
0:00:16.753706 - --------------------------------------------------------------------------------
 50%|████████████████████████████████████████████████████████████████████████████████████████▌                                                                                        | 1/2 [00:02<00:02,  2.62s/it]0:00:16.878991 - --------------------------------------------------------------------------------
0:00:16.879321 - Scene: [48]
0:00:16.879462 - Views: [ 803 1104 1233 1745 1866]
0:00:16.879575 - Group: [0]
0:00:16.879627 - Image has 25 gt detections. (not used)
0:00:18.469205 - Pose prediction on 30 detections (n_iterations=2): 0:00:01.331535
0:00:18.484633 - Num candidates: 30
0:00:18.484743 - Num views: 5
0:00:18.485215 - Estimating camera poses using RANSAC.
0:00:18.598351 - Matched candidates: 0
0:00:18.598478 - RANSAC time_models: 0:00:00.057635
0:00:18.598537 - RANSAC time_score: 0:00:00.045624
0:00:18.598572 - RANSAC time_misc: 0:00:00.009501
0:00:18.363152 - --------------------------------------------------------------------------------
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.12s/it]
0:00:18.378871 - Done with predictions
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 40.54it/s]
0:00:18.687304 - Evaluation : posecnn (N=15435)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:07<00:00,  1.32it/s]
0:00:26.514516 - Skipped: posecnn_init/ba_input (N=6)
0:00:26.514750 - Skipped: posecnn_init/ba_output (N=6)
0:00:26.514824 - Evaluation : posecnn_init/ba_output+all_cand (N=65)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:06<00:00,  1.44it/s]
0:00:33.708746 - Skipped: posecnn_init/cand_inputs (N=59)
0:00:33.708979 - Skipped: posecnn_init/cand_matched (N=6)
0:00:33.709041 - Skipped: posecnn_init/external_coarse (N=59)
0:00:33.709082 - Skipped: posecnn_init/refiner/iteration=1 (N=59)
0:00:33.709122 - Evaluation : posecnn_init/refiner/iteration=2 (N=59)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:07<00:00,  1.40it/s]
0:00:41.084555 - Skipped: posecnn_init/scene/cameras (N=2)
0:00:41.084680 - Skipped: posecnn_init/scene/objects (N=3)
0:00:41.101151 - --------------------------------------------------------------------------------
0:00:41.101260 - Results:
PoseCNN/AUC of ADD(-S): 0.6420061089470982
Singleview/AUC of ADD(-S): 0.6217930275015533
Singleview/AUC of ADD-S: 0.7333381084538997
Multiview (n=5)/AUC of ADD(-S): 0.6198825382627546
Multiview (n=5)/AUC of ADD-S: 0.7323008635826409
0:00:41.101327 - --------------------------------------------------------------------------------
0:00:41.198950 - Saved: /sdata1/workspace/cosypose/local_data/results/ycbv-n_views=5--7420600231
Destroy EGL OpenGL window.

Thanks! I think the issue is pybullet: the log shows it was compiled on Aug 14, but it's important to compile and use the version that is in deps. Can you try:

pip uninstall pybullet
cd deps/bullet3
python setup.py build
python setup.py install

and run the same command? The pybullet build time should then be today's date.
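As a complement to checking the build time in the log, here is a minimal sketch for seeing which pybullet the active environment would actually import. The helper name describe_package is hypothetical (it is not part of cosypose or pybullet), and the sketch only reports where the module is found and what version metadata is registered. It makes no assumption about what version string the fork in deps uses.

```python
import importlib.metadata
import importlib.util


def describe_package(name):
    """Return where `name` would be imported from and its installed version.

    Returns None if the module cannot be found in the current environment.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    try:
        version = importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        # Importable, but no distribution metadata is registered under this name.
        version = None
    return {"location": spec.origin, "version": version}


if __name__ == "__main__":
    # Run this inside the environment used for the evaluation.
    print(describe_package("pybullet"))
```

If the reported location is outside the environment, or the entry does not change after rebuilding from deps/bullet3, a previously installed or cached build is probably still being picked up.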

My results look like this for the debug run:

PoseCNN/AUC of ADD(-S): 0.6420061091333628
Singleview/AUC of ADD(-S): 0.8558478841930628
Singleview/AUC of ADD-S: 0.8979679813981056
Multiview (n=5)/AUC of ADD(-S): 0.8711556944996118
Multiview (n=5)/AUC of ADD-S: 0.9246994818560779

Yes, the problem was solved after installing the new pybullet. What is different in this pybullet? Is it just a compilation issue?


PoseCNN/AUC of ADD(-S): 0.6128217168893441
Singleview/AUC of ADD(-S): 0.8446906646130059
Singleview/AUC of ADD-S: 0.8984468300521063
Multiview (n=5)/AUC of ADD(-S): 0.8890282181359238
Multiview (n=5)/AUC of ADD-S: 0.934013368742991

Great @trevor-taeyeop, thanks for pointing this out. I think the problem is that you had installed pybullet 2.5.5 in another environment, and pip used the cached wheel file instead of recompiling from the sources in the repo. I changed the package version name in my version of pybullet, so this shouldn't happen in the future.

You can check the differences from upstream pybullet in my fork. The main differences are:

- By default, camera roll is ignored by pybullet's pb.computeViewMatrixFromYawPitchRoll; my version handles roll.
- Fixes to the EGL plugin: bugs in texture rendering, the ability to pick the EGL device (useful on multi-GPU machines), and removal of the XYZ axes that are shown by default and hard-coded in the plugin.

Some of these bugs may have been fixed in bullet's upstream, but I still use my version to ensure reproducibility. Rendering is of course very important, as the renderings are given as input to the pose estimation models. If the renderings are incorrect, the predicted poses will be incorrect, as you observed when using a pybullet version with incorrect renderings.

Thanks! In addition, I have some questions regarding multi-view.

As I understand it, the *-meta.mat file contains the camera pose (rotation_translation_matrix: RT of the camera motion in 3D).

Suppose I have two images, frames t and t+1. In most scenarios, as I understand it, the objects are fixed and only the camera moves. Therefore, the relative camera pose (t, t+1) should be the same as the relative pose (t, t+1) of every object. But it was not: each object has a different relative pose.

Q1. Can you advise which part I am missing?

@trevor-taeyeop I don't really understand what you are referring to. There are no *-meta.mat (MATLAB) files in my repository. I also don't really understand your example: which frames are you referring to? Are you using the script in a custom scenario? Your notation seems to imply a temporal ordering, but images are processed as a batch in the multi-view setup and there is no notion of temporal continuity here. Can you clarify a bit?

Maybe this is only loosely related to your code. I tried it because I was curious while reading your paper. Sorry for asking an unclear and somewhat unrelated question.

A unified framework (ECCV 2018) used the ground-truth camera pose to align the viewpoints. So I simply wanted to check, when the camera pose is given, how the alignment is done and how the relative object pose changes.

To get the camera pose, I used the YCB annotation files (*-meta.mat) from the YCB toolbox GitHub repository (https://github.com/yuxng/YCB_Video_toolbox).

In this process, the relative camera pose is used to align the viewpoints. I thought this relative camera pose would be the same as the relative object pose between the image views. But my simple code shows that the relative camera pose and the relative object pose are not the same.

This is my pseudocode. cam_relative is the relative camera pose and obj_relative is the relative object pose.

    import numpy as np

    # cam_pose, cam_pose2: 3x4 [R|t] camera poses for the two frames
    # pose_label, pose_label2: 3x4 [R|t] poses of the same object in the two frames
    bottom_row = np.array([[0.0, 0.0, 0.0, 1.0]])  # turns a 3x4 [R|t] into a 4x4 transform

    np.set_printoptions(suppress=True)

    # Relative camera pose
    cam_pose_a = np.concatenate((cam_pose, bottom_row))
    cam_pose_b = np.concatenate((cam_pose2, bottom_row))
    cam_relative = np.linalg.inv(cam_pose_a) @ cam_pose_b

    # Relative object pose
    obj_pose_a = np.concatenate((pose_label, bottom_row))
    obj_pose_b = np.concatenate((pose_label2, bottom_row))
    obj_relative = np.linalg.inv(obj_pose_a) @ obj_pose_b

    # Expected: pose_new_from_cam == obj_pose_b[:3, :]
    pose_new_from_cam = (obj_pose_a @ cam_relative)[:3, :]
    # Expected: pose_new_from_obj == obj_pose_b[:3, :]
    pose_new_from_obj = (obj_pose_a @ obj_relative)[:3, :]
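A minimal sketch of the frame bookkeeping behind this question, using synthetic poses rather than YCB-Video annotations. It assumes that each camera matrix maps world coordinates to camera coordinates and that each object pose maps object coordinates to camera coordinates. Whether rotation_translation_matrix and the pose labels in *-meta.mat follow exactly these conventions is an assumption here and should be checked against the dataset documentation. All names (random_pose, T_c1_w, and so on) are illustrative.

```python
import numpy as np


def random_pose(rng):
    """Random rigid transform as a 4x4 matrix (rotation from QR, random translation)."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1  # flip one axis so the rotation has determinant +1
    pose = np.eye(4)
    pose[:3, :3] = q
    pose[:3, 3] = rng.normal(size=3)
    return pose


rng = np.random.default_rng(0)

# Assumed convention: T_c_w maps world -> camera, T_w_o maps object -> world.
T_w_objs = [random_pose(rng) for _ in range(3)]  # three static objects
T_c1_w = random_pose(rng)  # camera at the first view
T_c2_w = random_pose(rng)  # camera at the second view

# Object poses expressed in each camera frame.
T_c1_objs = [T_c1_w @ T for T in T_w_objs]
T_c2_objs = [T_c2_w @ T for T in T_w_objs]

# Camera-frame relative transform (view 1 -> view 2), applied on the left.
cam_relative = T_c2_w @ np.linalg.inv(T_c1_w)

# Left-multiplied form: identical for every static object.
left_relatives = [b @ np.linalg.inv(a) for a, b in zip(T_c1_objs, T_c2_objs)]
# Right-multiplied form: expressed in each object's own frame, so it differs per object.
right_relatives = [np.linalg.inv(a) @ b for a, b in zip(T_c1_objs, T_c2_objs)]

all_left_match = all(np.allclose(r, cam_relative) for r in left_relatives)
right_forms_agree = np.allclose(right_relatives[0], right_relatives[1])
print("left-multiplied relatives equal the camera relative:", all_left_match)
print("right-multiplied relatives agree across objects:", right_forms_agree)
```

Under these assumptions, the quantity shared by all static objects is pose_b @ inv(pose_a), and the new object pose is obtained as cam_relative @ pose_a. The form inv(pose_a) @ pose_b is expressed in each object's own frame and is generally different for each object, even with noise-free data.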

I am not sure what you are trying to do, but from what I know some YCB annotations are a bit noisy, and it is possible that the relative object poses are not exactly fixed across images. Please check directly with the people who made the dataset.

This is really unrelated to this repository and paper. Please close the issue if I have addressed your problem regarding things related to this code.

Hi @ylabbe, I also ran the evaluation as taeyeop-lee did at the beginning of this issue, and my results are also not as good as they should be. I tried the pybullet fix described above, but it did not help. Could you help me find the problem? Here are my results:

python -m cosypose.scripts.run_cosypose_eval --config ycbv --nviews=5 --debug

Setting OMP and MKL num threads to 1.
pybullet build time: Sep 22 2021 08:30:38
0:00:00.000844 - Starting ...
0:00:01.148508 - SAVE DIR: /home/scale/dev/cosypose/local_data/results/ycbv-n_views=5--3078392460
0:00:01.148588 - Coarse: None
0:00:01.148610 - Refiner: ycbv-refiner-finetune--251020
0:00:01.686880 - Building index and loading annotations...
EGL device choice: 0 of 3 (from EGL_VISIBLE_DEVICES)
Loaded EGL 1.5 after reload.
GL_VENDOR=NVIDIA Corporation
GL_RENDERER=Quadro P4000/PCIe/SSE2
GL_VERSION=4.6.0 NVIDIA 460.80
GL_SHADING_LANGUAGE_VERSION=4.60 NVIDIA
Version = 4.6.0 NVIDIA 460.80
Vendor = NVIDIA Corporation
Renderer = Quadro P4000/PCIe/SSE2
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000021/obj_000021.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000011/obj_000011.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000019/obj_000019.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000007/obj_000007.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000002/obj_000002.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000003/obj_000003.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000001/obj_000001.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000015/obj_000015.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000020/obj_000020.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000014/obj_000014.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000004/obj_000004.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000016/obj_000016.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000006/obj_000006.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000005/obj_000005.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000008/obj_000008.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000010/obj_000010.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000012/obj_000012.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000018/obj_000018.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000017/obj_000017.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000009/obj_000009.urdf
/home/scale/dev/cosypose/local_data/urdfs/ycbv/obj_000013/obj_000013.urdf
0:00:09.428402 - Backbone: efficientnet-b3
[Memory]9.8s, 0.2min    : Loading load_posecnn_results...
0:00:09.016403 - Prediction: posecnn_init
  0%|                                                                                                                                              | 0/2 [00:00<?, ?it/s]0:00:09.096409 - --------------------------------------------------------------------------------
0:00:09.096610 - Scene: [48]
0:00:09.096692 - Views: [1412 1626 1879 1894 1947]
0:00:09.096751 - Group: [1]
0:00:09.096776 - Image has 25 gt detections. (not used)
ven = NVIDIA Corporation
ven = NVIDIA Corporation
0:00:10.574877 - Pose prediction on 29 detections (n_iterations=2): 0:00:00.935247
0:00:10.576524 - Num candidates: 29
0:00:10.576598 - Num views: 5
0:00:10.576954 - Estimating camera poses using RANSAC.
0:00:10.625365 - Matched candidates: 20
0:00:10.625465 - RANSAC time_models: 0:00:00.009052
0:00:10.625510 - RANSAC time_score: 0:00:00.024841
0:00:10.625544 - RANSAC time_misc: 0:00:00.014380
0:00:13.381330 - BA time_init: 0:00:00.002969
0:00:13.381428 - BA time_opt: 0:00:02.529140
0:00:13.381455 - BA time_misc: 0:00:00.091897
0:00:12.856525 - --------------------------------------------------------------------------------
 50%|███████████████████████████████████████████████████████████████████                                                                   | 1/2 [00:03<00:03,  3.83s/it]0:00:12.921968 - --------------------------------------------------------------------------------
0:00:12.922147 - Scene: [48]
0:00:12.922225 - Views: [ 803 1104 1233 1745 1866]
0:00:12.922282 - Group: [0]
0:00:12.922306 - Image has 25 gt detections. (not used)
0:00:14.272430 - Pose prediction on 30 detections (n_iterations=2): 0:00:00.807332
0:00:14.273899 - Num candidates: 30
0:00:14.273969 - Num views: 5
0:00:14.274318 - Estimating camera poses using RANSAC.
0:00:14.323040 - Matched candidates: 18
0:00:14.323128 - RANSAC time_models: 0:00:00.008997
0:00:14.323154 - RANSAC time_score: 0:00:00.024973
0:00:14.323175 - RANSAC time_misc: 0:00:00.014623
0:00:16.937514 - BA time_init: 0:00:00.002491
0:00:16.937597 - BA time_opt: 0:00:02.469718
0:00:16.937621 - BA time_misc: 0:00:00.094344
0:00:19.434161 - BA time_init: 0:00:00.002712
0:00:19.434240 - BA time_opt: 0:00:02.309194
0:00:19.434263 - BA time_misc: 0:00:00.095649
0:00:18.911255 - --------------------------------------------------------------------------------
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.94s/it]
0:00:18.922851 - Done with predictions
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 66.98it/s]
0:00:19.098528 - Evaluation : posecnn (N=15435)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:03<00:00,  2.93it/s]
/home/scale/dev/cosypose/cosypose/evaluation/meters/pose_meters.py:278: FutureWarning: The 'contains' method is deprecated and will be removed in a future version. Use 'key in index' instead of 'index.contains(key)'
  if df.index.contains(label):
0:00:22.703747 - Skipped: posecnn_init/ba_input (N=45)
0:00:22.703847 - Skipped: posecnn_init/ba_output (N=45)
0:00:22.703871 - Evaluation : posecnn_init/ba_output+all_cand (N=104)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:03<00:00,  2.85it/s]
0:00:26.417408 - Skipped: posecnn_init/cand_inputs (N=59)
0:00:26.417486 - Skipped: posecnn_init/cand_matched (N=38)
0:00:26.417511 - Skipped: posecnn_init/external_coarse (N=59)
0:00:26.417549 - Skipped: posecnn_init/refiner/iteration=1 (N=59)
0:00:26.417571 - Evaluation : posecnn_init/refiner/iteration=2 (N=59)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:03<00:00,  3.00it/s]
0:00:29.959141 - Skipped: posecnn_init/scene/cameras (N=9)
0:00:29.959223 - Skipped: posecnn_init/scene/objects (N=14)
0:00:29.970344 - --------------------------------------------------------------------------------
0:00:29.970436 - Results:
PoseCNN/AUC of ADD(-S): 0.6420061091333628
Singleview/AUC of ADD(-S): 0.6740824898239226
Singleview/AUC of ADD-S: 0.7165475445333869
Multiview (n=5)/AUC of ADD(-S): 0.6982525571808219
Multiview (n=5)/AUC of ADD-S: 0.7514866081159564
0:00:29.970467 - --------------------------------------------------------------------------------
0:00:30.032115 - Saved: /home/scale/dev/cosypose/local_data/results/ycbv-n_views=5--3078392460
Destroy EGL OpenGL window.

ylabbe / cosypose

Question about reproducing multi-view results #2