zgojcic / Rigid3DSceneFlow

[CVPR 2021, Oral] "Weakly Supervised Learning of Rigid 3D Scene Flow"

About FG rigid transformation estimation #2

Closed XuyangBai closed 3 years ago

XuyangBai commented 3 years ago

Hi @zgojcic thanks for sharing this inspiring work!

Since the output of DBSCAN is unordered and the number of clusters may even differ between frames, how did you determine the corresponding FG instances to compute the rigid transformation? And will the instance segmentation result be good enough for DBSCAN at the beginning of training?

Another small question: why do you use a different tau for Eq. 9 and Eq. 10? Is the reason that they compute the softmax over different numbers of correspondences?

Best, Xuyang

zgojcic commented 3 years ago

Hi and thanks @XuyangBai.

The rigid transformation for the foreground clusters is computed from the predicted flow vectors, as we do not have cluster-to-cluster correspondences across the epochs (we actually only need to cluster the source epoch). You can check this part of the code: https://github.com/zgojcic/Rigid3DSceneFlow/blob/7fa57e3ddccf605dca63ded04825bba2272cae4a/lib/model/rigid_3d_sf.py#L268-L277 This could probably be improved, but as you mentioned, the number of clusters can differ between the epochs as objects (dis)appear. In the test-time optimization phase we simply index all the foreground points of the target epoch for each of the clusters in the source epoch.
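
In code terms, a minimal sketch of that step (my own simplified version, not the exact snippet linked above; `cluster_pts` and `pred_flow` are placeholder names): the pseudo-targets are just the source points plus the predicted flow, and the cluster transform is then a standard Kabsch fit.

```python
import torch

def fit_rigid_from_flow(cluster_pts, pred_flow):
    """Estimate (R, t) for one foreground cluster from its predicted flow.

    cluster_pts: (N, 3) points of the cluster in the source epoch
    pred_flow:   (N, 3) predicted per-point flow vectors
    """
    src = cluster_pts
    tgt = cluster_pts + pred_flow               # pseudo-targets implied by the flow

    src_mean = src.mean(dim=0, keepdim=True)
    tgt_mean = tgt.mean(dim=0, keepdim=True)

    # Kabsch: SVD of the cross-covariance of the centered point sets
    H = (src - src_mean).transpose(0, 1) @ (tgt - tgt_mean)
    U, S, Vh = torch.linalg.svd(H)
    V = Vh.transpose(0, 1)

    D = torch.eye(3, dtype=src.dtype, device=src.device)
    D[2, 2] = torch.sign(torch.det(V @ U.transpose(0, 1)))  # guard against reflections

    R = V @ D @ U.transpose(0, 1)
    t = tgt_mean.squeeze(0) - R @ src_mean.squeeze(0)
    return R, t
```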

During training we use the ground-truth FG/BG segmentation masks to index the foreground points, so DBSCAN should not have any problems at that stage.
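
Roughly, the clustering during training amounts to something like the following sketch (scikit-learn version for illustration; the toy data and the `eps`/`min_samples` values are made up here, not taken from the config):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# toy stand-ins: points (N, 3) and a ground-truth foreground mask (N,)
points = np.random.rand(1000, 3) * 20.0
fg_mask = points[:, 0] > 10.0

fg_pts = points[fg_mask]                                        # cluster only the foreground points
labels = DBSCAN(eps=0.75, min_samples=10).fit(fg_pts).labels_   # label -1 marks noise points
clusters = [fg_pts[labels == k] for k in np.unique(labels) if k != -1]
```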

The different tau is honestly just a legacy thing. We did not purposely intend to use different values; our goal was simply to prevent division by zero and to ensure gradient flow. Somehow we ended up with two values during implementation, and by the time we were writing the paper we had already run a lot of experiments. We did not really optimize this value, so it should not make a large difference as long as it is not too large or too small.
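
As a toy illustration of one common role such a constant plays (not necessarily the exact form of Eq. 9 and 10), tau acts like a temperature in the softmax over correspondence scores:

```python
import torch

# toy example: distances from one source point to four candidate target points
dists = torch.tensor([0.1, 0.4, 0.5, 2.0])

def soft_weights(d, tau):
    # temperature-scaled softmax: a larger tau gives smoother weights,
    # a smaller tau approaches a hard (one-hot) assignment
    return torch.softmax(-d / tau, dim=-1)

print(soft_weights(dists, tau=0.03))   # nearly one-hot
print(soft_weights(dists, tau=1.0))    # smooth weighting
```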

Best, Zan

XuyangBai commented 3 years ago

Hi @zgojcic

Thanks for pointing out the code snippets; this part is clear to me now.

Another question concerns the different designs for supervising the foreground and background scene flow. I understand that the foreground (rigid objects) has no ground-truth poses, so you have to supervise its scene flow as in Eq. 5 and 6, while the background scene flow can be supervised by the ground-truth pose. But supervision by the GT pose seems much stronger than supervision through Eq. 5 and 6: in Table 3, L_ego brings the largest performance gain, while L_rigid acts more or less as a regularization term. So if ground-truth poses of the rigid objects were also available (e.g., datasets with tracking labels such as nuScenes, where you would know the cluster-to-cluster correspondences as well as their transformations), could the network be further improved by supervising the rigid-object scene flow with an L_ego-style loss as well?

Best, Xuyang.

zgojcic commented 3 years ago

Hi @XuyangBai,

Yeah, you are completely right: supervising with GT poses is of course much stronger supervision than our formulation for the foreground objects. If one has access to the rigid transformations of the foreground objects, this would definitely further improve performance, but it does presuppose access to additional annotations.

If they are available in nuScenes, one could easily use them. For example, SemanticKITTI also provides temporally consistent instance-level annotations, which could easily be converted to transformations of the foreground objects.
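
A rough sketch of what that conversion could look like (my own illustration, not code from the repo; it assumes per-point instance IDs that are consistent across the two scans and uses Open3D's point-to-point ICP to register each instance):

```python
import numpy as np
import open3d as o3d

def instance_transforms(pts_t0, ids_t0, pts_t1, ids_t1, max_dist=1.0):
    """For every instance ID present in both scans, estimate its rigid motion
    by registering the two instance point clouds (centroid init + ICP)."""
    transforms = {}
    for inst in np.intersect1d(np.unique(ids_t0), np.unique(ids_t1)):
        src = pts_t0[ids_t0 == inst]
        tgt = pts_t1[ids_t1 == inst]

        src_pc = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(src))
        tgt_pc = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(tgt))

        # initialize with the centroid offset, then refine with point-to-point ICP
        init = np.eye(4)
        init[:3, 3] = tgt.mean(0) - src.mean(0)
        reg = o3d.pipelines.registration.registration_icp(
            src_pc, tgt_pc, max_dist, init,
            o3d.pipelines.registration.TransformationEstimationPointToPoint())
        transforms[int(inst)] = reg.transformation   # 4x4 pose of the object
    return transforms
```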

We did not pursue this path, as it was out of the scope of our paper, whose focus was showing the domain gap and providing a way to train on the target domain. There are several ways in which our method could/should be improved. I would be happy to discuss them offline if you plan to work on something in this field (just send me an email and we can arrange a meeting).

XuyangBai commented 3 years ago

Hi @zgojcic

Thanks a lot for the explanation. It would be interesting to leverage the instance-level annotations to further improve your framework. I read your paper when you released it, and recently, while reading papers on 3D object tracking, it occurred to me that your method might work as a tracking algorithm if the cluster-to-cluster correspondences could be found. I did then find some papers in the tracking literature that use scene flow for object tracking. I will let you know if I plan to work in this field or have more ideas to discuss with you. Again, congratulations on this nice work.

Best, Xuyang