weiyithu / SurroundDepth

[CoRL 2022] SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation
MIT License

A question about self-supervised and supervised learning #3

Open hhhharold opened 2 years ago

hhhharold commented 2 years ago

Basically, we use self-supervised methods to train depth prediction models. Have you tried combining self-supervised training with supervised learning? The nuScenes and DDAD datasets provide sparse point clouds. I tried it, but got bad performance and could not figure out what led to such a result.

weiyithu commented 2 years ago

Hi, how do you use the sparse point clouds? If you have point clouds, you do not need SfM pretraining, and the depth network can learn the scale since point clouds have real-world scale. We have tried combining with L1 supervision from point clouds, which can boost the performance.
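For illustration, here is a minimal sketch of what such a combination could look like. This is not the repository's code; `photometric_loss`, `sparse_gt_depth`, and `lambda_l1` are placeholder names, and the sparse GT map is assumed to be 0 at pixels without a LiDAR return.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred_depth, sparse_gt_depth, photometric_loss, lambda_l1=0.1,
                  min_depth=0.1, max_depth=80.0):
    """Add a sparse-LiDAR L1 term to the self-supervised photometric loss.

    pred_depth:       [B, H, W] predicted metric depth
    sparse_gt_depth:  [B, H, W] LiDAR depth projected into the image, 0 where no return
    photometric_loss: scalar tensor from the usual view-synthesis objective
    """
    # Only supervise pixels that actually have a LiDAR return in a valid range.
    valid = (sparse_gt_depth > min_depth) & (sparse_gt_depth < max_depth)
    if valid.any():
        l1 = F.l1_loss(pred_depth[valid], sparse_gt_depth[valid], reduction='mean')
    else:
        l1 = torch.zeros((), device=pred_depth.device)
    return photometric_loss + lambda_l1 * l1
```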

hhhharold commented 2 years ago

Thanks for your reply~~~ Did you try it on DDAD or nuScenes? I tried on nuScenes and used the ground-truth poses instead of PoseNet. In every iteration, both the photometric loss and the L1 loss against the sparse point cloud are calculated, but the resulting depth looks like lidar waves (shown in the figure). I tried adjusting the weight of the L1 loss from 1.0 down to 0.001, but the result is still not good.

[image: cam0_000]

My code for calculating the supervision loss is:

```python
if self.with_gt_depth:
    gt_depth = inputs[('gt_depth', 0)]
    pred_depth = outputs[("depth", 0, scale)].squeeze(1)

    # Only supervise pixels with a valid LiDAR depth.
    mask = torch.logical_and(gt_depth > self.min_depth, gt_depth < self.max_depth)
    pred_depth = pred_depth[mask]
    gt_depth = gt_depth[mask]

    # size_average is deprecated in recent PyTorch; reduction='mean' is equivalent.
    losses["sup_depth_loss/{}".format(scale)] = self.sup_weight * F.l1_loss(
        pred_depth, gt_depth, reduction='mean')
```
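For reference, a hedged sketch of how a sparse GT depth map like `inputs[('gt_depth', 0)]` could be built by projecting LiDAR points into the camera. This is an assumption about the data pipeline, not code from this repository; `points_lidar`, `T_cam_lidar`, and `K` are placeholder names.

```python
import numpy as np

def lidar_to_sparse_depth(points_lidar, T_cam_lidar, K, height, width):
    """Project LiDAR points into an image to build a sparse GT depth map.

    points_lidar: [N, 3] points in the LiDAR frame
    T_cam_lidar:  [4, 4] LiDAR-to-camera extrinsic
    K:            [3, 3] camera intrinsics
    Returns a [height, width] float map with 0 at pixels without a LiDAR return.
    """
    # Move points into the camera frame (homogeneous coordinates).
    pts_h = np.concatenate([points_lidar, np.ones((len(points_lidar), 1))], axis=1)
    pts_cam = (T_cam_lidar @ pts_h.T)[:3]                 # [3, N]

    # Drop points behind the camera, then project with the intrinsics.
    pts_cam = pts_cam[:, pts_cam[2] > 1e-3]
    uv = K @ pts_cam
    u = np.round(uv[0] / uv[2]).astype(int)
    v = np.round(uv[1] / uv[2]).astype(int)
    z = pts_cam[2]

    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.zeros((height, width), dtype=np.float32)
    # If several points fall on one pixel, the last write wins; keeping the
    # nearest point would be slightly more careful.
    depth[v[inside], u[inside]] = z[inside]
    return depth
```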

weiyithu commented 2 years ago

I have tried it on DDAD, and lidar supervision improves the quantitative results. Have you evaluated the quantitative results?

hhhharold commented 2 years ago

> I have tried it on DDAD, and lidar supervision improves the quantitative results. Have you evaluated the quantitative results?

Yes, the quantitative results of the self-supervised model were good, but after adding the lidar supervision the results were not. I will have another try on DDAD and check whether there are mistakes in my code.

After adding lidar supervision:

| abs_rel | sq_rel | rmse | rmse_log | a1 | a2 | a3 |
| --- | --- | --- | --- | --- | --- | --- |
| 0.321 | 3.702 | 10.027 | 0.632 | 0.427 | 0.586 | 0.704 |

By the way, do you have any solution to the problem of depth consistency within a single instance? When I generate a (pseudo-)LiDAR point cloud from the predicted depth map and view it in bird's-eye view, the contours of vehicles are not flat but somewhat distorted, which is caused by the inconsistency of the depth predicted by the model. As shown in my figure (self-supervised training only), the rear of the white van and the cars parked on the left side of the street have some noticeable distortions.

[screenshot: 2022-04-27 11-24-28]
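Assuming the point cloud above is obtained by back-projecting the predicted depth map, here is a minimal sketch of that back-projection (nothing here comes from the repository; `K` is the camera intrinsic matrix). Any depth error along an object's surface shows up directly as a bent contour in bird's-eye view.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, K):
    """Back-project a [H, W] metric depth map into a camera-frame point cloud."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Unproject pixel (u, v) with depth z: X = z * K^-1 [u, v, 1]^T
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float32)
    rays = np.linalg.inv(K) @ pix                  # [3, H*W] rays through each pixel
    points = rays * depth.reshape(1, -1)           # scale each ray by its depth
    return points.T                                # [H*W, 3], x right, y down, z forward
```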

weiyithu commented 2 years ago

The quantitative results seem strange. I remember that if we add lidar supervision, the abs_rel will be less than 0.190. Does the 'inconsistency' here mean the inconsistency between the surrounding views' depth maps?

hhhharold commented 2 years ago

I tried on DDAD, and the quantitative results on the validation set (scale-ambiguous) were pretty good.

| Camera | abs_rel | sq_rel | rmse | rmse_log | a1 | a2 | a3 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CAMERA_01 | 0.124 | 3.287 | 18.271 | 0.206 | 0.868 | 0.962 | 0.988 |
| CAMERA_05 | 0.126 | 2.964 | 17.182 | 0.220 | 0.843 | 0.929 | 0.974 |
| CAMERA_07 | 0.092 | 1.676 | 12.395 | 0.166 | 0.878 | 0.964 | 0.992 |
| CAMERA_09 | 0.135 | 2.684 | 17.036 | 0.233 | 0.849 | 0.943 | 0.971 |
| CAMERA_08 | 0.181 | 4.067 | 17.624 | 0.301 | 0.763 | 0.899 | 0.957 |
| CAMERA_06 | 0.122 | 1.981 | 10.549 | 0.227 | 0.871 | 0.955 | 0.977 |
| all | 0.130 | 2.777 | 15.510 | 0.225 | 0.845 | 0.942 | 0.976 |


The "inconsistency" means the inconsistency depth map of a instance in one view (Front view for example). As shown in the figure below, the rear surface of the van should be geometrically flat and straight, not crooked.

[image: 1569371271086785]

[screenshot: 2022-06-07 14-40-55]

weiyithu commented 2 years ago

If you use LiDAR supervision, you should use scale-aware evaluation, since LiDAR points have real-world scale. I think the inconsistency is caused by inaccurate depth estimation; in fact, monocular depth estimation is not good at recovering planes.
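A small sketch of the difference between the two protocols, assuming the common per-image median-scaling convention for the scale-ambiguous setting (not code from this repository):

```python
import numpy as np

def abs_rel(pred, gt, scale_aware=True):
    """abs_rel over valid GT pixels, with or without per-image median scaling.

    With LiDAR supervision the network predicts metric depth, so
    scale_aware=True (no rescaling) is the fair setting; median scaling
    hides scale errors and is only meant for scale-ambiguous models.
    """
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    if not scale_aware:
        pred = pred * np.median(gt) / np.median(pred)   # scale-ambiguous protocol
    return np.mean(np.abs(pred - gt) / gt)
```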

GANWANSHUI commented 2 years ago

> Thanks for your reply~~~ Did you try it on DDAD or nuScenes? I tried on nuScenes and used the ground-truth poses instead of PoseNet. [... quoting the earlier comment and supervision-loss code above ...]

Hi, I also tried the model with GT depth supervision on the nuScenes dataset and met the same problem (the lidar-wave-like depth map). May I ask whether you have addressed this problem? Thank you very much!

DRosemei commented 1 year ago

@weiyithu @GANWANSHUI @hhhharold, thanks for your work! I want to know how you use the GT poses instead of the predicted poses. My results are scale-ambiguous.
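Not the repository's code, but one common way to do this, assuming you can build a 4x4 camera-to-world matrix for each frame from the dataset's ego pose composed with the camera extrinsic: the relative transform below can be fed into the view-synthesis warp in place of the PoseNet output, and because it carries metric translation, the learned depths become scale-aware.

```python
import numpy as np

def relative_pose(T_world_cam_t, T_world_cam_s):
    """Pose of the source camera s expressed in the target camera t's frame.

    T_world_cam_*: [4, 4] camera-to-world matrices built from the dataset's
    ego pose and camera extrinsic for each frame. The returned T_t_s replaces
    the PoseNet prediction in the photometric warping.
    """
    return np.linalg.inv(T_world_cam_t) @ T_world_cam_s
```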