weiyithu / SurroundDepth

[CoRL 2022] SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation

Question about scale-aware model type nuScenes evaluation #15

Closed kyoyachuan closed 1 year ago

kyoyachuan commented 1 year ago

Hi, I tried to evaluate nuScenes validation-set performance with your released nusc_scale model. Since this model should be scale-aware, I expected the scale-aware evaluation results to be similar to the ones in README.md, which are:

| type | dataset | Abs Rel | Sq Rel | delta < 1.25 |
|---|---|---|---|---|
| scale-aware | nuScenes | 0.280 | 4.401 | 0.661 |

However, the results turned out worse than expected: the scale-ambiguous evaluation was actually better than the scale-aware one, even though the model is scale-aware. The relevant output is shown below:

Loading depth weights...
Loading encoder weights...
Training model named: nusc_scale
There are 20096 training items and 6019 validation items

median: 0.33512431383132935
-> Evaluating 1
scale-ambiguous evaluation:
front
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.262  &   2.961  &  10.989  &   0.398  &   0.527  &   0.791  &   0.889  \\
front_left
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.301  &   2.457  &   7.887  &   0.398  &   0.532  &   0.788  &   0.893  \\
back_left
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.289  &   2.230  &   7.050  &   0.386  &   0.578  &   0.799  &   0.895  \\
back
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.329  &   3.483  &  11.492  &   0.477  &   0.405  &   0.712  &   0.850  \\
back_right
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.298  &   2.418  &   7.371  &   0.405  &   0.556  &   0.789  &   0.887  \\
front_right
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.305  &   2.714  &   8.292  &   0.417  &   0.530  &   0.778  &   0.882  \\
all
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.297  &   2.711  &   8.847  &   0.413  &   0.522  &   0.776  &   0.883  \\
scale-aware evaluation:
front
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   1.976  &  42.443  &  22.277  &   1.090  &   0.048  &   0.107  &   0.194  \\
front_left
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   2.300  &  49.173  &  21.861  &   1.169  &   0.036  &   0.081  &   0.152  \\
back_left
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   2.392  &  51.237  &  21.708  &   1.186  &   0.029  &   0.070  &   0.135  \\
back
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   1.679  &  26.246  &  15.413  &   0.992  &   0.092  &   0.190  &   0.295  \\
back_right
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   2.569  &  61.392  &  22.064  &   1.214  &   0.031  &   0.075  &   0.143  \\
front_right
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   2.476  &  57.699  &  22.465  &   1.206  &   0.037  &   0.087  &   0.154  \\
all
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   2.232  &  48.032  &  20.965  &   1.143  &   0.046  &   0.102  &   0.179  \\

I exported the GT with tools/export_gt_depth_nusc.py on the val split and used configs/nusc_scale_pretrain.txt for evaluation (most of the config stayed the same, except that I changed min_depth to 0.5).

Is this reasonable, or am I using something incorrectly? Thank you.

weiyithu commented 1 year ago

Hi, I think something is wrong. For the scale-aware model, the 'median' should be around 1. Also, you should use nusc_scale.txt for evaluation; I wonder if that is what made the evaluation go wrong.
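For context, the printed 'median' is the ratio used by the median-scaling step of the scale-ambiguous evaluation (a minimal sketch below, assuming the usual Monodepth2-style protocol; `pred_depth` and `gt_depth` are placeholder names). For a correctly scaled model it should sit close to 1, whereas the 0.335 above suggests the predictions are roughly 3x too large:

```python
import numpy as np

def median_scale(pred_depth, gt_depth):
    """Scale-ambiguous evaluation: align predictions to GT by the median ratio.

    A ratio near 1 means the model already predicts metric scale;
    a ratio of ~0.335 means the predicted depths are about 3x too large.
    """
    ratio = np.median(gt_depth) / np.median(pred_depth)
    return pred_depth * ratio, ratio
```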

kyoyachuan commented 1 year ago

I figured out the main issue. The root cause is min_depth, which is originally set to 0.1, not 0.5. I thought min_depth and max_depth were only used for filtering the GT and clamping the predictions, but they are also used to recover the scale of the disparity, since the model's output is a sigmoid.

https://github.com/weiyithu/SurroundDepth/blob/22dfecfe8fca62a38d0f682ff7bf65b41aba3cac/runer.py#L359-L367
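For reference, here is a minimal sketch of the Monodepth2-style disp_to_depth conversion that the linked code follows (paraphrased for illustration, not copied verbatim); it shows why min_depth and max_depth rescale every prediction rather than only filtering the GT:

```python
def disp_to_depth(disp, min_depth, max_depth):
    """Map the network's sigmoid output (0..1) onto a metric depth range.

    min_depth / max_depth define the disparity range the sigmoid is
    stretched over, so changing them rescales all predicted depths.
    """
    min_disp = 1.0 / max_depth   # disparity of the farthest allowed point
    max_disp = 1.0 / min_depth   # disparity of the nearest allowed point
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    depth = 1.0 / scaled_disp
    return scaled_disp, depth

# E.g. with max_depth = 80, the same sigmoid output 0.5 gives
#   disp_to_depth(0.5, min_depth=0.1, max_depth=80)[1]  ->  ~0.20 m
#   disp_to_depth(0.5, min_depth=0.5, max_depth=80)[1]  ->  ~0.99 m
# so evaluating a model trained with min_depth=0.1 at min_depth=0.5
# makes all predicted depths larger and breaks the metric scale.
```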

After I changed it back to 0.1, the results were correct.

Loading depth weights...
Loading encoder weights...
Training model named: nusc_scale
There are 20096 training items and 6019 validation items

median: 1.1010782718658447
-> Evaluating 1
scale-ambiguous evaluation:
front
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.177  &   2.176  &   7.423  &   0.264  &   0.773  &   0.916  &   0.963  \\
front_left
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.282  &   3.206  &   7.126  &   0.348  &   0.656  &   0.837  &   0.914  \\
back_left
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.275  &   2.963  &   6.447  &   0.345  &   0.675  &   0.847  &   0.916  \\
back
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.219  &   2.330  &   7.382  &   0.310  &   0.706  &   0.884  &   0.947  \\
back_right
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.351  &   6.161  &   7.489  &   0.393  &   0.632  &   0.828  &   0.905  \\
front_right
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.324  &   5.660  &   7.809  &   0.383  &   0.646  &   0.835  &   0.911  \\
all
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.271  &   3.749  &   7.279  &   0.341  &   0.681  &   0.858  &   0.926  \\
scale-aware evaluation:
front
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.180  &   2.192  &   7.637  &   0.282  &   0.744  &   0.906  &   0.959  \\
front_left
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.261  &   2.751  &   7.099  &   0.371  &   0.646  &   0.832  &   0.908  \\
back_left
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.282  &   2.820  &   6.567  &   0.374  &   0.629  &   0.824  &   0.905  \\
back
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.213  &   2.343  &   7.667  &   0.326  &   0.692  &   0.872  &   0.941  \\
back_right
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.403  &   9.329  &   7.768  &   0.426  &   0.611  &   0.807  &   0.889  \\
front_right
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.339  &   6.971  &   8.062  &   0.407  &   0.643  &   0.823  &   0.898  \\
all
 abs_rel |   sq_rel |     rmse | rmse_log |       a1 |       a2 |       a3 | 
&   0.280  &   4.401  &   7.467  &   0.364  &   0.661  &   0.844  &   0.917  \\

I also tried hard-coding min_depth to 0.1 inside disp_to_depth while using a different setting in the config, and the results were also reasonable.

It seems that min_depth and max_depth are tightly coupled to your training setup, so I suggest mentioning this in the documentation. Thanks :)