zhyever / PatchFusion

[CVPR 2024] An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
https://zhyever.github.io/patchfusion/
MIT License
958 stars, 64 forks

About the sky and ceiling area problem #24

Closed: javierztl closed this 6 months ago

javierztl commented 6 months ago

@zhyever Thank you very much for your great work, which greatly improves depth detail. In your previous version, which used ZoeDepth as the coarse model, I found that sky and ceiling areas had blurry or unreasonable depth, so I started training PatchFusion with DepthAnything myself. You finished that first, so yesterday I tried your latest model (DepthAnything ViTL). It has the same problem, although the coarse model's result looks good. The pictures are shown below (left: PatchFusion result; right: PatchFusion coarse model result). I see that you use a mask to exclude the farthest areas from the loss. My suggestion is instead to clamp depth values that exceed some threshold to the threshold value and include those pixels in the loss.

[Image: 20240319-135919]
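To make the difference between the two strategies concrete, here is a minimal NumPy sketch. It is not PatchFusion's training code: the plain L1 loss, the function names `masked_l1` and `clamped_l1`, and the convention that a ground-truth value of 0 means "no label" are all assumptions made for illustration. It also assumes far regions such as sky carry very large (or infinite) ground-truth values, which depends on the dataset.

```python
import numpy as np

def masked_l1(pred, gt, max_depth):
    """Mean L1 over labeled, in-range pixels; pixels beyond max_depth are ignored."""
    mask = (gt > 0) & (gt <= max_depth)
    if not mask.any():
        return 0.0
    return float(np.abs(pred[mask] - gt[mask]).mean())

def clamped_l1(pred, gt, max_depth):
    """Mean L1 over all labeled pixels, with far ground truth clamped to max_depth."""
    valid = gt > 0  # assumption: 0 marks pixels without a label
    if not valid.any():
        return 0.0
    gt_clamped = np.minimum(gt, max_depth)
    return float(np.abs(pred[valid] - gt_clamped[valid]).mean())

# Toy 2x2 example: one far pixel (200 m) and one unlabeled pixel (0).
gt = np.array([[10.0, 200.0], [5.0, 0.0]])
pred = np.array([[12.0, 30.0], [5.0, 7.0]])

print(masked_l1(pred, gt, max_depth=80.0))   # far pixel contributes nothing
print(clamped_l1(pred, gt, max_depth=80.0))  # far pixel is pulled toward 80
```

With masking, the prediction at the far pixel is never penalized, so the network is free to output anything there. With clamping, the same pixel is supervised toward `max_depth`.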

noobtoob4lyfe commented 6 months ago

I came here to ask about the same issue. The coarse Depth Anything image consistently does a better job of handling the sky. Can this be improved at all?

[Image]

zhyever commented 6 months ago

Hi. Thanks for the comments. As far as I can see, using a mask to ignore out-of-range pixels is a general strategy in metric depth estimation, and it can be seen in benchmarks such as KITTI. Almost all methods do not handle the sky area, which leads to randomly filled values there.

From my point of view, truncating values larger than a threshold would hurt model performance: it changes the depth distribution and also introduces some geometric noise. I suggest adding a simple binary head for sky segmentation instead. Specifically, we can extend the current output from 1 channel (depth only) to 2 channels (depth + segmentation probability). However, this would require a segmentation map for the sky area.
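A rough NumPy sketch of what such a two-channel output could look like is given below. This is only an illustration of the idea, not code from PatchFusion: the channel layout (channel 0 = depth, channel 1 = sky logit), the L1 depth term, the binary cross-entropy term, and the names `two_head_loss`, `fill_sky`, and `seg_weight` are all hypothetical. The `sky_mask` input is the extra sky segmentation label that this approach requires.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def two_head_loss(output, gt_depth, sky_mask, seg_weight=1.0):
    """Depth L1 on labeled non-sky pixels plus binary cross-entropy on the sky channel.

    output:   float array (2, H, W); channel 0 is depth, channel 1 is a sky logit
    gt_depth: float array (H, W); 0 is assumed to mean "no label"
    sky_mask: bool array (H, W); True where the pixel is sky
    """
    depth, logit = output[0], output[1]
    valid = (~sky_mask) & (gt_depth > 0)
    depth_loss = np.abs(depth[valid] - gt_depth[valid]).mean() if valid.any() else 0.0
    # Clip probabilities so the logarithms below stay finite.
    prob = np.clip(sigmoid(logit), 1e-7, 1.0 - 1e-7)
    target = sky_mask.astype(np.float64)
    bce = -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob)).mean()
    return float(depth_loss + seg_weight * bce)

def fill_sky(output, max_depth, threshold=0.5):
    """At inference, overwrite predicted-sky pixels with a fixed far value."""
    depth, logit = output[0], output[1]
    return np.where(sigmoid(logit) > threshold, max_depth, depth)

# Toy 1x2 example: left pixel is an object at 10 m, right pixel is sky.
output = np.array([[[10.0, 3.0]], [[-5.0, 5.0]]])
gt_depth = np.array([[10.0, 0.0]])
sky_mask = np.array([[False, True]])

print(two_head_loss(output, gt_depth, sky_mask))
print(fill_sky(output, max_depth=80.0))
```

The depth channel is still trained only on in-range pixels, so the depth distribution is left unchanged, while the segmentation channel decides where a fixed far value should replace the otherwise arbitrary sky prediction.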

javierztl commented 6 months ago

Yes, my method would lead to worse performance but a better visual impression in relative depth. Thank you for your suggestion; adding one more output head is a good idea.