zhyever / PatchFusion

[CVPR 2024] An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
https://zhyever.github.io/patchfusion/
MIT License
958 stars · 64 forks

custom coarse depth input #2

Closed · kwea123 closed this issue 10 months ago

kwea123 commented 10 months ago

Does it support a custom depth map as the base input? E.g., if I have an image and a coarse depth map, how can I get the high-res depth from them?

zhyever commented 10 months ago

In this case, you need to train your own PatchFusion model by simply replacing ZoeDepth with your custom depth model. While we have released the training code, we need some time to organize the docs for training.

mr-lab commented 10 months ago

If you want to feed your own depth maps, for example in the R mode at line 944, to read your depth maps from a folder (use the same file-name structure), change it to:

```python
pred_depth, count_map = regular_tile(model, img, offset_x=0, offset_y=0, img_lr=img_lr)
avg_depth_map = RunningAverageMap(pred_depth, count_map)
avg_depth_map = regular_tile(model, img, offset_x=0, offset_y=0, img_lr=img_lr)
```

Or simply at line 946:

```python
avg_depth_map = regular_tile(model, img, offset_x=0, offset_y=0, img_lr=img_lr, iter_pred=your_depth_map_here)
```

The code has the ability to take other depth maps as input, but this is a ZoeDepth-based codebase, so if you feed MiDaS or any other predicted depth you will get really bad results, since they estimate depth differently. In that case you need to train your own PatchFusion model, but if your depth maps are similar to this project's you can use the code. I'm just reading the code, I have not tested any of this. It's similar to https://github.com/BillFSmith/TilingZoeDepth
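To make the suggestion above a bit more concrete, here is a minimal, untested sketch of feeding a pre-computed coarse depth map into regular_tile via iter_pred. The regular_tile call and its arguments are taken from the snippet quoted above (model, img, and img_lr come from the infer script); the file path, the loading/conversion steps, and the expected tensor shape are assumptions and may need adjusting.

```python
import cv2
import numpy as np
import torch

# Hypothetical path: load your own coarse depth map
# (use the same file-name structure as the RGB input, per the comment above).
depth = cv2.imread("./custom_depth/your_image.png", cv2.IMREAD_UNCHANGED).astype(np.float32)

# Assumed shape/device: a (1, 1, H, W) float tensor on the same device as the model.
custom_depth = torch.from_numpy(depth)[None, None].cuda()

# Feed it as the initial prediction instead of the ZoeDepth coarse output (cf. line 946).
avg_depth_map = regular_tile(model, img, offset_x=0, offset_y=0, img_lr=img_lr,
                             iter_pred=custom_depth)
```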

But so much work was put into this, I mean the number of tiles is far greater, and the U4K model is really good. BTW, some predictions from non-3D / non-realistic inputs, like a painting, are very blurry; I think this is due to the U4K model being trained on 3D data. It could be fixed if we use ZoeDepth as a base first and then run the U4K model, or the other way around. The code has all the possibilities, it's just one # away. Thank you zhyever, this will be put to really good use <3<3. zhyever, please add the # option in the code, those are really good extra options!!!

zhyever commented 10 months ago

Thanks for your advice and interest in our work! Let me post some explanation of this case.

Consider an HR image I and one of its tiled patches A.

As for TilingZoeDepth, the pipeline is like:

coarse_A = crop(zoe_pipe(I))
fine_A = zoe_pipe(crop(I))

where zoe_pipe is a combination of resizing and a ZoeDepth forward pass. Then, the author uses some post-optimization strategies to combine coarse_A and fine_A.
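In rough Python pseudocode (zoe_pipe, crop, and post_optimize are schematic placeholders mirroring the description above, not actual functions from either repo), the two predictions differ only in whether cropping happens before or after the network:

```python
def tiling_zoe_depth_patch(I, patch_box, zoe_pipe, crop, post_optimize):
    # Predict-then-crop: depth for the whole (resized) image, then take the patch region.
    coarse_A = crop(zoe_pipe(I), patch_box)
    # Crop-then-predict: run the network on the high-res patch alone.
    fine_A = zoe_pipe(crop(I, patch_box))
    # TilingZoeDepth then blends the two with post-optimization heuristics.
    return post_optimize(coarse_A, fine_A)
```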

As for PatchFusion, the pipeline is like:

coarse_A = crop(coarse_zoe_pipe(I))
fine_A = fine_zoe_pipe(crop(I))

fused_A = fusion_net(crop(I), coarse_A, fine_A, ...), where ... denotes some guidance features from coarse_zoe and fine_zoe

Finally, we can just stitch all the fused_A together without post-optimization strategies. As you can see, while we also have a coarse_A and a fine_A, they are not explicitly involved in the final prediction.
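And the PatchFusion side in the same schematic style (coarse_zoe_pipe, fine_zoe_pipe, fusion_net, and the guidance features are again placeholders following the description above, not the repo's actual function signatures):

```python
def patchfusion_patch(I, patch_box, coarse_zoe_pipe, fine_zoe_pipe, fusion_net, crop):
    # Coarse branch: whole-image prediction plus guidance features, cropped to the patch.
    coarse_pred, coarse_feats = coarse_zoe_pipe(I)
    coarse_A = crop(coarse_pred, patch_box)
    # Fine branch: prediction plus guidance features on the cropped high-res patch.
    fine_A, fine_feats = fine_zoe_pipe(crop(I, patch_box))
    # The fusion network outputs the final patch depth directly; coarse_A and fine_A
    # are inputs to it rather than being blended into the output afterwards.
    fused_A = fusion_net(crop(I, patch_box), coarse_A, fine_A, coarse_feats, fine_feats)
    return fused_A  # fused patches are stitched together without post-optimization
```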

However, your suggestion is still meaningful! For the best visualization quality for users, we are considering adding a blur_mask for predictions; coarse predictions could then be used to fill in the masked areas on the fly.
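For illustration only, a minimal sketch of that fill-in idea, assuming a boolean blur_mask that flags unreliable fused pixels; this feature is only being considered, and nothing here is in the repo:

```python
import torch

def fill_masked_with_coarse(fused_depth, coarse_depth, blur_mask):
    # Where the mask flags a blurry/unreliable fused prediction, fall back to the
    # coarse whole-image prediction; keep the fused value everywhere else.
    return torch.where(blur_mask, coarse_depth, fused_depth)
```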

kwea123 commented 10 months ago

Thank you, so this means that theoretically we don't necessarily need to train fusion_net on our own data? I would like to see how well it generalizes zero-shot when provided with our own depth inputs. I don't have the bandwidth to try it now; if someone has results, please kindly share them here :)

mr-lab commented 9 months ago

> Thanks for your advice and interest in our work! Let me post some explanation of this case. […]

It's very insightful, and thank you for the explanation. Your method is different, but I can't help wondering whether this can be made even better. Below are our results using something similar, but with a different model and BillFSmith's masking logic:

Source image (RGB): https://imgur.com/u4xEuD5
Our result: https://imgur.com/CV8RGNs (you can see there is a flaw with the nose)
PatchFusion result: https://imgur.com/Zs64i3s

Now that I'm starting to get familiar with how this works, I will work on implementing BillFSmith's masking logic in your project, as I'm sure your U4K model will give far better results. I think the blur_mask is the Achilles' heel here; this is without it: https://imgur.com/YQqLUTk