zhyever / PatchFusion

[CVPR 2024] An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
https://zhyever.github.io/patchfusion/
MIT License

about training on my own depth maps #23

Closed: wowkie83 closed this issue 6 months ago

wowkie83 commented 6 months ago

Thank you for this awesome work!

I was wondering if you could provide some documentation for training/finetuning. I am really looking forward to training this network on my own depth dataset, which I have already trained ZoeDepth on, and I am pretty sure it would improve my results!

zhyever commented 6 months ago

Thanks for your interest. I'm currently refactoring this codebase. As far as I can see (from the community), docs are somewhat more important than code alone :>. I will include instructions for training/finetuning this time. I would be happy to hear any further suggestions or feature requests, like https://github.com/zhyever/PatchFusion/issues/22, https://github.com/zhyever/PatchFusion/issues/8, and https://github.com/zhyever/PatchFusion/issues/16.

wowkie83 commented 6 months ago

Thank you so much for your answer! All I need is the instructions for training/fine-tuning :D If you could also explain training/finetuning on input images of other sizes (e.g., my input images and depth maps are 1680×1200), that would be even better! I am unsure whether different image sizes will affect the division of patches and the effectiveness of training. In addition, for the case where the inputs are grayscale images rather than RGB images, do you have any suggestions on preprocessing or training?

zhyever commented 6 months ago

So first, how to decide a proper patch size? That is an interesting question. From my view, BoostingDepth adopts some heuristic strategies for this, but it targets relative depth. For metric depth, it is hard to use varying patch sizes (it could become possible if you adopt newer techniques like Metric3D; for why it is difficult, I recommend my paper STMono3D, which explains the role of camera parameters and solutions like Metric3D), so we use a fixed patch size as an initial baseline.

So, how to decide? It highly depends on your dataset! For some indoor environments, a larger patch size is required because the model needs more context to predict reasonable depth; for outdoor environments, a smaller one works well. A simple strategy is to try several patch sizes and pick the best one for your dataset.

Of course, changing the patch size changes the division of patches. This will be supported in the new codebase; the sketch below illustrates the idea.
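To make the fixed-size division concrete, here is a minimal sketch (illustrative only, not the actual PatchFusion code; the function name, the 50% overlap, and the 672×480 patch size are assumptions) of how an image such as the 1680×1200 one above can be tiled into fixed-size patches:

def compute_patch_grid(img_h, img_w, patch_h, patch_w, overlap=0.5):
    # Place patches on a regular grid with the given overlap ratio; the last
    # row/column is shifted back so every patch stays inside the image.
    assert patch_h <= img_h and patch_w <= img_w
    stride_h = max(1, int(patch_h * (1 - overlap)))
    stride_w = max(1, int(patch_w * (1 - overlap)))
    ys = list(range(0, img_h - patch_h + 1, stride_h))
    xs = list(range(0, img_w - patch_w + 1, stride_w))
    if ys[-1] != img_h - patch_h:
        ys.append(img_h - patch_h)  # cover the bottom edge
    if xs[-1] != img_w - patch_w:
        xs.append(img_w - patch_w)  # cover the right edge
    return [(y, x) for y in ys for x in xs]

# e.g. a 1680x1200 image with hypothetical 672x480 patches and 50% overlap
corners = compute_patch_grid(1200, 1680, 480, 672)
print(len(corners), "patch corners")  # 16 patches covering the image

Changing patch_h/patch_w changes both the number of patches and how they overlap, which is why the division has to be recomputed for every patch size you try.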

For grayscale images, you can convert them to RGB using common tools like OpenCV. Something like:

import cv2
import os.path as osp

image = cv2.imread(osp.join(self.data_root, file), cv2.IMREAD_GRAYSCALE)  # single channel
image = cv2.cvtColor(image, cv2.COLOR_GRAY2RGB)  # replicate to three channels
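This simply replicates the single channel three times, so a backbone pretrained on RGB inputs can be used without changing its 3-channel input layer; treat it as a reasonable starting point rather than a guaranteed best practice.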
wowkie83 commented 6 months ago

Thank you so much for your patient answers! Looking forward to the new release :D