zhyever / PatchFusion

[CVPR 2024] An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
https://zhyever.github.io/patchfusion/
MIT License

Several questions #22

Closed vitacon closed 3 months ago

vitacon commented 4 months ago

Hello @zhyever, I started experimenting with PatchFusion yesterday (Windows 10 + Anaconda) and there are still some things I don't understand.

  1. It seemed the import of the .env passed (the output was so long I could only see the end of it), but the demo still crashed. I had to (re?)install several packages by hand (torchvision, torch and timm), but that's not really a question. =)

  2. infer_user.py throws quite a lot of warnings, so the console becomes very messy:

    ./infer_user.py:934: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
    img = torch.tensor(images.rgb_image).unsqueeze(dim=0).permute(0, 3, 1, 2) # shape: 1, 3, h, w
    ./infer_user.py:386: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
    blur_mask = torch.tensor(blur_mask, device=whole_depth_pred.device)
    ./infer_user.py:393: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
    blur_mask = torch.tensor(blur_mask, device=whole_depth_pred.device)
Do you have any plans to address these in your code?
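
(For reference, the replacement the warning itself suggests is mechanical. Below is a minimal sketch with dummy stand-in tensors, not the repo's actual values; `torch.as_tensor` is an alternative that also avoids the warning:)

    import torch

    # Stand-ins for the tensors used in infer_user.py (shapes are arbitrary).
    whole_depth_pred = torch.zeros(1, 1, 4, 4)
    blur_mask = torch.rand(4, 4)  # already a torch.Tensor

    # Warned pattern: torch.tensor() on an existing tensor forces a copy
    # and triggers the UserWarning above.
    # blur_mask = torch.tensor(blur_mask, device=whole_depth_pred.device)

    # What the warning recommends when the input is already a tensor:
    blur_mask = blur_mask.clone().detach().to(whole_depth_pred.device)

    # torch.as_tensor also works, and accepts numpy arrays as well:
    blur_mask = torch.as_tensor(blur_mask, device=whole_depth_pred.device)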

  3. I am puzzled by the resolution information:

         Params passed to Resize transform:
             width: 512
             height: 384
             resize_target: True
             keep_aspect_ratio: False
             ensure_multiple_of: 32
             resize_method: minimal

         Current image resolution: (2160, 3840)
         Current crop size: (540, 960)
         build pretrained condition model from None
         img_size [384, 512]

     I understood that PatchFusion uses a 4K "workspace" (3840 x 2160), so 960 x 540 kind of makes sense. However, my images are 1490 x 1080, so the line saying "img_size [384, 512]" is quite confusing. What does it really mean?

  4. Your storage contains two models. I suppose "u4k" means "Ultra HD / 4K", but I can't figure out what "gta" means..?

  5. What is the point of saving the output in 4K when the input is smaller and has a different aspect ratio? I would expect to get the result at the same resolution as the input.

  6. I don't really need 4K resolution at the moment. Is there a way to force PatchFusion to work at 1920 x 1080 instead of 3840 x 2160 and possibly save some time?

Sorry for the length of this post. =/

zhyever commented 4 months ago

Hi @vitacon. Thank you so much for your interest in our project, and I'm sorry for the trouble you are facing. Let me try to explain one by one. I hope it will be helpful.

  1. I have to say that the environment used in this project also bothered me. Because it relies on many different repos, it's kind of messy.
  2. These warnings can certainly be addressed.
  3. Your understanding is correct. The crop size is 960x540 for 4K images, i.e., a 4x4 grid of 960x540 patches covers the 3840x2160 frame. Here, (384, 512) is the default processing resolution of our baseline depth model, ZoeDepth. That means that after cropping 960x540 patches, we resize them to (384, 512) and pass them through the base depth estimator. The predicted depth is then resized back to 960x540 for re-assembling (see the sketch after this list).
  4. Sorry for the confusion. "u4k" means the model was trained on the UnrealStereo4K dataset. "gta" denotes the model trained on the MVS-Synth dataset, which contains GTA-style images (https://phuang17.github.io/DeepMVS/mvs-synth.html).
  5. Thank you for the suggestion. Got it.
  6. That makes a lot of sense. For 2K images, you could split them into 2x2 patches covering the whole image, which would save time. However, it would be tough to modify the code right now, because there is a lot of hacky code...
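
To make item 3 concrete, here is a rough sketch of the per-patch flow described above; `base_depth_model` is a hypothetical stand-in for the ZoeDepth-based estimator, not PatchFusion's actual API:

    import torch
    import torch.nn.functional as F

    def predict_patch(patch_rgb: torch.Tensor, base_depth_model) -> torch.Tensor:
        """patch_rgb: (1, 3, 540, 960) crop taken from the 4K frame."""
        h, w = patch_rgb.shape[-2:]  # 540, 960
        # Resize the crop to ZoeDepth's default processing resolution (H=384, W=512).
        x = F.interpolate(patch_rgb, size=(384, 512), mode="bilinear", align_corners=False)
        depth = base_depth_model(x)  # expected shape: (1, 1, 384, 512)
        # Resize the prediction back to the crop size for re-assembling.
        return F.interpolate(depth, size=(h, w), mode="bilinear", align_corners=False)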

Here is some information that might be useful: we are doing some follow-up work based on PatchFusion, and we have entirely reconstructed the current code base. We are using torch 2.0+, so the environment will be more friendly. After the rebuild, the logging is also better. The model will also support various numbers of split patches (like using 2x2 patches for a 2K image, or other more user-friendly settings). It will be released soon.
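
As an illustration of what a variable patch split means (a sketch of the tiling arithmetic, not the repo's actual implementation): a 2x2 grid over a 1920x1080 image yields the same 960x540 crop size that the default 4x4 grid produces on 3840x2160.

    def tile_boxes(height: int, width: int, rows: int, cols: int):
        """Return (top, left, bottom, right) boxes for a rows x cols grid."""
        ph, pw = height // rows, width // cols
        return [(r * ph, c * pw, (r + 1) * ph, (c + 1) * pw)
                for r in range(rows) for c in range(cols)]

    # 2x2 grid over 1920x1080 -> four 960x540 patches, matching the
    # 960x540 crops of the 4x4 grid on 3840x2160.
    print(tile_boxes(1080, 1920, 2, 2))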

vitacon commented 4 months ago

Thanks @zhyever!

Just a few additional notes:

  1. Actually, "Grand Theft Auto" was the only meaning I could think of, but (not knowing the dataset) it did not make any sense to me. =} I know that writing documentation is quite annoying, but I think the two models should be explained in readme.md. By the way, did you perform any numeric comparison of their outputs?

  2. Hm, I suppose I could use ImageMagick to merge 4 images together (2x2), feed the result to PatchFusion, and then split the output back into 4 images. Or learn a bit of Python syntax and do the same thing directly in infer_user.py. However, if the new version is going to be released soon, I think I will just wait for it. =)
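
For anyone who wants to try that workaround before the rewrite lands, here is a rough Python sketch of the 2x2 merge/split idea using Pillow; the filenames and tile size are hypothetical:

    from PIL import Image

    def merge_2x2(paths, tile_w=1920, tile_h=1080):
        """Paste four equally sized images into one 2x2 mosaic."""
        mosaic = Image.new("RGB", (2 * tile_w, 2 * tile_h))
        for i, path in enumerate(paths):
            img = Image.open(path).resize((tile_w, tile_h))
            mosaic.paste(img, ((i % 2) * tile_w, (i // 2) * tile_h))
        return mosaic

    def split_2x2(mosaic):
        """Split a 2x2 mosaic back into four equal tiles."""
        w, h = mosaic.size
        tw, th = w // 2, h // 2
        return [mosaic.crop(((i % 2) * tw, (i // 2) * th,
                             (i % 2 + 1) * tw, (i // 2 + 1) * th))
                for i in range(4)]

    # Example: merge_2x2(["a.png", "b.png", "c.png", "d.png"]).save("mosaic.png")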

zhyever commented 4 months ago

Thanks.

Ideally, the new release will be out at the end of March. Thanks for your advice and patience.

vitacon commented 4 months ago

Thanks again @zhyever!

noobtoob4lyfe commented 4 months ago

"It makes a lot of sense. As for 2k images, you can split them to 2x2 patches covered the whole image, which will save time. However, it would be tough to modify the codes right now, because there are many hacking codes..."

Where would this be done? Is there an argument for this?

zhyever commented 4 months ago

@vitacon @noobtoob4lyfe Please check the current version of the inference instructions. All of the discussed items are supported now.

noobtoob4lyfe commented 4 months ago

@vitacon I'm not seeing the argument for changing the number of patches to 2x2.

zhyever commented 4 months ago

Please check this example usage in the readme.