zxcqlf / MonoViT

Self-supervised monocular depth estimation with a vision transformer

Training from scratch custom data #10

Open jordigc2 opened 1 year ago

jordigc2 commented 1 year ago

Hello Zhaocq,

First of all, thank you very much for sharing your amazing work. I managed to integrate the model into the Monodepth training pipeline and train it using the pre-trained weights that you provided. Even so, the results were not great: for far objects/scenes the model did not keep a smooth disparity or interpret the scene properly. The top-right image corresponds to the ZED2 output depth. [image]

My goal is to train it from "scratch" (starting from the ImageNet pre-trained weights), because I would like to add a semi-supervised term to the loss function using GT depth (LiDAR or ZED2 depth), so that I can obtain metric depth and pose and keep consistency for far elements of the image. Inspired by this paper: https://arxiv.org/pdf/1910.01765.pdf
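For reference, a minimal sketch of what such a supervised term could look like on top of the self-supervised loss, assuming a sparse metric GT depth map with zeros at invalid pixels (the function name and weight here are illustrative, not part of the MonoViT code):

```python
import torch

def supervised_depth_loss(pred_depth, gt_depth, weight=0.1):
    """Hypothetical L1 term on pixels where GT (LiDAR / ZED2) depth is valid.

    pred_depth, gt_depth: [B, 1, H, W] tensors in metres; invalid GT pixels are 0.
    """
    mask = gt_depth > 0  # sparse LiDAR / stereo depth is only valid on some pixels
    if mask.sum() == 0:
        return torch.zeros((), device=pred_depth.device)
    return weight * torch.abs(pred_depth[mask] - gt_depth[mask]).mean()

# total_loss = photometric_loss + smoothness_loss + supervised_depth_loss(depth, gt_depth)
```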

Here I attach some sample images that I am using for training:

[sample images]

I am using the same parameters that you mention in the experiments section, where you explain how you trained from scratch using the ImageNet pre-trained models, combined with the information provided in the Monodepth and Monodepth2 papers. [image]
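For context, the two learning rates mentioned are usually applied as separate parameter groups; a rough sketch of such a setup is below. The group split, optimizer and schedule are assumptions, and the values shown are the reduced ones tried in the follow-up comment further down, so please check everything against the actual MonoViT training options:

```python
import torch.optim as optim

def build_optimizer(depth_encoder, depth_decoder, pose_net):
    # Assumed split: a lower LR for the pretrained ViT encoder, a higher one for
    # the depth decoder and pose network.
    param_groups = [
        {"params": depth_encoder.parameters(), "lr": 5e-5},
        {"params": list(depth_decoder.parameters()) + list(pose_net.parameters()),
         "lr": 1e-4},
    ]
    optimizer = optim.AdamW(param_groups, weight_decay=1e-2)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)
    return optimizer, scheduler
```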

When training starts, the output for the first mini-batch looks like this: [image]

However, the outputs for the following mini-batches of the same epoch look like this, which suggests the network is not training properly: [image] [image] [image]

I am using the ZED2 camera, which has a 12 cm baseline between the lenses, and the images are rectified before being fed to the network.
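If the stereo pair goes through a Monodepth2-style dataset class, the baseline enters only through the fixed `stereo_T` translation, which Monodepth2 hard-codes to 0.1 m for KITTI. A sketch of how a 12 cm baseline could be plugged in (the helper function and its exact usage are illustrative, not part of the existing code):

```python
import numpy as np
import torch

def make_stereo_T(side, do_flip, baseline_m=0.12):
    """Relative pose between the two stereo views, Monodepth2-style.

    Monodepth2 hard-codes a 0.1 m translation for KITTI in its dataset class;
    for the ZED2 the real 12 cm baseline can be used instead (or keep 0.1 and
    rescale the predicted depth afterwards).
    """
    stereo_T = np.eye(4, dtype=np.float32)
    baseline_sign = -1 if do_flip else 1  # horizontal-flip augmentation flips the sign
    side_sign = -1 if side == "l" else 1  # which camera the current sample comes from
    stereo_T[0, 3] = side_sign * baseline_sign * baseline_m
    return torch.from_numpy(stereo_T)
```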

With this information, would you be able to tell what I might be missing from the papers and the provided code that is preventing the model from training?

Thank you very much for your time.

jordigc2 commented 1 year ago

If I reduce the learning rates to 1e-4 and 5e-5 respectively (down from the ones mentioned in the screenshot of the paper), the model does not saturate, but when warping the images from left to right or vice versa I get this distortion at the edges.

Warped image from left to right: [image]
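For what it's worth, some stretching at the borders is expected with the usual inverse-warping step: reprojected coordinates that fall outside the source image are clamped to the border by `grid_sample`. A minimal sketch of that step, assuming a Monodepth2-style pipeline:

```python
import torch.nn.functional as F

def warp_to_target(source_img, pix_coords):
    """Sample the source view at the reprojected pixel coordinates.

    pix_coords: [B, H, W, 2] grid normalised to [-1, 1], as produced by the
    backproject/project modules in a Monodepth2-style pipeline. Coordinates
    that land outside the source image are clamped to the border, which is
    what produces the stretched band along the image edges.
    """
    return F.grid_sample(source_img, pix_coords,
                         padding_mode="border", align_corners=False)
```

Those border pixels are normally tolerated rather than fixed, since the per-pixel minimum reprojection loss tends to score them against a different source view.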

Depth image after some mini-batches of the initial epochs: [image]

After 14 epochs the depth image does not improve much over the previous one; it gets closer to a saturated depth image and the loss becomes constant.

Input and estimated depth: [image] [image]

Loss after 14 epochs: [image]

Thank you again for your time.

zxcqlf commented 1 year ago

So you mean you only use the left-right images for self-supervised training, not the sequential images? For the warped images, that is reasonable, because there are occlusions between the left and right images. For the depth, you know that self-supervised training is built on photometric consistency; those shadow regions affect the texture correspondence between images, resulting in wrong depth inference results!
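As a concrete illustration of how such occlusions are usually handled, Monodepth2 takes the per-pixel minimum of the reprojection errors over all source views (frames -1/+1 and the stereo partner), so an occluded pixel is scored against a view that actually sees it. A condensed sketch, assuming the per-view errors are already computed:

```python
import torch

def min_reprojection_loss(reproj_errors):
    """reproj_errors: list of [B, 1, H, W] photometric errors, one per source
    view (e.g. frames -1, +1 and the stereo partner).

    Taking the per-pixel minimum means a pixel occluded in one source view is
    scored against another view in which it is visible, as in Monodepth2.
    """
    stacked = torch.cat(reproj_errors, dim=1)        # [B, num_views, H, W]
    min_error, _ = torch.min(stacked, dim=1, keepdim=True)
    return min_error.mean()
```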

jordigc2 commented 1 year ago

Hello, I am also using the sequential images. I know that there are occlusions due to the stereo baseline, but it is only 12 cm, and the drift in the warped images is really big and never decreases when training from scratch. However, when using the pre-trained models you shared, this drift is smaller.

So shadows are probably playing an important role in preventing the model from learning? I will investigate how to mitigate the shadow noise before computing the photometric consistency. Do you know where I could start investigating this, perhaps something you considered at some point?

But the shadows in frames -1 and +1 do not change much compared with frame 0, so I do not understand how the shadows can affect the texture correspondence between images.
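One place to start on the shadow/illumination question is to confirm that the SSIM-weighted photometric error from Monodepth/Monodepth2 is active in the ported pipeline, since it compares local structure rather than raw intensities and is somewhat more tolerant of lighting changes than a pure L1 difference. A minimal sketch of that error term (window size and weighting follow the usual Monodepth2 defaults, but treat the details as an approximation of the original implementation):

```python
import torch
import torch.nn.functional as F

def photometric_error(pred, target, alpha=0.85):
    """SSIM + L1 photometric error in the style of Monodepth/Monodepth2.

    The SSIM part compares local statistics (means/variances over a 3x3
    window), which makes it less sensitive to lighting and shadow changes
    than a raw intensity difference.
    """
    l1 = torch.abs(pred - target).mean(1, keepdim=True)

    mu_p = F.avg_pool2d(pred, 3, 1, 1)
    mu_t = F.avg_pool2d(target, 3, 1, 1)
    sigma_p = F.avg_pool2d(pred ** 2, 3, 1, 1) - mu_p ** 2
    sigma_t = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_t ** 2
    sigma_pt = F.avg_pool2d(pred * target, 3, 1, 1) - mu_p * mu_t

    C1, C2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_p * mu_t + C1) * (2 * sigma_pt + C2)) / \
           ((mu_p ** 2 + mu_t ** 2 + C1) * (sigma_p + sigma_t + C2))
    ssim_err = torch.clamp((1 - ssim) / 2, 0, 1).mean(1, keepdim=True)

    return alpha * ssim_err + (1 - alpha) * l1
```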

Thank you very much. Jordi