ygjwd12345 / TransDepth

Code for Transformers Solve Limited Receptive Field for Monocular Depth Prediction
MIT License

Performance gap between baseline method: BTS #13

Closed zhyever closed 2 years ago

zhyever commented 2 years ago

Hi, thanks for the great work. While reading your paper, I found the sentence "We choose the ResNet-50 with the same prediction head as our baseline", but the paper does not describe the design of the decoder head, so I came to GitHub to figure it out. It looks like your method is based on BTS and uses its decoder:

https://github.com/ygjwd12345/TransDepth/blob/3ae116f045243f24c72a4fc558634d0cf823fd1b/pytorch/bts.py#L347

So I take "We choose the ResNet-50 with the same prediction head as our baseline" to mean that you replace the BTS encoder with ResNet-50 and keep all other settings the same. I recently reproduced BTS with its official code, so I am fairly familiar with its quantitative results. Your baseline result on the NYU dataset is similar to the one reported in BTS, but on KITTI your baseline result is much worse than the one reported in BTS, as follows:

NYU (Abs Rel, RMSE, a1, a2, a3):
Your report: 0.118 0.414 0.866 0.979 0.995 (TransDepth, Table 2, Baseline)
BTS report: 0.119 0.419 0.865 0.975 0.993 (BTS, Table 5, ResNet-50)

KITTI (Abs Rel, RMSE, a1, a2, a3):
Your report: 0.106 3.981 0.888 0.967 0.986 (TransDepth, Table 1, Baseline)
BTS report: 0.061 2.803 0.954 0.992 0.998 (BTS, Table 6, ResNet-50)

May I ask whether I have misunderstood something, or whether you used a different setting from BTS?
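For readers comparing the numbers above, here is a minimal sketch of how the five metrics are commonly defined in the monocular depth literature. It is not taken from the TransDepth or BTS code. The function name `depth_metrics` is hypothetical, and the sketch leaves out dataset-specific steps such as depth range limits and evaluation crops, which can also shift the reported values.

```python
import math


def depth_metrics(pred, gt):
    """Compute Abs Rel, RMSE and threshold accuracies a1/a2/a3.

    pred, gt: flat sequences of predicted and ground-truth depths.
    Pixels with non-positive depth are treated as invalid and skipped
    (an assumption of this sketch).
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    n = len(pairs)

    # Mean absolute error relative to the ground-truth depth.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Root mean squared error in depth units.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Fraction of pixels with max(pred/gt, gt/pred) below 1.25, 1.25^2, 1.25^3.
    ratios = [max(p / g, g / p) for p, g in pairs]
    a1, a2, a3 = (sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3))

    return {"abs_rel": abs_rel, "rmse": rmse, "a1": a1, "a2": a2, "a3": a3}
```

Lower is better for Abs Rel and RMSE, and higher is better for a1, a2, and a3, which is why the KITTI baseline row reads as worse on every column.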

ygjwd12345 commented 2 years ago

Our BTS decoder is: self.decoder = bts(params, [64, 128, 256, 512, 1024], params.bts_size)

zhyever commented 2 years ago

> Our BTS decoder is: self.decoder = bts(params, [64, 128, 256, 512, 1024], params.bts_size)

I wonder why the ResNet-50 outputs have 64, 128, 256, 512, 1024 channels. The usual set is 64, 256, 512, 1024, 2048. I was only looking at your code at TransDepth/pytorch/bts.py, line 347, so does that mean TransDepth is based on a ResNet-50 whose outputs have 64, 256, 512, 1024, 2048 channels and, as you say, the baseline is based on a ResNet-50 whose outputs have 64, 128, 256, 512, 1024 channels?
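To make the "standard" channel list concrete, here is a small sketch of how feature widths follow from the standard ResNet design: a 64-channel stem, then four stages with base widths 64, 128, 256, 512, each multiplied by the block's expansion factor. The helper `resnet_feature_channels` is hypothetical and does not come from the TransDepth or BTS code.

```python
# Per-stage base widths of the standard ResNet design.
BASE_WIDTHS = [64, 128, 256, 512]


def resnet_feature_channels(block_expansion, stem_channels=64):
    """Channel counts of the stem output followed by the four stage outputs.

    block_expansion is 4 for bottleneck blocks (ResNet-50/101/152)
    and 1 for basic blocks (ResNet-18/34).
    """
    return [stem_channels] + [w * block_expansion for w in BASE_WIDTHS]


bottleneck_channels = resnet_feature_channels(4)  # [64, 256, 512, 1024, 2048]
basic_channels = resnet_feature_channels(1)       # [64, 64, 128, 256, 512]
```

Neither standard list equals [64, 128, 256, 512, 1024], so that list would imply an encoder variant or a choice of tapped layers that differs from the standard layout. The thread does not say which.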

zhyever commented 2 years ago

Sorry to bother you. I understand the reason now! Thank you very much for explaining in detail.

alanyannick commented 2 years ago

> Sorry to bother you. I understand the reason now! Thank you very much for explaining in detail.

Hi zhyever, I have the same question: what is the reason? It also seems that the decoder actually uses the channel sizes [64, 256, 512, 1024, 2048] rather than [64, 128, 256, 512, 1024].

zhyever commented 2 years ago

>> Sorry to bother you. I understand the reason now! Thank you very much for explaining in detail.

> Hi zhyever, I have the same question: what is the reason? It also seems that the decoder actually uses the channel sizes [64, 256, 512, 1024, 2048] rather than [64, 128, 256, 512, 1024].

I sent you an email. Sorry for the slow reply.

ygjwd12345 commented 2 years ago

>> Sorry to bother you. I understand the reason now! Thank you very much for explaining in detail.

> Hi zhyever, I have the same question: what is the reason? It also seems that the decoder actually uses the channel sizes [64, 256, 512, 1024, 2048] rather than [64, 128, 256, 512, 1024].

To match the output sizes of our ResNet-50, we changed the decoder's channel parameter.
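As an illustration of why the decoder's channel list has to follow the encoder, the sketch below compares two channel lists level by level. The helper `channel_mismatches` is hypothetical and not part of the repository. The two lists are the ones quoted in this thread, and which one each model actually uses is not confirmed here.

```python
def channel_mismatches(encoder_channels, decoder_channels):
    """Return (level, encoder_channels, decoder_channels) for every level that differs."""
    if len(encoder_channels) != len(decoder_channels):
        raise ValueError("encoder and decoder must list the same number of feature levels")
    return [
        (level, enc, dec)
        for level, (enc, dec) in enumerate(zip(encoder_channels, decoder_channels))
        if enc != dec
    ]


standard_resnet50 = [64, 256, 512, 1024, 2048]  # standard bottleneck ResNet widths
quoted_decoder = [64, 128, 256, 512, 1024]      # list passed to bts(...) in the reply above

mismatches = channel_mismatches(standard_resnet50, quoted_decoder)
```

With these two lists every level except the first differs, so a decoder built with one list could not consume skip features from an encoder that produces the other. A check like this, run once at model construction, would catch such a mismatch before training starts.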