zxcqlf / MonoViT

Self-supervised monocular depth estimation with a vision transformer
MIT License

Questions about the size of feature maps #23

Open Shaw-Way opened 10 months ago

Shaw-Way commented 10 months ago

Hello, author, thanks for your remarkable work. I noticed that you changed the stride of the second conv block in the stem from 2 to 1, so the stem outputs an H/2 × W/2 feature map. After the first "Joint CNN & Transformer Layer", the feature map is downsampled by a factor of 2 again, to H/4 × W/4. But according to the MPViT paper, the first "Joint CNN & Transformer Layer" should not change the height and width of the feature map. Did you make any additional changes?
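To make the size arithmetic in the question concrete, here is a minimal sketch that traces the spatial size through a sequence of convolution strides. It assumes 3×3 kernels with padding 1 and a 192×640 input; these values, and the helper names `conv_out` and `trace_sizes`, are illustrative assumptions and are not taken from the MonoViT or MPViT code.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output length along one axis for a convolution with dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_sizes(h, w, strides):
    """Apply one convolution per stride and record (height, width) after each."""
    sizes = []
    for s in strides:
        h, w = conv_out(h, stride=s), conv_out(w, stride=s)
        sizes.append((h, w))
    return sizes


# Assumed input resolution (hypothetical, chosen only for illustration).
H, W = 192, 640

# Stem as described in the question: first conv stride 2, second conv stride 1,
# followed by one further stride-2 downsampling step.
sizes = trace_sizes(H, W, [2, 1, 2])

for step, (h, w) in enumerate(sizes, start=1):
    print(f"after conv {step}: {h} x {w}  (H/{H // h}, W/{W // w})")
```

Under these assumptions the stem ends at H/2 × W/2, and only an extra stride-2 step brings the map to H/4 × W/4, which is the additional downsampling the question asks about.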

zxcqlf commented 9 months ago

We revised the encoder part; see our code for more details.