Downstream task reproduce

ywyue / FiT3D

[ECCV 2024] Improving 2D Feature Representations by 3D-Aware Fine-Tuning

MIT License

229 stars 8 forks source link

Hello, thank you for the excellent research and contributions.

I have some questions regarding downstream task training and evaluation.

Using Colab, I performed depth estimation on KITTI and NYU datasets with the DINOv2 pretrained ViT-S/14 and ViT-B/14 original models, as well as with the FiT3D fine-tuned models. I followed the repositories provided at DINOv2 and Monocular-Depth-Estimation-Toolbox.

I trained with the same configuration as in the paper for comparison.

Screenshot 2024-10-28 at 5 13 29 PM

While the RMSE values from the original model closely match those in the paper:

ViT-S/14 (KITTI: 3.0721, NYU: 0.4416)
ViT-B/14 (KITTI: 2.9617, NYU: 0.3999)

the FiT3D fine-tuned models show performance below what is reported:

ViT-S/14 (KITTI: 3.6037, NYU: 0.5057)
ViT-B/14 (KITTI: 3.1444, NYU: 0.4907)

Is there something I might have missed during reproduction? The weights appear to load correctly, and I used the Colab pre-load model section directly.

Additionally, could you share any details on how you configured the ADE20k and Pascal VOC downstream tasks or if there are plans to release downstream task training/evaluation code?

I am considering using the mmsegmentation repository, similar to how I used the MDE toolbox.

Thank you for your time and help!

Hi @jo1jun, thanks for your interest in our work.

In the Colab demo and all the visualizations of feature maps and k-means clustering in the paper, we only use the fine-tuned features. However, for experiments on linear probing evaluation, we combine the fine-tuned features with the original features. As the 2D models were only fine-tuned on a small-scale indoor dataset ScanNet++, their generalization may degrade, which is one limitation of our work. We found that simply concatenating the original 2D features with fine-tuned features can preserve their generalization while incorporating 3D awareness. Related discussion can be found in:

Feature assembly in Sec. 3.3
Limitations and discussion in Sec. 4.6
Appendix C. Experiments on Feature Dimensions

We conducted the ADE20k and Pascal VOC segmentation tasks using the mmsegmentation library. For setup, please refer to DINOv2's config:

ViT-B/14:
- ADE20k: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_ade20k_linear_config.py
- Pascal VOC: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_voc2012_linear_config.py
ViT-S/14:
- ADE20k: https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_ade20k_linear_config.py
- Pascal VOC: https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_voc2012_linear_config.py

I am cleaning the linear probing evaluation code and will try to release it in this week.

ywyue / FiT3D

Downstream task reproduce #7