ywyue / FiT3D

[ECCV 2024] Improving 2D Feature Representations by 3D-Aware Fine-Tuning
MIT License

Improving 2D Feature Representations by 3D-Aware Fine-Tuning

ECCV 2024

Yuanwen Yue 1, Anurag Das 2, Francis Engelmann 1,3, Siyu Tang 1, Jan Eric Lenssen 2

1ETH Zurich, 2Max Planck Institute for Informatics, 3Google

Project Page | Paper



This is the official repository (under construction) for the paper Improving 2D Feature Representations by 3D-Aware Fine-Tuning.


Table of Contents
  1. Demo
  2. Preparation
  3. Training
  4. Evaluation
  5. Citation


Demo

We provide a Colab notebook with a step-by-step guide to running inference and visualizing the PCA features and K-Means clustering of the original 2D models and our fine-tuned models. We also provide an online Hugging Face demo 🤗 where users can upload their own images and inspect the visualizations. Alternatively, to run the demo locally, run python app.py.




Preparation

We train the feature Gaussians and fine-tune the 2D models on ScanNet++ scenes. Preprocessing code and instructions are here. After preprocessing, the ScanNet++ data is expected to be organized as follows:

└── db/
    └── scannetpp/
        ├── metadata/
        |    ├── nvs_sem_train.txt  # Training set for NVS and semantic tasks with 230 scenes
        |    ├── nvs_sem_val.txt # Validation set for NVS and semantic tasks with 50 scenes
        |    ├── train_samples.txt  # Training sample list, formatted as sceneID_imageID
        |    ├── val_samples.txt # Validation sample list, formatted as sceneID_imageID
        |    ├── train_view_info.npy  # Training sample camera info, e.g. projection matrices
        |    └── val_view_info.npy # Validation sample camera info, e.g. projection matrices
        └── scenes/
            ├── 0a5c013435  # scene id
            ├── ...
            └── 0a7cc12c0e
              ├── images  # undistorted and downscaled images
              ├── masks # undistorted and downscaled anonymized masks
              ├── points3D.txt  # 3D feature points used by COLMAP
              └── transforms_train.json # camera poses in the format used by Nerfstudio

For all other evaluation datasets (ScanNet, NYUd, NYUv2, ADE20k, Pascal VOC, KITTI), please follow the download instructions on their official websites.
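
Before training, it can be useful to confirm that the ScanNet++ folder matches the layout shown above. The following is a minimal sketch and not part of the repository: the helper names missing_paths and split_sample are hypothetical, and splitting a sample at its first underscore assumes that scene IDs (such as 0a5c013435) never contain an underscore themselves.

```python
from pathlib import Path

# Files listed under db/scannetpp/metadata/ in the layout above.
METADATA_FILES = [
    "nvs_sem_train.txt",
    "nvs_sem_val.txt",
    "train_samples.txt",
    "val_samples.txt",
    "train_view_info.npy",
    "val_view_info.npy",
]
# Entries expected inside every scene folder.
SCENE_ENTRIES = ["images", "masks", "points3D.txt", "transforms_train.json"]


def missing_paths(root, scene_ids):
    """Return the expected paths under `root` (e.g. db/scannetpp) that do not exist."""
    root = Path(root)
    expected = [root / "metadata" / name for name in METADATA_FILES]
    for scene_id in scene_ids:
        expected += [root / "scenes" / scene_id / entry for entry in SCENE_ENTRIES]
    return [str(path) for path in expected if not path.exists()]


def split_sample(sample):
    """Split a 'sceneID_imageID' entry at the first underscore (assumption)."""
    scene_id, image_id = sample.strip().split("_", 1)
    return scene_id, image_id
```

For example, missing_paths("db/scannetpp", ["0a5c013435"]) returns an empty list when the metadata files and that scene folder are all in place, and otherwise lists what is absent.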


Training

Stage I: Lifting Features to 3D

Example command to train the feature Gaussians for a single scene:

python train_feat_gaussian.py --run_name=example_feature_gaussian_training \
                    --model_name=dinov2_small \
                    --source_path=db/scannetpp/scenes/0a5c013435 \

model_name indicates the 2D feature extractor and can be one of dinov2_small, dinov2_reg_small, clip_base, mae_base, or deit3_base. low_sem_dim is the dimension of the semantic feature vector attached to each Gaussian. Note that it must have the same value as NUM_CHANNELS_FEAT in submodules/diff-feature-gaussian-rasterization/cuda_rasterizer/config.h.
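
A mismatch between low_sem_dim and NUM_CHANNELS_FEAT is easy to overlook, so it may help to compare the two before launching a run. The sketch below is illustrative only: it assumes that config.h defines the constant with a plain "#define NUM_CHANNELS_FEAT <integer>" line, and read_num_channels_feat and check_low_sem_dim are hypothetical helpers rather than part of the repository.

```python
import re
from pathlib import Path

CONFIG_H = "submodules/diff-feature-gaussian-rasterization/cuda_rasterizer/config.h"

# Assumed form of the definition: "#define NUM_CHANNELS_FEAT 64"
_DEFINE = re.compile(r"^\s*#\s*define\s+NUM_CHANNELS_FEAT\s+(\d+)", re.MULTILINE)


def read_num_channels_feat(header_text):
    """Extract NUM_CHANNELS_FEAT from the text of config.h, or None if absent."""
    match = _DEFINE.search(header_text)
    return int(match.group(1)) if match else None


def check_low_sem_dim(low_sem_dim, config_path=CONFIG_H):
    """Raise ValueError if low_sem_dim differs from NUM_CHANNELS_FEAT in config.h."""
    compiled = read_num_channels_feat(Path(config_path).read_text())
    if compiled is None:
        raise ValueError(f"NUM_CHANNELS_FEAT not found in {config_path}")
    if compiled != low_sem_dim:
        raise ValueError(
            f"low_sem_dim={low_sem_dim} does not match NUM_CHANNELS_FEAT={compiled}"
        )
```

Calling check_low_sem_dim(64) from the repository root returns silently when the two values agree. Since NUM_CHANNELS_FEAT is a compile-time constant, changing it presumably also requires rebuilding the rasterizer submodule.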

To generate the commands for training Gaussians for all scenes in ScanNet++, run:

python gen_commands.py --train_fgs_commands_folder=train_fgs_commands --model_name=dinov2_small --low_sem_dim=64

Training commands for all scenes will be stored in train_fgs_commands.
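
To make the idea of per-scene commands concrete, here is a rough sketch of how such a list could be assembled. It is not the actual gen_commands.py: it assumes the scene list files contain one scene ID per line, that train_feat_gaussian.py accepts a --low_sem_dim flag, and that a run name of the form modelname_sceneID is acceptable. build_commands and read_scene_ids are hypothetical names.

```python
from pathlib import Path


def read_scene_ids(list_file):
    """Read scene IDs, assuming one ID per non-empty line."""
    lines = Path(list_file).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_commands(scene_ids, model_name="dinov2_small", low_sem_dim=64,
                   scenes_root="db/scannetpp/scenes"):
    """Build one train_feat_gaussian.py command string per scene ID."""
    commands = []
    for scene_id in scene_ids:
        commands.append(
            "python train_feat_gaussian.py"
            f" --run_name={model_name}_{scene_id}"  # hypothetical naming scheme
            f" --model_name={model_name}"
            f" --source_path={scenes_root}/{scene_id}"
            f" --low_sem_dim={low_sem_dim}"  # assumed to be accepted by the script
        )
    return commands
```

Each returned string can then be written to a file or submitted to a job scheduler, one scene per job.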

After training, the parameters of all feature Gaussians need to be written to a single file, which is used in the second stage. To do that, run:

python write_feat_gaussian.py

Afterwards, the pretrained Gaussians of all training scenes are stored in pretrained_feat_gaussians_train.pth, and those of all validation scenes in pretrained_feat_gaussians_val.pth. Both files are written to db/scannetpp/metadata.

Stage II: Fine-Tuning

In this stage, we use the pretrained Gaussians to render features and use those rendered features as targets to fine-tune the 2D feature extractor. To do that, run:

python finetune.py --model_name=dinov2_small \
                   --output_dir=output_finemodel \
                   --job_name=finetuning_dinov2_small \
                   --train_gaussian_list=db/scannetpp/metadata/pretrained_feat_gaussians_train.pth \

model_name indicates the 2D feature extractor and must match the feature extractor used in the first stage. By default, fine-tuning runs for 1 epoch, after which the weights of the fine-tuned model are saved in output_dir/date_job_name.



Citation

If you find our code or paper useful, please cite:

  title     = {{Improving 2D Feature Representations by 3D-Aware Fine-Tuning}},
  author    = {Yue, Yuanwen and Das, Anurag and Engelmann, Francis and Tang, Siyu and Lenssen, Jan Eric},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2024}