This repository contains the official PyTorch implementation of our papers:
"Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization (ICML 2023, Poster)" [Paper1
]
Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang
"Building an Open-Vocabulary Video CLIP Model With Better Architectures, Optimization and Data (TPAMI 2024)" [Paper2
]
Zuxuan Wu, Zejia Weng, Wujian Peng, Xitong Yang, Ang Li, Yu-Gang Jiang
We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier, able to better recognize unseen actions and events at test time. The extended version, Open-VCLIP++, is coming soon. The self-generated captions for Kinetics-400 have been uploaded to the folder "/blip_llama2_caption".
The main dependencies are PyTorch 1.11.0, torchvision 0.12.0, and PySlowFast. Detailed installation instructions can be found in INSTALL.md.
We provide the checkpoints of Open-VCLIP, which can be downloaded through the following links:
Kinetics-400.
We obtained a compressed version of the Kinetics-400 dataset, in which videos have been resized to 256, from the VoV3d Repo. That repository provides the download link for the dataset: [Kinetics-400 dataset link]. After downloading and extracting the data, rename the folders "train_256" and "val_256" to "train" and "val", respectively (see the short sketch below). Additionally, note that the video "val/crossing_river/ZVdAl-yh9m0.mp4" is invalid and needs to be replaced: download a new version of the video from here and replace the broken file.
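The renaming step can be done in a few lines. A minimal sketch, assuming the dataset was extracted to /dev/shm/k400 (the DATA.PATH_PREFIX used in the training script below); adjust the root path to your setup:

import os

# Rename the extracted VoV3d folders to the names expected by the data loader.
root = "/dev/shm/k400"  # assumption: where you extracted the dataset
os.rename(os.path.join(root, "train_256"), os.path.join(root, "train"))
os.rename(os.path.join(root, "val_256"), os.path.join(root, "val"))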
UCF-101.
We download the UCF-101 dataset with the script provided by MMAction2.
HMDB-51.
We download the HMDB-51 dataset with the script provided by MMAction2.
Kinetics-600 testing.
The Kinetics-600 validation data we used can be downloaded from this link.
Training scripts for Open-VCLIP are provided in /script/training; an example can be viewed below. After training, you can run weight_average.py to perform the SWA operation (a sketch of the averaging step follows the script).
ROOT=/PATH/TO/Open-VCLIP
CKPT=/PATH/FOR/SAVING/CKPT/
cd $ROOT
# DATA.PATH_PREFIX # need to replace with the path of the dataset
# TRAIN.CLIP_ORI_PATH # need to replace with the path of CLIP weights
# MODEL.TEMPORAL_MODELING_TYPE # selection of temporal modeling module
python -W ignore -u tools/run_net.py \
--cfg configs/Kinetics/TemporalCLIP_vitb16_8x16_STAdapter.yaml \
--opts DATA.PATH_TO_DATA_DIR $ROOT/label_db/weng_compress_full_splits \
DATA.PATH_PREFIX /dev/shm/k400 \
DATA.PATH_LABEL_SEPARATOR , \
DATA.INDEX_LABEL_MAPPING_FILE $ROOT/label_db/k400-index2cls.json \
TRAIN.ENABLE True \
OUTPUT_DIR $CKPT/basetraining/temporalclip_vitb16_8x16_interpolation_bugfix_0.5ratio_rand0.0_0.6sample \
TRAIN.BATCH_SIZE 64 \
TEST.BATCH_SIZE 240 \
TEST.NUM_ENSEMBLE_VIEWS 3 \
TEST.NUM_SPATIAL_CROPS 1 \
NUM_GPUS 8 \
SOLVER.MAX_EPOCH 22 \
SOLVER.WARMUP_EPOCHS 2.0 \
SOLVER.BASE_LR 3.33e-6 \
SOLVER.WARMUP_START_LR 3.33e-8 \
SOLVER.COSINE_END_LR 3.33e-8 \
TRAIN.MIXED_PRECISION True \
DATA.DECODING_BACKEND "pyav" \
MODEL.NUM_CLASSES 400 \
MODEL.TEMPORAL_MODELING_TYPE 'expand_temporal_view' \
MIXUP.ENABLE False \
AUG.ENABLE False \
AUG.NUM_SAMPLE 1 \
TRAIN.EVAL_PERIOD 1 \
TRAIN.CHECKPOINT_PERIOD 1 \
MODEL.LOSS_FUNC soft_cross_entropy \
TRAIN.LINEAR_CONNECT_CLIMB True \
TRAIN.CLIP_ORI_PATH ~/.cache/clip/ViT-B-16.pt \
TRAIN.LINEAR_CONNECT_LOSS_RATIO 0.5 \
TRAIN.LINEAR_CONNECT_SAMPLE_L 0.0 \
TRAIN.LINEAR_CONNECT_SAMPLE_R 0.6
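For reference, the SWA step amounts to averaging the model weights of the per-epoch checkpoints. The following is a minimal, hypothetical sketch of that operation, not the actual weight_average.py: the checkpoint naming (checkpoint_epoch_XXXXX.pyth, PySlowFast's convention), the "model_state" key, and the epoch range 2..22 (suggested by the released file name swa_2_22.pth) are all assumptions; consult the script for the real logic.

import torch

# Sketch of stochastic weight averaging (SWA) over saved checkpoints.
ckpt_dir = "/PATH/FOR/SAVING/CKPT/basetraining/..."  # the OUTPUT_DIR above
epochs = range(2, 23)  # assumption: swa_2_22.pth averages epochs 2..22

avg_state = None
for i, ep in enumerate(epochs, start=1):
    state = torch.load(
        f"{ckpt_dir}/checkpoints/checkpoint_epoch_{ep:05d}.pyth",
        map_location="cpu",
    )["model_state"]  # assumption: PySlowFast-style checkpoint dict
    if avg_state is None:
        avg_state = {k: v.clone().float() for k, v in state.items()}
    else:
        # running mean over checkpoints: avg += (x - avg) / n
        for k in avg_state:
            avg_state[k] += (state[k].float() - avg_state[k]) / i

torch.save({"model_state": avg_state}, f"{ckpt_dir}/swa_2_22.pth")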
Download checkpoints and evaluate models with the scripts in the folder "/script/testing/". An example script can be viewed below.
Note: changing the value of TEST.PATCHING_RATIO changes the weight interpolation factor (a sketch of the interpolation follows the script).
ROOT=/PATH/TO/Open-VCLIP
CKPT=/PATH/FOR/SAVING/CKPT/
OUT_DIR=$CKPT/testing
LOAD_CKPT_FILE=/PATH/TO/openvclip-b16/swa_2_22.pth
PATCHING_RATIO=0.5
# DATA.PATH_TO_DATA_DIR $ROOT/zs_label_db/ucf101_full \ # option: ucf101_full / ucf101_split1 / ucf101_split2 / ucf101_split3
# DATA.PATH_PREFIX # need to replace with the path of the dataset
# TEST.CUSTOM_LOAD_FILE # path of checkpoint to be loaded
# TEST.PATCHING_RATIO # relates to the patching ratio: [old_w * ratio + new_w * (1 - ratio)]
# TEST.CLIP_ORI_PATH # need to replace with the path of CLIP weights
# MODEL.TEMPORAL_MODELING_TYPE # selection of temporal modeling module
cd $ROOT
python -W ignore -u tools/run_net.py \
--cfg configs/Kinetics/TemporalCLIP_vitb16_8x16_STAdapter.yaml \
--opts DATA.PATH_TO_DATA_DIR $ROOT/zs_label_db/ucf101_full \
DATA.PATH_PREFIX /dev/shm/ucf/UCF-101 \
DATA.PATH_LABEL_SEPARATOR , \
DATA.INDEX_LABEL_MAPPING_FILE $ROOT/zs_label_db/ucf101-index2cls.json \
TRAIN.ENABLE False \
OUTPUT_DIR $OUT_DIR \
TEST.BATCH_SIZE 480 \
NUM_GPUS 8 \
DATA.DECODING_BACKEND "pyav" \
MODEL.NUM_CLASSES 101 \
MODEL.TEMPORAL_MODELING_TYPE 'expand_temporal_view' \
TEST.CUSTOM_LOAD True \
TEST.CUSTOM_LOAD_FILE $LOAD_CKPT_FILE \
TEST.SAVE_RESULTS_PATH temp.pyth \
TEST.NUM_ENSEMBLE_VIEWS 3 \
TEST.NUM_SPATIAL_CROPS 1 \
TEST.PATCHING_MODEL True \
TEST.PATCHING_RATIO $PATCHING_RATIO \
TEST.CLIP_ORI_PATH /root/.cache/clip/ViT-B-16.pt
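The patching operation controlled by TEST.PATCHING_RATIO implements the interpolation given in the comment above: patched_w = old_w * ratio + new_w * (1 - ratio), where old_w are the original CLIP weights and new_w the fine-tuned ones. A minimal, hypothetical sketch of that idea; the repo applies this internally when TEST.PATCHING_MODEL is True, and the key handling here is illustrative only:

import torch

def patch_weights(clip_state, finetuned_state, ratio=0.5):
    """Interpolate fine-tuned weights back toward the original CLIP weights."""
    patched = {}
    for k, new_w in finetuned_state.items():
        if k in clip_state and clip_state[k].shape == new_w.shape:
            # Shared parameter: interpolate old (CLIP) and new (fine-tuned).
            patched[k] = ratio * clip_state[k].float() + (1.0 - ratio) * new_w.float()
        else:
            # Parameter added for video (e.g., temporal modeling): keep as-is.
            patched[k] = new_w
    return patched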
This repository is built upon PySlowFast and CLIP. Thanks to those well-organized codebases.
@inproceedings{weng2023transforming,
title={Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization},
author={Weng, Zejia and Yang, Xitong and Li, Ang and Wu, Zuxuan and Jiang, Yu-Gang},
booktitle={ICML},
year={2023}
}
@article{wu2024building,
title={Building an Open-Vocabulary Video CLIP Model With Better Architectures, Optimization and Data},
author={Wu, Zuxuan and Weng, Zejia and Peng, Wujian and Yang, Xitong and Li, Ang and Davis, Larry S and Jiang, Yu-Gang},
journal={TPAMI},
year={2024},
}