This is the official implementation of our 🚴 BIKE (BIdirectional Knowledge Exploration), which leverages a cross-modal bridge to enhance video recognition by exploring bidirectional knowledge.
> [**Revisiting Classifier: Transferring Vision-Language Models for Video Recognition**](https://arxiv.org/abs/2207.01297)
> Wenhao Wu, Zhun Sun, Wanli Ouyang
> [![Conference](http://img.shields.io/badge/AAAI-2023-f9f107.svg)](https://ojs.aaai.org/index.php/AAAI/article/view/25386/25158) [![Journal](http://img.shields.io/badge/IJCV-2023-Bf107.svg)](https://link.springer.com/article/10.1007/s11263-023-01876-w) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/whwu95/Text4Vis)
> [**Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?**](https://arxiv.org/abs/2301.00184)
> Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang
> Accepted by CVPR 2023 as 🌟Highlight🌟 | [![Conference](http://img.shields.io/badge/CVPR-2023-f9f107.svg)](https://openaccess.thecvf.com/content/CVPR2023/html/Wu_Cap4Video_What_Can_Auxiliary_Captions_Do_for_Text-Video_Retrieval_CVPR_2023_paper.html) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/whwu95/Cap4Video)
Apr 26, 2023
: All models, configs and training logs have been released.

Apr 20, 2023
: Main training codes have been released, including single-node/multi-node multi-GPU distributed training. Thanks for your star 😝.

Feb 28, 2023
: 🎉 Our BIKE has been accepted by CVPR 2023.

🚴 BIKE explores bidirectional cross-modal knowledge from the pre-trained vision-language model (e.g., CLIP) to introduce auxiliary attributes and category-dependent temporal saliency for improved video recognition.
- [PyTorch](https://pytorch.org/) >= 1.8
- RandAugment
- pprint
- tqdm
- dotmap
- yaml
- csv
- Optional: decord (for on-the-fly video training)
- Optional: torchnet (for mAP evaluation on ActivityNet)
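As a rough starting point, the Python dependencies above can usually be installed with pip. The exact PyPI package names below (e.g., PyYAML for yaml) are assumptions and may differ in your environment; pprint and csv ship with the Python standard library.

```sh
# Sketch only -- verify package names against your environment.
pip install "torch>=1.8" tqdm dotmap PyYAML
# Optional extras:
pip install decord    # on-the-fly video decoding
pip install torchnet  # mAP evaluation on ActivityNet
# RandAugment: check whether it is bundled with this repo or needs a separate install.
```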
(Recommended) To train all of our models, we extract videos into frames for fast reading. Please refer to the MVFNet repo for a detailed guide on dataset processing.
The annotation file is a text file with one line per video; each line gives the directory containing the video's frames, the total number of frames, and the video's label, separated by whitespace.
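A minimal illustration of this format (the paths and label indices below are hypothetical; they depend on how you extracted your frames):

```
abseiling/video_0001 300 0
archery/video_0002 250 5
```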
(Optional) We can also decode the videos on the fly using decord. This approach should work but has not been tested; all released models were trained on offline-extracted frames.
The annotation information consists of two parts: the video labels and the category descriptions.
During training we keep the total batch size at 256. If your GPUs run out of memory, reduce the per-GPU batch size and correspondingly increase `grad_accumulation_steps` in the config file so that the effective batch size stays the same, as sketched below.
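A hypothetical config fragment illustrating this trade-off (the key names and nesting are assumptions for illustration; only `grad_accumulation_steps` is quoted from the text above, so check the released configs for the actual schema):

```yaml
# Hypothetical fragment -- key names may differ from the released configs.
# Effective batch size = per-GPU batch size x num GPUs x grad_accumulation_steps
#                      = 16 x 8 x 2 = 256
data:
  batch_size: 16              # halved from 32 per GPU to save memory
solver:
  grad_accumulation_steps: 2  # doubled to keep the effective batch size at 256
```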
| Architecture | Input | Batch Size (per GPU x 8 GPUs) | Mem. (per GPU) |
|---|---|---|---|
| ViT-B/32 | 8x224² | 32 x 8 = 256 | 6G |
| ViT-B/16 | 8x224² | 32 x 8 = 256 | 9G |
| ViT-L/14 | 8x224² | 32 x 8 = 256 | 18G |
| ViT-L/14 | 16x224² | 32 x 8 = 256 | 29G |
| Architecture | Input | Views | Top-1 (%) | checkpoint | Train log | config |
|---|---|---|---|---|---|---|
| ViT-B/32 V+A | 8x224² | 1x1 | 81.4 | Score | log | config |
| ViT-B/16 | 8x224² | 4x3 | 84.0 | Github | log | config |
| ViT-L/14* | 8x224² | 4x3 | 87.4 | OneDrive | log | config |
| ViT-L/14 | 16x224² | 4x3 | 88.1 | OneDrive | log | config |
| ViT-L/14 | 8x336² | 4x3 | 88.3 | OneDrive | log | config |
| ViT-L/14 | 16x336² | 4x3 | 88.7 | OneDrive | log | config |
| Architecture | Input | Views | Top-1 (%) | mAP (%) | checkpoint | Train log | config |
|---|---|---|---|---|---|---|---|
| ViT-L/14 | 16x224² | 4x1 | 94.4 | 96.3 | OneDrive | log | config |
| ViT-L/14 | 16x336² | 4x1 | 94.7 | 96.1 | OneDrive | log | config |
| Architecture | Input | Views | mAP (%) | checkpoint | Train log | config |
|---|---|---|---|---|---|---|
| ViT-L/14 | 16x336² | 4x1 | 50.7 | OneDrive | log | config |
| Architecture | Input | Views | Top-1 (%) | checkpoint | Train log | config |
|---|---|---|---|---|---|---|
| ViT-L/14 | 16x224² | 1x1 | 98.7 | OneDrive | log | config |
| ViT-L/14 | 16x336² | 1x1 | 98.9 | OneDrive | log | config |
| Architecture | Input | Views | Top-1 (%) | checkpoint | Train log | config |
|---|---|---|---|---|---|---|
| ViT-L/14 | 16x224² | 1x1 | 82.9 | OneDrive | log | config |
| ViT-L/14 | 16x336² | 1x1 | 84.3 | OneDrive | log | config |
This implementation supports multi-GPU `DistributedDataParallel` training, which is faster and simpler than `DataParallel` training.
Note: The JSON file containing the attributes is already available at https://github.com/whwu95/BIKE/releases/tag/v1.0.
```sh
# Train the 8-frame ViT-B/32 video model (i.e., the video branch).
sh scripts/run_train.sh configs/k400/k400_train_rgb_vitb-32-f8.yaml

# Co-train the video and attributes branches (V+A).
sh scripts/run_co_train.sh configs/k400/k400_train_video_attr_vitb-32-f8.yaml
```
<details><summary>2. Multiple Machines: We also provide the script to train larger models with multiple machines (e.g., 2 nodes with 16 GPUs).</summary>
```sh
# For example, we train the 8-frame ViT-L/14-336 with 2 machines as follows:
# On the first machine, set the IP of the first machine as --master_addr, and set --nnodes to 2.
# Compared with the single-machine training script, only a node_id argument needs to be added.
sh scripts/run_train_multinodes.sh configs/k400/k400_train_rgb_vitl-14-336-f8.yaml 0

# On the second machine, --master_addr is still the IP of the first machine.
sh scripts/run_train_multinodes.sh configs/k400/k400_train_rgb_vitl-14-336-f8.yaml 1
```
</details>
We support single-view validation (default) and multi-view (4x3 views) validation.
```sh
# Testing command for obtaining top-1/top-5 accuracy.
sh scripts/run_test.sh Your-Config.yaml Your-Trained-Model.pt

# The command for zero-shot evaluation is similar.
sh scripts/run_test_zeroshot.sh Your-Config.yaml Your-Trained-Model.pt
```
We provide more examples of testing commands below.
If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry😁.
```bibtex
@inproceedings{bike,
  title={Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models},
  author={Wu, Wenhao and Wang, Xiaohan and Luo, Haipeng and Wang, Jingdong and Yang, Yi and Ouyang, Wanli},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023}
}
```
If you also find Text4Vis useful 😁, please cite the paper:
```bibtex
@inproceedings{text4vis,
  title={Revisiting Classifier: Transferring Vision-Language Models for Video Recognition},
  author={Wu, Wenhao and Sun, Zhun and Ouyang, Wanli},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
  year={2023}
}
```
This repository is built upon Text4Vis, ActionCLIP, and CLIP. Sincere thanks for their wonderful works.
For any questions, please file an issue.