whwu95 / BIKE

【CVPR'2023】Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
https://arxiv.org/abs/2301.00182
MIT License
action-recognition cross-modal-learning video-language-understanding video-recognition video-understanding

【CVPR'2023】🚴 BIKE: Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

[![Conference](http://img.shields.io/badge/CVPR-2023-f9f107.svg)](https://openaccess.thecvf.com/content/CVPR2023/html/Wu_Bidirectional_Cross-Modal_Knowledge_Exploration_for_Video_Recognition_With_Pre-Trained_Vision-Language_CVPR_2023_paper.html) [![arXiv](https://img.shields.io/badge/Arxiv-2301.00182-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2301.00182)

[Wenhao Wu](https://whwu95.github.io/)<sup>1,2</sup>, [Xiaohan Wang](https://scholar.google.com/citations?user=iGA10XoAAAAJ&hl=en)<sup>3</sup>, [Haipeng Luo]()<sup>4</sup>, [Jingdong Wang](https://jingdongwang2017.github.io/)<sup>2</sup>, [Yi Yang](https://scholar.google.com/citations?user=RMSuNFwAAAAJ&hl=en)<sup>3</sup>, [Wanli Ouyang](https://wlouyang.github.io/)<sup>5,1</sup>

<sup>1</sup>[The University of Sydney](https://www.sydney.edu.au/), <sup>2</sup>[Baidu](https://vis.baidu.com/#/), <sup>3</sup>[ZJU](https://www.zju.edu.cn/english/), <sup>4</sup>[UCAS](https://english.ucas.ac.cn/), <sup>5</sup>[Shanghai AI Lab](https://www.shlab.org.cn/)


This is the official implementation of our 🚴 BIKE (BIdirectional Knowledge Exploration), which leverages a cross-modal bridge to enhance video recognition by exploring bidirectional knowledge.

📣 I also have other cross-modal video projects that may interest you ✨.

> [**Revisiting Classifier: Transferring Vision-Language Models for Video Recognition**](https://arxiv.org/abs/2207.01297)
> Wenhao Wu, Zhun Sun, Wanli Ouyang
> [![Conference](http://img.shields.io/badge/AAAI-2023-f9f107.svg)](https://ojs.aaai.org/index.php/AAAI/article/view/25386/25158) [![Journal](http://img.shields.io/badge/IJCV-2023-Bf107.svg)](https://link.springer.com/article/10.1007/s11263-023-01876-w) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/whwu95/Text4Vis)

> [**Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?**](https://arxiv.org/abs/2301.00184)
> Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang
> Accepted by CVPR 2023 as 🌟Highlight🌟 | [![Conference](http://img.shields.io/badge/CVPR-2023-f9f107.svg)](https://openaccess.thecvf.com/content/CVPR2023/html/Wu_Cap4Video_What_Can_Auxiliary_Captions_Do_for_Text-Video_Retrieval_CVPR_2023_paper.html) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/whwu95/Cap4Video)

News

Overview

🚴BIKE explores bidirectional cross-modal knowledge from the pre-trained vision-language model (e.g., CLIP) to introduce auxiliary attributes and category-dependent temporal saliency for improved video recognition.


Content

Prerequisites

The code is built with the following libraries.

- [PyTorch](https://pytorch.org/) >= 1.8
- RandAugment
- pprint
- tqdm
- dotmap
- yaml
- csv
- Optional: decord (for on-the-fly video training)
- Optional: torchnet (for mAP evaluation on ActivityNet)

Data Preparation

Video Loader

(Recommended) To train all of our models, we extract the videos into frames for fast reading. Please refer to the MVFNet repo for a detailed guide to dataset processing.
The annotation file is a text file with one video per line. Each line gives the directory containing the video's frames, the total number of frames, and the label, separated by whitespace.

Example of annotation

```sh
abseiling/-7kbO0v4hag_000107_000117 300 0
abseiling/-bwYZwnwb8E_000013_000023 300 0
```

(Optional) We can also decode the videos on the fly using decord. This approach should work but has not been tested. All of the released models were trained on offline frames.

Example of annotation

```sh
abseiling/-7kbO0v4hag_000107_000117.mp4 0
abseiling/-bwYZwnwb8E_000013_000023.mp4 0
```
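
The two annotation layouts above can be read with a few lines of Python. This is a minimal sketch, not the repository's own loader: the `VideoRecord` class and both function names are hypothetical, and it assumes single-label lines whose paths contain no spaces.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VideoRecord:
    """One annotation line (hypothetical container, not part of the repo)."""
    path: str                  # frame directory or video file
    num_frames: Optional[int]  # None for the on-the-fly video layout
    label: int


def parse_annotation_line(line: str) -> VideoRecord:
    """Parse 'path num_frames label' (frames) or 'path label' (videos)."""
    fields = line.split()
    if len(fields) == 3:
        path, num_frames, label = fields
        return VideoRecord(path, int(num_frames), int(label))
    if len(fields) == 2:
        path, label = fields
        return VideoRecord(path, None, int(label))
    raise ValueError(f"Unexpected annotation line: {line!r}")


def parse_annotations(text: str) -> List[VideoRecord]:
    """Parse every non-empty line of an annotation file's contents."""
    return [parse_annotation_line(l) for l in text.splitlines() if l.strip()]


example = """abseiling/-7kbO0v4hag_000107_000117 300 0
abseiling/-bwYZwnwb8E_000013_000023.mp4 0"""
records = parse_annotations(example)
```

Running a check like this over an annotation file before training is a cheap way to catch malformed lines early.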

Annotation

Annotation information consists of two parts: video label, and category description.

📱 Model Zoo

Training GPU Memory

During our training, we maintain a total batch size of 256. If your GPUs do not have enough memory, you can reduce the per-GPU batch size to lower memory usage. When you do, increase "grad_accumulation_steps" in the config file correspondingly so that the effective batch size stays at 256.

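| Architecture | Input | Batch Size (per GPU) x 8 GPUs | Mem. (per GPU) |
| --- | --- | --- | --- |
| ViT-B/32 | 8x224<sup>2</sup> | 32 x 8 = 256 | 6G |
| ViT-B/16 | 8x224<sup>2</sup> | 32 x 8 = 256 | 9G |
| ViT-L/14 | 8x224<sup>2</sup> | 32 x 8 = 256 | 18G |
| ViT-L/14 | 16x224<sup>2</sup> | 32 x 8 = 256 | 29G |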
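
As a rough guide for picking the accumulation value, the sketch below assumes the effective batch size is `per-GPU batch size x number of GPUs x grad_accumulation_steps`. The helper name is hypothetical, and how the training script actually consumes the config value should be checked against the code.

```python
def grad_accumulation_steps(per_gpu_batch: int, num_gpus: int,
                            target_total: int = 256) -> int:
    """Accumulation steps needed to reach the target effective batch size.

    Assumes effective batch = per_gpu_batch * num_gpus * accumulation steps.
    """
    per_step = per_gpu_batch * num_gpus
    if per_step <= 0 or target_total % per_step != 0:
        raise ValueError(
            f"{per_gpu_batch} x {num_gpus} does not divide {target_total} evenly"
        )
    return target_total // per_step


steps_default = grad_accumulation_steps(32, 8)  # the setting in the table above
steps_half = grad_accumulation_steps(16, 8)     # half the per-GPU batch size
steps_small = grad_accumulation_steps(8, 4)     # 4 GPUs with 8 samples each
```

With the default 32 x 8 setup no accumulation is needed (1 step), halving the per-GPU batch size gives 2 steps, and 4 GPUs with 8 samples each gives 8 steps.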

Kinetics-400

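| Architecture | Input | Views | Top-1 (%) | checkpoint | Train log | config |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-B/32 <sup>V+A</sup> | 8x224<sup>2</sup> | 1x1 | 81.4 | Score | log | config |
| ViT-B/16 | 8x224<sup>2</sup> | 4x3 | 84.0 | Github | log | config |
| ViT-L/14* | 8x224<sup>2</sup> | 4x3 | 87.4 | OneDrive | log | config |
| ViT-L/14 | 16x224<sup>2</sup> | 4x3 | 88.1 | OneDrive | log | config |
| ViT-L/14 | 8x336<sup>2</sup> | 4x3 | 88.3 | OneDrive | log | config |
| ViT-L/14 | 16x336<sup>2</sup> | 4x3 | 88.7 | OneDrive | log | config |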
<!-- ViT-L/14 32x3362 4x3 xx.x [OneDrive]() log config -->

Untrimmed Video Recognition: ActivityNet

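| Architecture | Input | Views | Top-1 (%) | mAP (%) | checkpoint | Train log | config |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-L/14 | 16x224<sup>2</sup> | 4x1 | 94.4 | 96.3 | OneDrive | log | config |
| ViT-L/14 | 16x336<sup>2</sup> | 4x1 | 94.7 | 96.1 | OneDrive | log | config |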

Multi-label Action Recognition: Charades

| Architecture | Input | Views | mAP (%) | checkpoint | Train log | config |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-L/14 | 16x336<sup>2</sup> | 4x1 | 50.7 | OneDrive | log | config |

UCF-101

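| Architecture | Input | Views | Top-1 (%) | checkpoint | Train log | config |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-L/14 | 16x224<sup>2</sup> | 1x1 | 98.7 | OneDrive | log | config |
| ViT-L/14 | 16x336<sup>2</sup> | 1x1 | 98.9 | OneDrive | log | config |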

HMDB-51

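| Architecture | Input | Views | Top-1 (%) | checkpoint | Train log | config |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-L/14 | 16x224<sup>2</sup> | 1x1 | 82.9 | OneDrive | log | config |
| ViT-L/14 | 16x336<sup>2</sup> | 1x1 | 84.3 | OneDrive | log | config |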

🚀 Training

This implementation supports Multi-GPU DistributedDataParallel training, which is faster and simpler than DataParallel training. Note: The JSON file containing the attributes is already available at https://github.com/whwu95/BIKE/releases/tag/v1.0.

  1. Single Machine: To train our model on Kinetics-400 with 8 GPUs on a single machine, you can run:

    # We train the 8 Frames ViT-B/32 video model (i.e., video branch).
    sh scripts/run_train.sh configs/k400/k400_train_rgb_vitb-32-f8.yaml

    # We train the video branch and attributes branch.
    sh scripts/run_co_train.sh configs/k400/k400_train_video_attr_vitb-32-f8.yaml


<details><summary>2. Multiple Machines: We also provide a script to train larger models on multiple machines (e.g., 2 nodes with 16 GPUs in total).</summary>

```sh
# For example, we train the 8 Frames ViT-L/14-336 with 2 machines as follows:
# For the first machine, set --master_addr to the IP of your first machine and --nnodes to 2.
# Compared with the single-machine training script, only the node id needs to be added.
sh scripts/run_train_multinodes.sh configs/k400/k400_train_rgb_vitl-14-336-f8.yaml 0

# For the second machine, --master_addr is still the IP of your first machine.
sh scripts/run_train_multinodes.sh configs/k400/k400_train_rgb_vitl-14-336-f8.yaml 1
```

</details>

3. Few-shot Recognition: To train our model in the few-shot scenario, you only need to add one line to the general config file.

```yaml
# You can refer to config/k400/k400_few_shot.yaml
data:
    ...  # general configurations
    shot: 2  # i.e., 2-shot setting
```

⚡ Testing

We support single-view validation (default) and multi-view (4x3 views) validation.

    # The testing command for obtaining top-1/top-5 accuracy.
    sh scripts/run_test.sh Your-Config.yaml Your-Trained-Model.pt

    # The command for zero-shot evaluation is similar.
    sh scripts/run_test_zeroshot.sh Your-Config.yaml Your-Trained-Model.pt
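
For intuition on multi-view validation, 4 clips x 3 crops gives 12 views per video whose predictions are combined into one. The sketch below averages per-view softmax probabilities, which is a common choice. Whether the test script fuses probabilities, logits, or similarities is an assumption here, and `fuse_views` is a hypothetical name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_logits):
    """Average per-view class probabilities; return (predicted class, fused probs)."""
    probs = [softmax(v) for v in view_logits]
    num_classes = len(probs[0])
    fused = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    return max(range(num_classes), key=fused.__getitem__), fused


# Toy scores for 4 clips x 3 crops = 12 views over 3 classes:
# eleven views favour class 0, one outlier view favours class 1.
views = [[2.0, 1.0, 0.1]] * 11 + [[0.1, 3.0, 0.2]]
pred, fused = fuse_views(views)
```

In this toy case the single outlier view is outvoted and the fused prediction stays at class 0, which illustrates why multi-view testing tends to be more stable than a single view.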

We provide more examples of testing commands below.

General / Few-shot Video Recognition

```sh
# Efficient Setting: Single view evaluation.
# E.g., ViT-L/14 8 Frames on Kinetics-400. You should get around 86.5% top-1 accuracy.
sh scripts/run_test.sh configs/k400/k400_train_rgb_vitl-14-f8.yaml exps/k400/ViT-L/14/8f/k400-vit-l-14-f8.pt

# Accurate Setting: Multi-view evaluation (4clipsx3crops).
# You should get around 87.4% top-1 accuracy.
sh scripts/run_test.sh configs/k400/k400_train_rgb_vitl-14-f8.yaml exps/k400/ViT-L/14/8f/k400-vit-l-14-f8.pt --test_crops 3 --test_clips 4

# Test the Charades dataset using the mAP metric. You should achieve around 50.7 mAP.
sh scripts/run_test_charades.sh configs/charades/charades_k400_finetune_336.yaml exps/charades/ViT-L/14-336px/16f/charades-vit-l-336-f16.pt --test_crops 1 --test_clips 4

# Test the ActivityNet dataset using top1 and mAP metric. You should achieve around 96.3 mAP.
sh scripts/run_test.sh configs/anet/anet_k400_finetune.yaml exps/anet/ViT-L/14/f16/anet-vit-l-f16.pt --test_crops 1 --test_clips 4
```

Zero-shot Evaluation

We use the Kinetics-400 pre-trained model (e.g., [ViT-L/14 with 8 frames](configs/k400/k400_train_rgb_vitl-14-f8.yaml)) to perform cross-dataset zero-shot evaluation on UCF101, HMDB51, ActivityNet, and Kinetics-600.

- Half-classes Evaluation: A traditional evaluation protocol that randomly selects half of the test dataset's classes, repeats the process ten times, and reports the mean accuracy and standard deviation over the ten runs.
- Full-classes Evaluation: Perform evaluation on the entire dataset.

```sh
# On ActivityNet: reporting the half-classes and full-classes results
# Half-classes: 86.18 ± 1.05, Full-classes: 80.04
sh scripts/run_test_zeroshot.sh configs/anet/anet_zero_shot.yaml exps/k400/ViT-L/14/8f/k400-vit-l-14-f8.pt

# On UCF101: reporting the half-classes and full-classes results
# Half-classes: 86.63 ± 3.4, Full-classes: 80.83
sh scripts/run_test_zeroshot.sh configs/ucf101/ucf_zero_shot.yaml exps/k400/ViT-L/14/8f/k400-vit-l-14-f8.pt

# On HMDB51: reporting the half-classes and full-classes results
# Half-classes: 61.37 ± 3.68, Full-classes: 52.75
sh scripts/run_test_zeroshot.sh configs/hmdb51/hmdb_zero_shot.yaml exps/k400/ViT-L/14/8f/k400-vit-l-14-f8.pt

# On Kinetics-600: manually calculating the mean accuracy with standard deviation of three splits.
# Split1: 70.14, Split2: 68.31, Split3: 67.15
# Average: 68.53 ± 1.23
sh scripts/run_test.sh configs/k600/k600_zero_shot_split1.yaml exps/k400/ViT-L/14/8f/k400-vit-l-14-f8.pt
sh scripts/run_test.sh configs/k600/k600_zero_shot_split2.yaml exps/k400/ViT-L/14/8f/k400-vit-l-14-f8.pt
sh scripts/run_test.sh configs/k600/k600_zero_shot_split3.yaml exps/k400/ViT-L/14/8f/k400-vit-l-14-f8.pt
```
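
The half-classes protocol and the manual Kinetics-600 aggregation can be sketched in plain Python. This is an illustration under stated assumptions, not the code behind `run_test_zeroshot.sh`: `half_class_eval` and `summarize` are hypothetical names, the class subsets are drawn with a fixed seed, predictions are restricted to the selected classes, and the population standard deviation is used.

```python
import random
import statistics


def half_class_eval(scores, labels, runs=10, seed=0):
    """Return (mean, std) of top-1 accuracy (%) over random half-class subsets.

    scores[i][c] is the video-to-class similarity of video i for class c.
    """
    num_classes = len(scores[0])
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        chosen = rng.sample(range(num_classes), num_classes // 2)
        chosen_set = set(chosen)
        correct = total = 0
        for row, label in zip(scores, labels):
            if label not in chosen_set:
                continue  # videos of unselected classes are left out of this run
            # Classify among the selected classes only.
            prediction = max(chosen, key=lambda c: row[c])
            correct += int(prediction == label)
            total += 1
        if total:
            accuracies.append(100.0 * correct / total)
    return statistics.mean(accuracies), statistics.pstdev(accuracies)


def summarize(values):
    """Mean and population standard deviation of per-split accuracies."""
    return statistics.mean(values), statistics.pstdev(values)


# Toy example: 4 classes, one perfectly classified video per class.
toy_scores = [
    [0.9, 0.1, 0.0, 0.0],
    [0.2, 0.7, 0.1, 0.0],
    [0.0, 0.1, 0.8, 0.1],
    [0.1, 0.0, 0.2, 0.7],
]
toy_labels = [0, 1, 2, 3]
toy_mean, toy_std = half_class_eval(toy_scores, toy_labels)

# Aggregating the three Kinetics-600 split accuracies listed above.
k600_mean, k600_std = summarize([70.14, 68.31, 67.15])
print(f"{k600_mean:.2f} ± {k600_std:.2f}")
```

With the population standard deviation, the three split accuracies give 68.53 ± 1.23, matching the figure quoted in the commands above. The sample standard deviation would give roughly 1.51 instead.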

📌 BibTeX & Citation

If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry😁.

    @inproceedings{bike,
      title={Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models},
      author={Wu, Wenhao and Wang, Xiaohan and Luo, Haipeng and Wang, Jingdong and Yang, Yi and Ouyang, Wanli},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2023}
    }

If you also find Text4Vis useful 😁, please cite the paper:

    @inproceedings{text4vis,
      title={Revisiting Classifier: Transferring Vision-Language Models for Video Recognition},
      author={Wu, Wenhao and Sun, Zhun and Ouyang, Wanli},
      booktitle={Proceedings of AAAI Conference on Artificial Intelligence (AAAI)},
      year={2023}
    }

🎗️ Acknowledgement

This repository builds on Text4Vis, ActionCLIP, and CLIP. Sincere thanks to the authors for their wonderful work.

👫 Contact

For any questions, please file an issue.