
🔥【AAAI'2023, IJCV'2023】Revisiting Classifier: Transferring Vision-Language Models for Video Recognition

[![Conference](http://img.shields.io/badge/AAAI-2023-f9f107.svg)](https://ojs.aaai.org/index.php/AAAI/article/view/25386/25158) [![Journal](http://img.shields.io/badge/IJCV-2023-Bf107.svg)](https://link.springer.com/article/10.1007/s11263-023-01876-w)

[Wenhao Wu](https://whwu95.github.io/)<sup>1,2</sup>, [Zhun Sun](https://scholar.google.co.jp/citations?user=Y-3iZ9EAAAAJ&hl=en)<sup>2</sup>, [Wanli Ouyang](https://wlouyang.github.io/)<sup>3,1</sup>

<sup>1</sup>[The University of Sydney](https://www.sydney.edu.au/), <sup>2</sup>[Baidu](https://vis.baidu.com/#/), <sup>3</sup>[Shanghai AI Lab](https://www.shlab.org.cn/)


This is the official implementation of the AAAI paper Revisiting Classifier: Transferring Vision-Language Models for Video Recognition, and IJCV paper Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective.

🙋 I also have other cross-modal video projects that may interest you ✨.

> [**Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models**](https://arxiv.org/abs/2301.00182)
> Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
> [![Conference](http://img.shields.io/badge/CVPR-2023-f9f107.svg)](https://openaccess.thecvf.com/content/CVPR2023/html/Wu_Bidirectional_Cross-Modal_Knowledge_Exploration_for_Video_Recognition_With_Pre-Trained_Vision-Language_CVPR_2023_paper.html) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/whwu95/BIKE)

> [**Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?**](https://arxiv.org/abs/2301.00184)
> Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang
> Accepted by CVPR 2023 as 🌟Highlight🌟 | [![Conference](http://img.shields.io/badge/CVPR-2023-f9f107.svg)](https://openaccess.thecvf.com/content/CVPR2023/html/Wu_Cap4Video_What_Can_Auxiliary_Captions_Do_for_Text-Video_Retrieval_CVPR_2023_paper.html) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/whwu95/Cap4Video)

📣 Updates

🌈 Overview

In our Text4Vis, we revisit the role of the linear classifier and replace it with knowledge from the pre-trained model: we utilize the well-pre-trained language model to generate good semantic targets for efficient transfer learning.

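To make the classifier-replacement idea concrete, here is a minimal sketch assuming the public CLIP package and a toy class list; it is illustrative only and not the repository's actual pipeline.

```python
# Minimal sketch of the core idea: instead of learning a randomly initialized
# linear head, build the classifier from text embeddings of the class names.
# Illustrative only -- the class list, prompt, and backbone are assumptions,
# not the repository's exact pipeline.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

class_names = ["abseiling", "air drumming", "archery"]  # toy subset of Kinetics classes
tokens = clip.tokenize([f"a video of {c}" for c in class_names]).to(device)

with torch.no_grad():
    text_embed = model.encode_text(tokens).float()           # [num_classes, dim]
    text_embed = text_embed / text_embed.norm(dim=-1, keepdim=True)

# The normalized text embeddings act as frozen classifier weights: video
# features are scored against them by cosine similarity.
video_feat = torch.randn(2, text_embed.shape[-1], device=device)  # dummy video features
video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
logits = video_feat @ text_embed.t()                              # [batch, num_classes]
print(logits.shape)
```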

Content

📕 Prerequisites

The code is built with the following libraries:

📚 Data Preparation

Video Loader

(Recommended) To train all of our models, we extract videos into frames for fast reading. Please refer to the MVFNet repo for a detailed guide to data processing.
The annotation file is a text file with multiple lines; each line gives the directory containing the frames of a video, the total number of frames, and the label of the video, separated by whitespace. Here is the format:

abseiling/-7kbO0v4hag_000107_000117 300 0
abseiling/-bwYZwnwb8E_000013_000023 300 0
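
For illustration, a small sketch of parsing this annotation format and uniformly sampling frame indices; the helper names are hypothetical, and the repository's own dataset code is the authoritative implementation.

```python
# Illustrative parser for the frame-based annotation format described above
# (frame directory, total frame count, label, separated by whitespace).
# A sketch only, not the repository's dataset implementation.
from dataclasses import dataclass

@dataclass
class VideoRecord:
    frame_dir: str
    num_frames: int
    label: int

def load_annotations(path: str) -> list[VideoRecord]:
    records = []
    with open(path) as f:
        for line in f:
            frame_dir, num_frames, label = line.strip().split()
            records.append(VideoRecord(frame_dir, int(num_frames), int(label)))
    return records

def sample_indices(num_frames: int, num_segments: int = 8) -> list[int]:
    # Uniformly sample one frame index per temporal segment (a common scheme).
    seg = num_frames / num_segments
    return [int(seg * i + seg / 2) for i in range(num_segments)]
```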

(Optional) We can also decode the videos in an online fashion using decord. This should work but has not been tested; all of the provided models were trained on offline-extracted frames. Example annotation:

abseiling/-7kbO0v4hag_000107_000117.mp4 0
abseiling/-bwYZwnwb8E_000013_000023.mp4 0
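
If you do go the online route, here is a minimal sketch of frame loading with decord under the format above; the sampling scheme is an assumption.

```python
# Sketch of reading frames directly from a video file with decord, matching
# the "video_path label" annotation format above. The sampling strategy is
# illustrative, not the repository's exact logic.
import numpy as np
from decord import VideoReader, cpu

def load_video(path: str, num_segments: int = 8) -> np.ndarray:
    vr = VideoReader(path, ctx=cpu(0))
    total = len(vr)
    # One uniformly spaced index per segment.
    indices = np.linspace(0, total - 1, num_segments).astype(int)
    frames = vr.get_batch(indices).asnumpy()   # [num_segments, H, W, 3], uint8
    return frames
```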

Annotation

Annotation information consists of two parts: the video labels and the category descriptions.

📱 Model Zoo

Here we provide some off-the-shelf pre-trained checkpoints of our models in the following tables.

#Frame = #input_frame x #spatial crops x #temporal clips. For example, 8x3x4 denotes 8 input frames per clip, 3 spatial crops, and 4 temporal clips (i.e., 12 views and 96 frames per video at inference).

Kinetics-400

| Architecture | #Frame | Top-1 Acc. (%) | Checkpoint | Train log | Config |
|:------------:|:------:|:--------------:|:----------:|:---------:|:------:|
| ViT-B/32 | 8x3x4 | 80.0 | Github | log | config |
| ViT-B/32 | 16x3x4 | 80.5 | Github | log | config |
| ViT-B/16 | 8x3x4 | 82.9 | Github | log | config |
| ViT-B/16 | 16x3x4 | 83.6 | Github | log | config |
| ViT-L/14* | 8x3x4 | 86.4 | OneDrive | log | config |
| ViT-L/14-336 | 8x3x4 | 87.1 | OneDrive | log | config |
| ViT-L/14-336 | 32x3x1 | 87.8 | OneDrive | log | config |

Note: * indicates that this ViT-L model is used for the zero-shot evaluation on UCF, HMDB, ActivityNet, and Kinetics-600.

ActivityNet

| Architecture | #Frame | mAP (%) | Checkpoint | Train log | Config |
|:------------:|:------:|:-------:|:----------:|:---------:|:------:|
| ViT-L/14 | 16x1x1 | 96.5 | OneDrive | - | config |
| ViT-L/14-336 | 16x1x1 | 96.9 | OneDrive | log | config |

UCF-101

| Architecture | #Frame | Top-1 Acc. (%) | Checkpoint | Train log | Config |
|:------------:|:------:|:--------------:|:----------:|:---------:|:------:|
| ViT-L/14 | 16x1x1 | 98.1 | OneDrive | log | config |
<!-- ViT-L/14-336 16x1x1 98.2 - log config -->

HMDB-51

| Architecture | #Frame | Top-1 Acc. (%) | Checkpoint | Train log | Config |
|:------------:|:------:|:--------------:|:----------:|:---------:|:------:|
| ViT-L/14 | 16x1x1 | 81.3 | OneDrive | log | config |
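
If you want to peek inside a downloaded checkpoint before wiring it into the test scripts, here is a plain-PyTorch sketch; the internal key names are not guaranteed, and the test script is the authoritative loader.

```python
# Inspect a downloaded checkpoint with plain PyTorch. The internal key layout
# (e.g., whether weights live under "model_state_dict") is an assumption --
# refer to the repository's test script for the authoritative loading code.
import torch

ckpt = torch.load("last_model.pt", map_location="cpu")
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))
else:
    print("checkpoint object:", type(ckpt))
```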

🚀 Training

This implementation supports Multi-GPU DistributedDataParallel training, which is faster and simpler than DataParallel used in ActionCLIP.

For the second machine, --master_addr is still the IP address of your first machine.

sh scripts/run_train_multinodes.sh configs/k400/k400_train_rgb_vitl-14-f8.yaml 1
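
For orientation, the scripts follow the standard DistributedDataParallel launch pattern; a minimal, self-contained skeleton (not the repository's actual training code) looks like this:

```python
# Minimal DistributedDataParallel skeleton illustrating the launch pattern the
# training scripts rely on. A sketch only -- the real entry point, model, and
# arguments live in the repository's train script and configs.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")   # RANK / WORLD_SIZE / MASTER_ADDR come from the launcher
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 400).cuda(local_rank)  # stand-in for the video model
    model = DDP(model, device_ids=[local_rank])
    # ... build a dataloader with DistributedSampler and run the training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```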


- **Few-shot Recognition**: To train our model under the *Few-shot* scenario, you just need to add one line to the general config file:
```yaml
# You can refer to configs/k400/k400_few_shot.yaml
data: 
    ...  # general configurations
    shot: 2  # i.e., 2-shot setting
```

⚡ Testing

We support single view validation and multi-view (4x3 views) validation.

General/Few-shot Video Recognition

# Single view evaluation. e.g., ViT-B/32 8 Frames on Kinetics-400
sh scripts/run_test.sh  configs/k400/k400_train_rgb_vitb-32-f8.yaml exp/k400/ViT-B/32/f8/last_model.pt

# Multi-view evaluation (4 clips x 3 crops). e.g., ViT-B/32 8 Frames on Kinetics-400
sh scripts/run_test.sh  configs/k400/k400_train_rgb_vitb-32-f8.yaml exp/k400/ViT-B/32/f8/last_model.pt --test_crops 3  --test_clips 4
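
Conceptually, multi-view testing averages the per-view predictions before taking the argmax; a toy sketch with dummy logits (the actual aggregation lives in the test script):

```python
# Illustrative multi-view aggregation: with 4 temporal clips x 3 spatial crops,
# each video yields 12 views whose logits are averaged before the argmax.
# This mirrors the idea only; the real implementation is in the test script.
import torch

num_videos, num_views, num_classes = 16, 4 * 3, 400
view_logits = torch.randn(num_videos, num_views, num_classes)  # dummy predictions

video_logits = view_logits.mean(dim=1)   # average over the 12 views
pred = video_logits.argmax(dim=-1)       # final class per video
print(pred.shape)
```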

Zero-shot Evaluation

We use the Kinetics-400 pre-trained model (e.g., ViT-L/14 with 8 frames) to perform cross-dataset zero-shot evaluation on UCF101, HMDB51, ActivityNet, and Kinetics-600.

# On ActivityNet: reporting the half-classes and full-classes results
sh scripts/run_test_zeroshot.sh  configs/anet/anet_zero_shot.yaml exp/k400/ViT-L/14/f8/last_model.pt

# On UCF101: reporting the half-classes and full-classes results
sh scripts/run_test_zeroshot.sh  configs/ucf101/ucf_zero_shot.yaml exp/k400/ViT-L/14/f8/last_model.pt

# On HMDB51: reporting the half-classes and full-classes results
sh scripts/run_test_zeroshot.sh  configs/hmdb51/hmdb_zero_shot.yaml exp/k400/ViT-L/14/f8/last_model.pt

# On Kinetics-600: run all three splits, then compute the mean accuracy and standard deviation across them.
sh scripts/run_test.sh  configs/k600/k600_zero_shot_split1.yaml exp/k400/ViT-L/14/f8/last_model.pt
sh scripts/run_test.sh  configs/k600/k600_zero_shot_split2.yaml exp/k400/ViT-L/14/f8/last_model.pt
sh scripts/run_test.sh  configs/k600/k600_zero_shot_split3.yaml exp/k400/ViT-L/14/f8/last_model.pt
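
The three split results can then be combined with a couple of lines, e.g. (the accuracies below are placeholders, not reported results):

```python
# Combine the three Kinetics-600 split results into mean +/- std.
# The accuracies below are placeholders, not reported results.
import numpy as np

split_acc = np.array([68.0, 67.5, 68.4])   # replace with your three split accuracies
print(f"{split_acc.mean():.1f} +/- {split_acc.std():.1f}")
```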

📌 BibTeX & Citation

If you find this repository useful, please star🌟 this repo and cite📑 our paper:

@inproceedings{wu2023revisiting,
  title={Revisiting classifier: Transferring vision-language models for video recognition},
  author={Wu, Wenhao and Sun, Zhun and Ouyang, Wanli},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={37},
  number={3},
  pages={2847--2855},
  year={2023}
}

@article{wu2023transferring,
  title={Transferring vision-language models for visual recognition: A classifier perspective},
  author={Wu, Wenhao and Sun, Zhun and Song, Yuxin and Wang, Jingdong and Ouyang, Wanli},
  journal={International Journal of Computer Vision},
  pages={1--18},
  year={2023},
  publisher={Springer}
}

If you also find BIKE useful, please cite the paper:

@inproceedings{bike,
  title={Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models},
  author={Wu, Wenhao and Wang, Xiaohan and Luo, Haipeng and Wang, Jingdong and Yang, Yi and Ouyang, Wanli},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}

🎗️ Acknowledgement

This repository is built upon ActionCLIP and CLIP. Sincere thanks for their wonderful work.

👫 Contact

For any questions, please file an issue.