whwu95 / Cap4Video

【CVPR'2023 Highlight & TPAMI】Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
https://arxiv.org/abs/2301.00184
MIT License
cross-modal-learning video-language-understanding video-text-retrieval video-understanding

[![Conference](http://img.shields.io/badge/CVPR-2023(Highlight)-f9f107.svg)](https://openaccess.thecvf.com/content/CVPR2023/html/Wu_Cap4Video_What_Can_Auxiliary_Captions_Do_for_Text-Video_Retrieval_CVPR_2023_paper.html) [![arXiv](https://img.shields.io/badge/Arxiv-2301.00184-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2301.00184)

[Wenhao Wu](https://whwu95.github.io/)<sup>1,2</sup>, [Haipeng Luo]()<sup>3</sup>, [Bo Fang](https://bofang98.github.io/)<sup>3</sup>, [Jingdong Wang](https://jingdongwang2017.github.io/)<sup>2</sup>, [Wanli Ouyang](https://wlouyang.github.io/)<sup>4,1</sup>

<sup>1</sup>[The University of Sydney](https://www.sydney.edu.au/), <sup>2</sup>[Baidu](https://vis.baidu.com/#/), <sup>3</sup>[UCAS](https://english.ucas.ac.cn/), <sup>4</sup>[Shanghai AI Lab](https://www.shlab.org.cn/)

Welcome to the official implementation of Cap4Video - an innovative framework that maximizes the utility of auxiliary captions generated by powerful LLMs (e.g., GPT) to enhance video-text matching.

📣 I also have other cross-modal video projects that may interest you ✨.

> [**Revisiting Classifier: Transferring Vision-Language Models for Video Recognition**](https://arxiv.org/abs/2207.01297)
> Wenhao Wu, Zhun Sun, Wanli Ouyang
> [![Conference](http://img.shields.io/badge/AAAI-2023-f9f107.svg)](https://ojs.aaai.org/index.php/AAAI/article/view/25386/25158) [![Journal](http://img.shields.io/badge/IJCV-2023-Bf107.svg)](https://link.springer.com/article/10.1007/s11263-023-01876-w) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/whwu95/Text4Vis)
>
> [**Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models**](https://arxiv.org/abs/2301.00182)
> Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
> [![Conference](http://img.shields.io/badge/CVPR-2023-f9f107.svg)](https://openaccess.thecvf.com/content/CVPR2023/html/Wu_Bidirectional_Cross-Modal_Knowledge_Exploration_for_Video_Recognition_With_Pre-Trained_Vision-Language_CVPR_2023_paper.html) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/whwu95/BIKE)

News

Overview

Cap4Video leverages captions generated by large language models to improve text-video matching in three ways: (1) input data augmentation during training, (2) intermediate video-caption feature interaction for creating compact video representations, and (3) output score fusion for enhancing text-video matching. Cap4Video is compatible with both global and fine-grained matching.
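
To make the third point (output score fusion) concrete, here is a minimal sketch in plain Python. It assumes that the query-video and query-caption similarities are already available as two matrices of the same shape, and that fusion is a simple weighted sum. The function names `fuse_scores` and `rank_videos` and the weight `alpha` are illustrative assumptions, not names from this repository, and the fusion used in the actual code may differ.

```python
from typing import List

Matrix = List[List[float]]


def fuse_scores(query_video: Matrix, query_caption: Matrix, alpha: float = 0.5) -> Matrix:
    """Weighted sum of two [num_queries x num_videos] similarity matrices.

    `alpha` is a hypothetical weight on the query-caption branch;
    alpha = 0 falls back to plain query-video matching.
    """
    if len(query_video) != len(query_caption):
        raise ValueError("score matrices must have the same number of queries")
    fused = []
    for qv_row, qc_row in zip(query_video, query_caption):
        if len(qv_row) != len(qc_row):
            raise ValueError("score matrices must have the same number of videos")
        fused.append([(1.0 - alpha) * qv + alpha * qc for qv, qc in zip(qv_row, qc_row)])
    return fused


def rank_videos(scores_row: List[float]) -> List[int]:
    """Return video indices sorted from best to worst match for one query."""
    return sorted(range(len(scores_row)), key=lambda j: scores_row[j], reverse=True)


# Toy numbers (made up for illustration): one query, three candidate videos.
query_video = [[0.30, 0.35, 0.10]]    # query vs. video features
query_caption = [[0.60, 0.20, 0.05]]  # query vs. auxiliary caption features

ranking_without_captions = rank_videos(query_video[0])
ranking_with_captions = rank_videos(fuse_scores(query_video, query_caption)[0])
```

In this toy case the caption branch moves video 0 ahead of video 1, which is the kind of correction the fusion step is meant to provide.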

Requirement

# From CLIP
conda install --yes -c pytorch pytorch=1.8.1 torchvision cudatoolkit=11.1
pip install ftfy regex tqdm
pip install opencv-python boto3 requests pandas
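
After installing, a quick way to confirm that the environment is complete is to check that each package can be found by Python. The snippet below is only a convenience sketch and is not part of this repository; the helper name `missing_packages` is hypothetical. Note that `opencv-python` is imported as `cv2`.

```python
import importlib.util
from typing import Iterable, List

# Import names for the packages listed above (opencv-python installs as cv2).
REQUIRED_MODULES = [
    "torch", "torchvision", "ftfy", "regex", "tqdm",
    "cv2", "boto3", "requests", "pandas",
]


def missing_packages(modules: Iterable[str]) -> List[str]:
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```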

Data Preparation

All video datasets can be downloaded from their respective official links. To improve training efficiency, we have preprocessed these videos into frames, which we have packaged and uploaded so that our results can be reproduced conveniently.

| Dataset | Official Link | Ours |
|---------|---------------|------|
| MSRVTT | Video | Frames |
| DiDeMo | Video | Video |
| MSVD | Video | Frames |
| VATEX | Video | Frames |
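
If you prefer to start from the raw videos, the frame preprocessing mentioned above could look roughly like the sketch below. This README does not specify the sampling strategy, frame count, or file naming used for the packaged frames, so everything here is an assumption: the helpers `sample_frame_indices` and `extract_frames`, the default of 12 frames, and the `frame_XXXXXX.jpg` naming are illustrative only. It uses OpenCV (`opencv-python` from the requirements).

```python
import os
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick evenly spaced frame indices, taking the centre of each segment."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def extract_frames(video_path: str, out_dir: str, num_samples: int = 12) -> List[str]:
    """Save `num_samples` evenly spaced frames of a video as JPEG files.

    Hypothetical helper: the sampling and naming are assumptions,
    not the preprocessing used to build the released frame archives.
    """
    import cv2  # imported lazily so the index helper works without OpenCV

    os.makedirs(out_dir, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    saved = []
    for index in sample_frame_indices(total, num_samples):
        capture.set(cv2.CAP_PROP_POS_FRAMES, index)  # seek to the chosen frame
        ok, frame = capture.read()
        if not ok:
            continue  # skip frames that cannot be decoded
        path = os.path.join(out_dir, f"frame_{index:06d}.jpg")
        cv2.imwrite(path, frame)
        saved.append(path)
    capture.release()
    return saved
```

For example, `extract_frames("video0.mp4", "frames/video0")` would write up to 12 JPEG files into `frames/video0`; both paths are placeholders.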

How to Run

Visualization

Text-video retrieval results on the MSR-VTT 1K-A test set. Left: ranking results of the query-video matching model. Right: ranking results of Cap4Video, which incorporates generated captions to enhance retrieval.

📌 BibTeX & Citation

If you use our code in your research or wish to refer to the results, please star 🌟 this repo and use the following BibTeX 📑 entry.

@inproceedings{cap4video,
  title={Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?},
  author={Wu, Wenhao and Luo, Haipeng and Fang, Bo and Wang, Jingdong and Ouyang, Wanli},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023}
}

🎗️ Acknowledgement

This repository builds in part on the excellent work of CLIP4Clip and DRL. We use Video ZeroCap to pre-extract captions from the videos. We sincerely thank the authors of these projects for their invaluable contributions.

👫 Contact

For any questions, please feel free to file an issue.