[Paper] [Checkpoint] [Dataset]
Now only the checkpoints of vicuna-7b-v0 and hawkeye.pth are needed to load HawkEye.
Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos. However, they perform close to random chance when grounding text queries in long and complicated videos, showing little ability to understand and reason about temporal information, which is the most fundamental difference between videos and images.
We propose HawkEye, one of the first video-text LLMs that can perform temporal video grounding in a fully text-to-text manner. To collect training data that is applicable for temporal video grounding, we construct InternVid-G, a large-scale video-text corpus with segment-level captions and negative spans, with which we introduce two new time-aware training objectives to video-text LLMs. We also propose a coarse-grained method of representing segments in videos, which is more robust and easier for LLMs to learn and follow than other alternatives.
We release the model checkpoints of HawkEye and of our implementation of VideoChat2, together with the InternVid-G dataset, on 🤗 HuggingFace.
Live Demo In progress
You can use demo.ipynb to test HawkEye on your data.
Create the model/ folder for model checkpoints: mkdir model/
After downloading all model checkpoints, the model/ folder should look like this:
├── hawkeye.pth
├── vicuna-7b-v0/
└── VideoChat2/ (optional)
    ├── umt_l16_qformer.pth
    └── videochat2_7b_stage2.pth
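As a quick sanity check before loading HawkEye, the layout above can be verified with a short script. This is a minimal sketch rather than part of the repository: the helper name `missing_checkpoints` is hypothetical, and it assumes only the entries listed above, treating the VideoChat2/ files as optional.

```python
from pathlib import Path

# Entries taken from the layout above; VideoChat2/ is marked optional there.
REQUIRED = ["hawkeye.pth", "vicuna-7b-v0"]
OPTIONAL = ["VideoChat2/umt_l16_qformer.pth", "VideoChat2/videochat2_7b_stage2.pth"]


def missing_checkpoints(model_dir="model", include_optional=False):
    """Return the expected entries that are absent from model_dir (hypothetical helper)."""
    root = Path(model_dir)
    expected = REQUIRED + (OPTIONAL if include_optional else [])
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints("model")
    print("all required checkpoints found" if not missing else f"missing: {missing}")
```

Pass `include_optional=True` if you also plan to use the VideoChat2 checkpoints.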
Download the data from the Dataset Homepage on 🤗 HuggingFace and save it in the data/HawkEye-IT/ folder. We also provide data processing code in data_preparing/, which you can use for reference.
Note that you also need to download the videos of each dataset from their original links, as further explained on the dataset homepage (this may take quite a while). Use soft links to link each video folder under data/videos/.
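The linking step might look like the following sketch. The helper `link_video_dir` and the example source path are hypothetical; only the data/videos/ target comes from the instructions above, and the link name should match the dataset folder name the repository expects.

```python
import os
from pathlib import Path


def link_video_dir(source_dir, dataset_name, videos_root="data/videos"):
    """Create videos_root/dataset_name as a soft link to source_dir (hypothetical helper)."""
    source = Path(source_dir).resolve()
    if not source.is_dir():
        raise FileNotFoundError(f"video folder not found: {source}")
    root = Path(videos_root)
    root.mkdir(parents=True, exist_ok=True)
    link = root / dataset_name
    if link.is_symlink() or link.exists():
        # Leave existing entries alone rather than overwriting them.
        return link
    os.symlink(source, link, target_is_directory=True)
    return link


if __name__ == "__main__":
    # Example path only; replace it with where the videos were actually downloaded.
    example_source = "/path/to/downloaded/charades"
    if Path(example_source).is_dir():
        print(link_video_dir(example_source, "charades"))
```

Linking instead of copying keeps the large video folders in one place while still exposing them under data/videos/.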
After data preparation, the data/ folder should look like this:
├── HawkEye-IT/
│   ├── image/ # inherited from VideoChat2-IT, but not used in training HawkEye
│   └── video/
│       ├── temporal/
│       │   ├── internvid_grounding/, charades_sta_grounding/, anetc_grounding/
│       │   │   └── instructions.json, questions.json, train.json
│       │   └── internvid_caption/
│       │       └── instructions.json, train.json
│       └── caption/, classification/, conversation/, vqa/, reasoning/
└── videos/
    ├── internvid-g/, clevrer/, webvid/, activitynet/, tgif/,
    └── nextqa/, textvr/, youcook2/, kinetics/, ssv2/, charades/
Note that the image/, caption/, classification/, conversation/, vqa/, and reasoning/ folders of HawkEye-IT are identical to those of VideoChat2-IT.
bash ./scripts/train/run_7b_stage3.sh OUTPUT_PATH
The instruction-tuned HawkEye checkpoint will be saved in OUTPUT_PATH/ckpt_${ckpt}.pth, where ${ckpt} is the number of training iterations.
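To pick the most recent checkpoint from OUTPUT_PATH, the file names can be sorted by iteration count. A minimal sketch, assuming only the ckpt_${ckpt}.pth naming described above; `latest_checkpoint` is a hypothetical helper, not a function from this repository.

```python
import re
from pathlib import Path

# Matches ckpt_<iterations>.pth, where <iterations> is a non-negative integer.
CKPT_PATTERN = re.compile(r"ckpt_(\d+)\.pth")


def latest_checkpoint(output_path):
    """Return the ckpt_<iterations>.pth file with the largest iteration count, or None."""
    candidates = []
    for path in Path(output_path).glob("ckpt_*.pth"):
        match = CKPT_PATTERN.fullmatch(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare by the parsed iteration count, not by string order.
    return max(candidates, key=lambda item: item[0])[1]
```

Parsing the number avoids the string-ordering pitfall where ckpt_2000.pth would sort after ckpt_10000.pth.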
Check the script to ensure the hyperparameters fit your computing device.
We also provide the scripts to finetune on Charades-STA and ActivityNet-Captions:
# IT_CKPT: the instruction-tuned HawkEye checkpoint
bash ./scripts/train/charades_sta.sh OUTPUT_PATH IT_CKPT
bash ./scripts/train/anetc.sh OUTPUT_PATH IT_CKPT
Check the script to ensure the hyperparameters fit your computing device.
Download MVBench and save it in the data/MVBench/ folder.
Download the annotations of the other benchmarks from Google Drive and unzip them to data/test-anno/. We also provide data processing code in data_preparing/, which you can use for reference.
Download the TVQA videos and link them at data/videos/tvqa.
After downloading all benchmarks, the data/ folder should look like this:
├── HawkEye-IT/ # instruct tuning datasets
├── MVBench/
├── test-anno/
│   ├── charades_sta-recursive_grounding.json, anetc-recursive_grounding.json
│   ├── nextgqa-recursive_grounding.json
│   └── nextqa-test.json, tvqa-test.json, star-test.json
└── videos/
    └── nextqa/, tvqa/, charades/, activitynet/, ...
bash ./scripts/test/videoqa.sh
Refer to data_preparing/videoqa.py to convert the model outputs to the format required by the STAR evaluation and the TVQA evaluation with timestamps (w/ ts).
bash ./scripts/test/recursive_grounding.sh
To analyze the results of each recursive grounding step, refer to data_preparing/check_grounding_results.ipynb.
If you find this code useful in your research, please consider citing:
@misc{wang2024hawkeye,
title={HawkEye: Training Video-Text LLMs for Grounding Text in Videos},
author={Yueqian Wang and Xiaojun Meng and Jianxin Liang and Yuxuan Wang and Qun Liu and Dongyan Zhao},
year={2024},
eprint={2403.10228},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This project is based on VideoChat and VideoChat2. Thanks for their great work!