Jingjing Hu, Dan Guo, Kun Li, Zhan Si, Xun Yang, Xiaojun Chang and Meng Wang
Hefei University of Technology
Task example: video grounding with a text or audio query. The video is described by four queries (events), each with its own semantic context and temporal dependencies. The other queries provide global context (antecedents and consequences) for the current query (e.g., query Q4). In addition, similar historical scenarios (such as the one in the blue dashed box) help uncover relevant event clues (temporal and semantic) for understanding the current scenario (blue solid box).
The architecture of UniSDNet. It consists of a static network and a dynamic network: the Static Semantic Supplement Network (S3Net) and the Dynamic Temporal Filtering Network (DTFNet). S3Net concatenates video clips and multiple queries into a single sequence and encodes them with a lightweight single-stream ResMLP network. DTFNet is a 2-layer graph network with a dynamic Gaussian filtering convolution mechanism that controls message passing between nodes, using temporal distance and semantic relevance as the Gaussian filtering clues when updating node features. The 2D temporal map retains the possible candidate proposals and represents each one by aggregating the features of its moment. Finally, we perform semantic matching between the queries and the proposals and rank the best matches as the predictions.
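To make the idea of Gaussian-filtered message passing more concrete, here is a minimal pure-Python sketch. It is not the DTFNet implementation: it assumes a single Gaussian kernel over temporal distance only (the semantic-relevance clue is left out), and the names `gaussian_weights`, `filter_step`, and `sigma` are hypothetical and chosen for illustration.

```python
import math


def gaussian_weights(num_nodes, sigma=1.0):
    """Row-normalized Gaussian weights based on temporal distance |i - j|.

    Assumption: one fixed kernel width `sigma`; the real model is described
    as dynamic and also uses semantic relevance, which is omitted here.
    """
    weights = []
    for i in range(num_nodes):
        row = [math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2))
               for j in range(num_nodes)]
        total = sum(row)
        weights.append([w / total for w in row])
    return weights


def filter_step(features, sigma=1.0):
    """One message-passing step: each node becomes the Gaussian-weighted
    mean of all node features, so temporally close nodes contribute more."""
    n = len(features)
    dim = len(features[0])
    weights = gaussian_weights(n, sigma)
    return [
        [sum(weights[i][j] * features[j][d] for j in range(n))
         for d in range(dim)]
        for i in range(n)
    ]


# Four toy clip features of dimension 2 (made-up values).
clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
updated = filter_step(clips, sigma=1.0)
```

A smaller `sigma` keeps each node close to its own feature, while a larger one mixes in more distant clips.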
1. Download the original datasets (optional).
The video feature provided by 2D-TAN
ActivityNet Captions C3D feature
Charades-STA VGG feature
TACoS C3D feature
The video I3D feature of Charades-STA dataset from LGI
wget http://cvlab.postech.ac.kr/research/LGI/charades_data.tar.gz
tar zxvf charades_data.tar.gz
mv charades data
rm charades_data.tar.gz
The video C3D feature of Charades-STA dataset from DRN
https://pan.baidu.com/s/1Sn0GYpJmiHa27m9CAN12qw
password:smil
The Audio Captions: ActivityNet Speech Dataset: download the original audio proposed by VGCL
The Audio Captions: Charades-STA Speech Dataset: download the original audio proposed by us.
The Audio Captions: TACoS Speech Dataset: download the original audio proposed by us.
2. Pre-extracted dataset features.
https://pan.baidu.com/xxxx
password:xxxx
3. Prepare the files in the following structure.
UniSDNet
├── configs
├── dataset
├── dtfnet
├── data
│   ├── activitynet
│   │   ├── *text features
│   │   ├── *audio features
│   │   └── *video c3d features
│   ├── charades
│   │   ├── *text features
│   │   ├── *audio features
│   │   ├── *video vgg features
│   │   ├── *video c3d features
│   │   └── *video i3d features
│   └── tacos
│       ├── *text features
│       ├── *audio features
│       └── *video c3d features
├── train_net.py
├── test_net.py
└── ···
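Before training, it can help to confirm that the dataset folders are where the tree above expects them. The following is a small sketch, not part of the repository: `missing_dirs` is a hypothetical helper, and it only checks the three dataset folders because the exact feature file names are not listed here.

```python
from pathlib import Path

# Dataset folders taken from the directory tree above. The feature file
# names inside them are not specified in this README, so they are not checked.
EXPECTED_DIRS = ["data/activitynet", "data/charades", "data/tacos"]


def missing_dirs(root):
    """Return the expected dataset folders that do not exist under `root`."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


# Run from the UniSDNet project root.
for d in missing_dirs("."):
    print(f"missing: {d}")
```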
4. Alternatively, set your own dataset paths in the following file:
dtfnet/config/paths_catalog.py
pip install yacs h5py terminaltables tqdm librosa transformers
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
conda config --add channels pytorch
conda install pytorch-geometric -c rusty1s -c conda-forge
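After installation, a quick check like the one below can report which packages are still not importable. This is a sketch rather than a repository script: `missing_modules` is a hypothetical helper, and the import names in `REQUIRED` are assumed from the packages installed above (for example, `pytorch-geometric` is imported as `torch_geometric`).

```python
import importlib.util

# Import names assumed for the packages installed by the commands above.
REQUIRED = [
    "yacs", "h5py", "terminaltables", "tqdm", "librosa", "transformers",
    "torch", "torchvision", "torchaudio", "torch_geometric",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing:", missing_modules(REQUIRED) or "none")
```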
To train a model, run:
python train_net.py --config-file configs/xxxx.yaml
Our trained models are available on Baidu Yun (passcode: d4yl) or Google Drive. Please download them to the checkpoints/best/ folder.
Use the following commands for testing:
python test_net.py --config-file checkpoints/best/xxxx.yaml --ckpt checkpoints/best/xxxx.pth
ActivityNet Captions | Rank1@0.5 | Rank1@0.7 | Rank5@0.5 | Rank5@0.7 | mIoU |
---|---|---|---|---|---|
UniSDNet | 60.75 | 38.88 | 85.34 | 74.01 | 55.47 |
TACoS | Rank1@0.3 | Rank1@0.5 | Rank5@0.3 | Rank5@0.5 | mIoU |
---|---|---|---|---|---|
UniSDNet | 55.56 | 40.26 | 77.08 | 64.01 | 38.88 |
Charades-STA (VGG) | Rank1@0.5 | Rank1@0.7 | Rank5@0.5 | Rank5@0.7 | mIoU |
---|---|---|---|---|---|
UniSDNet | 48.41 | 28.33 | 84.76 | 59.46 | 44.41 |
Charades-STA (C3D) | Rank1@0.5 | Rank1@0.7 | Rank5@0.5 | Rank5@0.7 | mIoU |
---|---|---|---|---|---|
UniSDNet | 49.57 | 28.39 | 84.70 | 58.49 | 44.29 |
Charades-STA (I3D) | Rank1@0.5 | Rank1@0.7 | Rank5@0.5 | Rank5@0.7 | mIoU |
---|---|---|---|---|---|
UniSDNet | 61.02 | 39.70 | 89.97 | 73.20 | 52.69 |
ActivityNet Speech | Rank1@0.3 | Rank1@0.5 | Rank1@0.7 | Rank5@0.3 | Rank5@0.5 | Rank5@0.7 | mIoU |
---|---|---|---|---|---|---|---|
UniSDNet | 72.27 | 56.29 | 33.29 | 90.41 | 84.28 | 72.42 | 52.22 |
TACoS Speech | Rank1@0.3 | Rank1@0.5 | Rank1@0.7 | Rank5@0.3 | Rank5@0.5 | Rank5@0.7 | mIoU |
---|---|---|---|---|---|---|---|
UniSDNet | 51.66 | 37.77 | 20.44 | 76.38 | 63.48 | 33.64 | 36.86 |
Charades-STA Speech (VGG) | Rank1@0.3 | Rank1@0.5 | Rank1@0.7 | Rank5@0.3 | Rank5@0.5 | Rank5@0.7 | mIoU |
---|---|---|---|---|---|---|---|
UniSDNet | 60.73 | 46.37 | 26.72 | 92.66 | 82.31 | 57.66 | 42.28 |
Charades-STA Speech (I3D) | Rank1@0.3 | Rank1@0.5 | Rank1@0.7 | Rank5@0.3 | Rank5@0.5 | Rank5@0.7 | mIoU |
---|---|---|---|---|---|---|---|
UniSDNet | 67.45 | 53.82 | 34.49 | 94.81 | 87.90 | 69.30 | 48.27 |
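For readers unfamiliar with the column names in the tables above, the sketch below shows how Rank n@m and mIoU are commonly computed in video grounding. It is an illustration under assumptions, not the evaluation code of this repository: the function names are hypothetical, the predictions are made-up values, and whether the threshold comparison is `>=` or `>` is an assumption.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # The hull equals the true union whenever the segments overlap;
    # when they do not, inter is 0 and the IoU is 0 either way.
    hull = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / hull if hull > 0 else 0.0


def rank_n_at_m(ranked_preds, gts, n, m):
    """Percentage of queries whose top-n predictions contain at least one
    segment with IoU >= m (assumed non-strict comparison)."""
    hits = sum(
        any(temporal_iou(p, gt) >= m for p in preds[:n])
        for preds, gt in zip(ranked_preds, gts)
    )
    return 100.0 * hits / len(gts)


def mean_iou(ranked_preds, gts):
    """Mean IoU (in percent) of the top-1 prediction over all queries."""
    total = sum(temporal_iou(preds[0], gt) for preds, gt in zip(ranked_preds, gts))
    return 100.0 * total / len(gts)


# Two toy queries, each with predictions sorted by descending score.
preds = [[(0, 10), (5, 15)], [(20, 30), (40, 50)]]
gts = [(0, 8), (42, 50)]
print(rank_n_at_m(preds, gts, n=1, m=0.5))  # first query hits, second misses
print(rank_n_at_m(preds, gts, n=5, m=0.5))  # second query hits via its 2nd prediction
print(mean_iou(preds, gts))
```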
If you find the repository or the paper useful, please use the following entry for citation.
@article{hu2024unified,
  title={Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding},
  author={Jingjing Hu and Dan Guo and Kun Li and Zhan Si and Xun Yang and Xiaojun Chang and Meng Wang},
  year={2024},
  journal={CoRR},
  volume={abs/2403.14174},
}
If you have any questions, feel free to contact the author: Jingjing Hu (xianhjj623@gmail.com)
The annotation files and many parts of the implementation are borrowed from MMN. Our code is released under the MIT license.