xian-sh / UniSDNet

MIT License

Unified Static and Dynamic: Temporal Filtering Network for Efficient Video Grounding

Jingjing Hu, Dan Guo, Kun Li, Zhan Si, Xun Yang, Xiaojun Chang and Meng Wang

Hefei University of Technology


Task example: video grounding with a text or audio query. The video is described by four queries (events), each with its own semantic context and temporal dependencies. The other queries provide global context (antecedents and consequences) for the current query (e.g., query Q4). In addition, similar historical scenarios (such as the one in the blue dashed box) help uncover relevant event clues (temporal and semantic) for understanding the current scenario (blue solid box).


The architecture of UniSDNet. It mainly consists of a static network and a dynamic network: the Static Semantic Supplement Network (S3Net) and the Dynamic Temporal Filtering Network (DTFNet). S3Net concatenates video clips and multiple queries into a sequence and encodes them with a lightweight single-stream ResMLP network. DTFNet is a 2-layer graph network with a dynamic Gaussian filtering convolution mechanism that controls message passing between nodes, using temporal distance and semantic relevance as the Gaussian filtering cues when updating node features. The 2D temporal map retains the possible candidate proposals and represents each one by aggregating the features of its moment. Finally, we perform semantic matching between the queries and the proposals and rank the best ones as predictions.
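To make the two ideas in the description above more concrete, here is a minimal pure-Python sketch. It is not the repository's implementation. It assumes a fixed Gaussian bandwidth `sigma`, uses only temporal distance as the filtering cue (the semantic-relevance cue and all learned parameters are omitted), and assumes element-wise max pooling as the proposal aggregation. The function names `gaussian_weight`, `filtered_update`, and `proposal_map` are hypothetical.

```python
import math


def gaussian_weight(distance, sigma=1.0):
    """Gaussian kernel on a scalar distance; sigma is an assumed bandwidth."""
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))


def filtered_update(features, sigma=1.0):
    """One round of message passing over clip nodes in temporal order.

    Each node becomes a normalized, Gaussian-weighted average of all nodes,
    where the weight decays with the temporal distance |i - j|.
    """
    n = len(features)
    dim = len(features[0])
    updated = []
    for i in range(n):
        weights = [gaussian_weight(abs(i - j), sigma) for j in range(n)]
        total = sum(weights)
        updated.append([
            sum(w * features[j][d] for j, w in enumerate(weights)) / total
            for d in range(dim)
        ])
    return updated


def proposal_map(features):
    """Build the upper-triangular 2D map of candidate moments.

    Entry (s, e) with s <= e covers clips s..e and is represented by
    element-wise max pooling over those clips (an assumed aggregation).
    """
    n = len(features)
    table = {}
    for s in range(n):
        pooled = list(features[s])
        for e in range(s, n):
            pooled = [max(p, x) for p, x in zip(pooled, features[e])]
            table[(s, e)] = pooled
    return table


if __name__ == "__main__":
    clips = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
    smoothed = filtered_update(clips, sigma=1.0)
    candidates = proposal_map(smoothed)
    print(len(candidates), "candidate moments")
```

With `n` clips the map holds `n * (n + 1) / 2` candidates, one for every start/end pair with start not after end. Each candidate would then be scored against the query features.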


To be updated

Download and prepare the datasets

1. Download the original datasets (optional).

2. Pre-extracted dataset features.


3. Prepare the files in the following structure.

  ├── configs
  ├── dataset
  ├── dtfnet
  ├── data
  │   ├── activitynet
  │   │   ├── *text features
  │   │   ├── *audio features
  │   │   └── *video c3d features
  │   ├── charades
  │   │   ├── *text features
  │   │   ├── *audio features
  │   │   ├── *video vgg features
  │   │   ├── *video c3d features
  │   │   └── *video i3d features
  │   └── tacos
  │       ├── *text features
  │       ├── *audio features
  │       └── *video c3d features
  ├── train_net.py
  ├── test_net.py
  └── ···

4. Or set your own dataset path in the following .py file.
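To sanity-check the layout from step 3 before training, a small script along these lines may help. It is a sketch, not part of the repository: the helper name `missing_dataset_dirs` is hypothetical, and because the tree does not give the feature file names, it checks only that the three dataset directories exist.

```python
from pathlib import Path

# Dataset folders taken from the directory tree in step 3. Feature file
# names are not specified there, so only the directories are checked.
DATASETS = ("activitynet", "charades", "tacos")


def missing_dataset_dirs(root="."):
    """Return the expected data/<dataset> directories missing under root."""
    data_dir = Path(root) / "data"
    return [
        str(data_dir / name)
        for name in DATASETS
        if not (data_dir / name).is_dir()
    ]


if __name__ == "__main__":
    missing = missing_dataset_dirs(".")
    if missing:
        print("Missing dataset directories:")
        for path in missing:
            print("  " + path)
    else:
        print("All dataset directories found.")
```

Run it from the repository root. If you only use one dataset, the other directories can be reported as missing without consequence.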



pip install yacs h5py terminaltables tqdm librosa transformers
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
conda config --add channels pytorch
conda install pytorch-geometric -c rusty1s -c conda-forge
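After running the commands above, you can check that the packages are visible to the interpreter with the sketch below. It only checks whether each package can be found and does not verify versions or CUDA support. The import names are assumed to be the usual ones for these packages (for example, `pytorch-geometric` imports as `torch_geometric`), and the helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Import names for the packages installed above.
REQUIRED = [
    "yacs", "h5py", "terminaltables", "tqdm", "librosa", "transformers",
    "torch", "torchvision", "torchaudio", "torch_geometric",
]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```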


For training, run the following command:

python train_net.py --config-file configs/xxxx.yaml


Our trained models are provided on Baidu Yun (passcode: d4yl) or Google Drive. Please download them to the checkpoints/best/ folder, then use the following command for testing:

python test_net.py --config-file checkpoints/best/xxxx.yaml --ckpt checkpoints/best/xxxx.pth

Main NLVG (natural language video grounding) results:

ActivityNet Captions
Method    Rank1@0.5  Rank1@0.7  Rank5@0.5  Rank5@0.7  mIoU
UniSDNet  60.75      38.88      85.34      74.01      55.47

TACoS
Method    Rank1@0.3  Rank1@0.5  Rank5@0.3  Rank5@0.5  mIoU
UniSDNet  55.56      40.26      77.08      64.01      38.88

Charades-STA (VGG)
Method    Rank1@0.5  Rank1@0.7  Rank5@0.5  Rank5@0.7  mIoU
UniSDNet  48.41      28.33      84.76      59.46      44.41

Charades-STA (C3D)
Method    Rank1@0.5  Rank1@0.7  Rank5@0.5  Rank5@0.7  mIoU
UniSDNet  49.57      28.39      84.70      58.49      44.29

Charades-STA (I3D)
Method    Rank1@0.5  Rank1@0.7  Rank5@0.5  Rank5@0.7  mIoU
UniSDNet  61.02      39.70      89.97      73.20      52.69

Main SLVG (spoken language video grounding) results:

ActivityNet Speech
Method    Rank1@0.3  Rank1@0.5  Rank1@0.7  Rank5@0.3  Rank5@0.5  Rank5@0.7  mIoU
UniSDNet  72.27      56.29      33.29      90.41      84.28      72.42      52.22

TACoS Speech
Method    Rank1@0.3  Rank1@0.5  Rank1@0.7  Rank5@0.3  Rank5@0.5  Rank5@0.7  mIoU
UniSDNet  51.66      37.77      20.44      76.38      63.48      33.64      36.86

Charades-STA Speech (VGG)
Method    Rank1@0.3  Rank1@0.5  Rank1@0.7  Rank5@0.3  Rank5@0.5  Rank5@0.7  mIoU
UniSDNet  60.73      46.37      26.72      92.66      82.31      57.66      42.28

Charades-STA (I3D)
Method    Rank1@0.3  Rank1@0.5  Rank1@0.7  Rank5@0.3  Rank5@0.5  Rank5@0.7  mIoU
UniSDNet  67.45      53.82      34.49      94.81      87.90      69.30      48.27
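For readers unfamiliar with the metrics reported above, the sketch below shows how Rank-n@m and mIoU are commonly computed for video grounding. It is an illustration, not the repository's evaluation code. It assumes each query has one ground-truth interval, that predictions are (start, end) pairs sorted by score, and that a hit means IoU greater than or equal to the threshold. The function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def rank_at(ranked_preds, gts, n, m):
    """Rank-n@m: percentage of queries with a top-n prediction of IoU >= m."""
    hits = sum(
        any(temporal_iou(p, gt) >= m for p in preds[:n])
        for preds, gt in zip(ranked_preds, gts)
    )
    return 100.0 * hits / len(gts)


def mean_iou(ranked_preds, gts):
    """mIoU: mean IoU (in percent) of the top-1 prediction over all queries."""
    total = sum(temporal_iou(preds[0], gt) for preds, gt in zip(ranked_preds, gts))
    return 100.0 * total / len(gts)


if __name__ == "__main__":
    # Two toy queries, each with score-sorted predictions in seconds.
    ranked_preds = [
        [(0.0, 10.0), (5.0, 15.0)],
        [(0.0, 4.0), (20.0, 30.0)],
    ]
    gts = [(0.0, 10.0), (22.0, 30.0)]
    print("Rank1@0.5:", rank_at(ranked_preds, gts, n=1, m=0.5))
    print("Rank5@0.7:", rank_at(ranked_preds, gts, n=5, m=0.7))
    print("mIoU:", mean_iou(ranked_preds, gts))
```

Under these definitions, Rank1@0.5 counts a query as correct only if its single best-scored proposal overlaps the ground truth with IoU of at least 0.5, while Rank5@0.5 accepts any of the five best-scored proposals.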


If you find the repository or the paper useful, please use the following entry for citation.

  title={Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding},
  author={Jingjing Hu and Dan Guo and Kun Li and Zhan Si and Xun Yang and Xiaojun Chang and Meng Wang},


If there are any questions, feel free to contact the author: Jingjing Hu (xianhjj623@gmail.com)


The annotation files and many parts of the implementation are borrowed from MMN. Our code is released under the MIT license.