We have released MMAction, a full-fledged action understanding toolbox based on PyTorch. It includes an implementation of TSN as well as other SOTA frameworks for various tasks. We highly recommend you switch to it. This repo will continue to be supported for Caffe users.
This repository holds the code and models for the papers
Temporal Segment Networks for Action Recognition in Videos, Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool, TPAMI, 2018.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool, ECCV 2016, Amsterdam, Netherlands.
Jul. 20, 2018 - For those having trouble building the TSN toolkit, we provide a pre-built Docker image. Download it from DockerHub. It contains OpenCV, Caffe, DenseFlow, and this codebase, all built and ready to use with NVIDIA-Docker.
Sep. 8, 2017 - We released TSN models trained on the Kinetics dataset with 76.6% single model top-1 accuracy. Find the model weights and transfer learning experiment results on the website.
Aug. 10, 2017 - An experimental PyTorch implementation of TSN has been released on GitHub.
Nov. 5, 2016 - The project page for TSN is online.
Sep. 14, 2016 - We fixed a legacy bug in Caffe. Some parameters in TSN training are affected. You are advised to update to the latest version.
Below is the guidance to reproduce the reported results and explore more.
There are a few dependencies to run the code. The major libraries we use are
The codebase is written in Python. We recommend the Anaconda Python distribution. Matlab scripts are provided for some critical steps like video-level testing.
The most straightforward way to install these libraries is to run the build_all.sh script.
Besides software, GPU(s) are required for optical flow extraction and model training. Our Caffe modification supports highly efficient parallel training. Just throw in as many GPUs as you like and enjoy.
Use git to clone this repository and its submodules
git clone --recursive https://github.com/yjxiong/temporal-segment-networks
Then run the building scripts to build the libraries.
bash build_all.sh
It will build Caffe and dense_flow. Since we need OpenCV to have Video IO, which is absent in most default installations, it will also download and build a local installation of OpenCV and use its Python interfaces.
Note that to run training with multiple GPUs, one needs to enable MPI support of Caffe. To do this, run
MPI_PREFIX=<root path to openmpi installation> bash build_all.sh MPI_ON
We experimented on two mainstream action recognition datasets: UCF-101 and HMDB51. Videos can be downloaded directly from their websites.
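After download, please extract the videos from the rar archives.
For UCF-101, use
unrar x UCF101.rar
to extract the videos.
For HMDB51, the commands below first unpack the downloaded archive into rars/ and then unpack each archive found there into videos/:
mkdir rars && mkdir videos
unrar x hmdb51-org.rar rars/
for a in $(ls rars); do unrar x "rars/${a}" videos/; done;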
We provide the trained models in Caffe style, consisting of model specifications in Protobuf messages and the model weights. The codebase includes the model specs for UCF101 and HMDB51. The model weights can be downloaded by running the script
bash scripts/get_reference_models.sh
To run the training and testing, we need to decompose the video into frames. Also the temporal stream networks need optical flow or warped optical flow images for input.
These can be achieved with the script scripts/extract_optical_flow.sh. The script has three arguments:
SRC_FOLDER points to the folder where you put the video dataset
OUT_FOLDER points to the root folder where the extracted frames and optical flow images will be put
NUM_WORKER specifies the number of GPUs to use in parallel for flow extraction; it must be larger than 1
The command for running optical flow extraction is as follows
bash scripts/extract_optical_flow.sh SRC_FOLDER OUT_FOLDER NUM_WORKER
Extracting optical flow for a whole dataset takes from several hours to several days, depending on the number of GPUs.
To help reproduce the results reported in the paper, we provide reference models trained by us for instant testing. Please use the following command to get the reference models.
bash scripts/get_reference_models.sh
We provide a Python framework to run the testing. For the benchmark datasets, we will measure average accuracy on the testing splits. We also provide the facility to analyze a single video.
Generally, to test on the benchmark dataset, we can use the scripts eval_net.py and eval_scores.py.
For example, to test the reference RGB stream model on split 1 of UCF-101 with 4 GPUs, run
python tools/eval_net.py ucf101 1 rgb FRAME_PATH \
models/ucf101/tsn_bn_inception_rgb_deploy.prototxt models/ucf101_split_1_tsn_rgb_reference_bn_inception.caffemodel \
--num_worker 4 --save_scores SCORE_FILE
where FRAME_PATH is the path you extracted the frames of UCF-101 to and SCORE_FILE is the filename to store the extracted scores.
One can also use cached score files to evaluate the performance. To do this, issue the following command
python tools/eval_scores.py SCORE_FILE
The more important function of eval_scores.py is to do modality fusion.
For example, suppose we have the scores of the rgb stream in RGB_SCORE_FILE and of the flow stream in FLOW_SCORE_FILE.
The fusion result with weights of 1:1.5 can be obtained with
python tools/eval_scores.py RGB_SCORE_FILE FLOW_SCORE_FILE --score_weights 1 1.5
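Conceptually, this kind of fusion is a weighted sum of the per-class scores of the two streams followed by an argmax. The following is a minimal sketch of that idea in plain Python. It is not the code in eval_scores.py: the function names, the toy scores, and the plain fraction-correct metric are all assumptions made for illustration.

```python
def fuse_scores(score_lists, weights):
    """Weighted sum of per-class scores from several modalities (one video)."""
    if len(score_lists) != len(weights):
        raise ValueError("need exactly one weight per modality")
    num_classes = len(score_lists[0])
    fused = [0.0] * num_classes
    for scores, weight in zip(score_lists, weights):
        if len(scores) != num_classes:
            raise ValueError("all modalities must score the same classes")
        for cls, score in enumerate(scores):
            fused[cls] += weight * score
    return fused


def predict(scores):
    """Index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(rgb_scores, flow_scores, labels, weights=(1.0, 1.5)):
    """Fraction of videos whose fused prediction matches the label."""
    correct = 0
    for rgb, flow, label in zip(rgb_scores, flow_scores, labels):
        if predict(fuse_scores([rgb, flow], weights)) == label:
            correct += 1
    return correct / len(labels)


# Toy data (hypothetical): two videos, two classes.
rgb = [[0.6, 0.4], [0.7, 0.3]]
flow = [[0.2, 0.8], [0.6, 0.4]]
labels = [1, 0]
print(accuracy(rgb, flow, labels))
```

In the toy data the rgb stream alone misclassifies the first video, while the 1:1.5 weighted sum lets the flow stream correct it.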
To view the full help message of these scripts, run python tools/eval_net.py -h or python tools/eval_scores.py -h.
Training TSN is straightforward. We have provided the necessary model specs, solver configs, and initialization models. To achieve optimal training speed, we strongly advise you to turn on the parallel training support in the Caffe toolbox using the following build command
MPI_PREFIX=<root path to openmpi installation> bash build_all.sh MPI_ON
where MPI_PREFIX points to the root of the OpenMPI installation, for example /usr/local/openmpi/.
The data feeding in training relies on VideoDataLayer in Caffe.
This layer uses a list file to specify its data sources.
Each line of the list file contains a tuple of the extracted video frame path, the number of frames in the video, and the ground-truth class of the video.
A list file looks like
video_frame_path 100 10
video_2_frame_path 150 31
...
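To make the format concrete, here is a small sketch that parses such lines into records. It assumes whitespace-separated fields in the order shown above; the names VideoRecord, parse_list_line, and parse_list_file are hypothetical and are not part of this codebase.

```python
from collections import namedtuple

# One entry of a list file: frame folder, frame count, class index.
VideoRecord = namedtuple("VideoRecord", ["frame_path", "num_frames", "label"])


def parse_list_line(line):
    """Parse one 'path num_frames label' line into a VideoRecord."""
    # Split from the right so that a path containing spaces stays intact.
    frame_path, num_frames, label = line.strip().rsplit(None, 2)
    return VideoRecord(frame_path, int(num_frames), int(label))


def parse_list_file(lines):
    """Parse an iterable of lines, skipping blank ones."""
    return [parse_list_line(line) for line in lines if line.strip()]


example = ["video_frame_path 100 10\n", "video_2_frame_path 150 31\n"]
records = parse_list_file(example)
print(records[0])
```

Because parse_list_file accepts any iterable of lines, an open file object can be passed to it directly.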
To build the file lists for all 3 splits of the two benchmark datasets, we have provided a script. Just use the following commands
bash scripts/build_file_list.sh ucf101 FRAME_PATH
and
bash scripts/build_file_list.sh hmdb51 FRAME_PATH
The generated list files will be put in data/ with names like ucf101_flow_val_split_2.txt.
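As a quick sanity check after running the script, one could enumerate the file names to expect. The sketch below assumes the naming pattern dataset_modality_subset_split_k.txt, which is inferred only from the file names mentioned in this README; the script may well produce other files, and the helper names are hypothetical.

```python
import os
from itertools import product


def expected_list_files(dataset, modalities=("rgb", "flow"),
                        subsets=("train", "val"), splits=(1, 2, 3),
                        root="data"):
    """File names following the (assumed) pattern used for list files."""
    return [
        "{}/{}_{}_{}_split_{}.txt".format(root, dataset, modality, subset, split)
        for modality, subset, split in product(modalities, subsets, splits)
    ]


def missing_list_files(dataset, **kwargs):
    """Expected list files that do not exist on disk."""
    return [p for p in expected_list_files(dataset, **kwargs)
            if not os.path.exists(p)]


names = expected_list_files("ucf101")
print(len(names), names[0])
```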
We have built the initialization model weights for both rgb and flow input. The flow initialization models implement the cross-modality training technique in the paper. To download the model weights, run
bash scripts/get_init_models.sh
Once all the prerequisites are ready, we can start training TSN.
For this, use the script scripts/train_tsn.sh.
For example, the following command runs training on UCF101 with rgb input
bash scripts/train_tsn.sh ucf101 rgb
The training will run with default settings on 4 GPUs. Usually, it takes around 1 hour to train the rgb model and 4 hours for flow models, on 4 GTX Titan X GPUs.
The learned model weights will be saved in models/.
The aforementioned testing process can be used to evaluate them.
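Here we provide some information on customizing the training process.
To train on a different split, modify the corresponding model spec and solver files. For example, to switch UCF101 rgb training from split 1 to split 2, edit the file models/ucf101/tsn_bn_inception_rgb_train_val.prototxt.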
On line 8, change
source: "data/ucf101_rgb_train_split_1.txt"
to
source: "data/ucf101_rgb_train_split_2.txt"
On line 34, change
source: "data/ucf101_rgb_val_split_1.txt"
to
source: "data/ucf101_rgb_val_split_2.txt"
Also, in the solver file models/ucf101/tsn_bn_inception_rgb_solver.prototxt, on line 12 change
snapshot_prefix: "models/ucf101_split1_tsn_rgb_bn_inception"
to
snapshot_prefix: "models/ucf101_split2_tsn_rgb_bn_inception"
in order to distinguish the learned weights.
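The edits above can also be scripted. The sketch below rewrites the split references in the text of a prototxt file using two regular expressions. It assumes that the only split references are of the two forms shown above (_split_1.txt in data sources and _split1_ in the snapshot prefix); switch_split is a hypothetical helper, and reading or writing the actual files is left to the caller.

```python
import re


def switch_split(text, old_split, new_split):
    """Return prototxt text with split references moved to new_split."""
    # Data list files, e.g. data/ucf101_rgb_train_split_1.txt
    text = re.sub(r"_split_{}\.txt".format(old_split),
                  "_split_{}.txt".format(new_split), text)
    # Snapshot prefix, e.g. models/ucf101_split1_tsn_rgb_bn_inception
    text = re.sub(r"_split{}_".format(old_split),
                  "_split{}_".format(new_split), text)
    return text


train_val = 'source: "data/ucf101_rgb_train_split_1.txt"\n' \
            'source: "data/ucf101_rgb_val_split_1.txt"\n'
solver = 'snapshot_prefix: "models/ucf101_split1_tsn_rgb_bn_inception"\n'

print(switch_split(train_val, 1, 2))
print(switch_split(solver, 1, 2))
```

It is worth diffing the result against the original file before training, since the helper only looks at these two patterns.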
To change the number of GPUs used for training, modify N_GPU in scripts/train_tsn.sh.
Important notice: when the GPU number is changed, the effective batchsize is also changed.
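To make this concrete: the effective batch size is the product batch_size * iter_size * n_gpu, so changing n_gpu alone changes it. The sketch below computes the iter_size that restores a chosen target. The helper names and the example values (batch_size of 8, 4 or 2 GPUs) are hypothetical and are not read from the provided prototxt files; iter_size is assumed to be the gradient accumulation setting of the Caffe solver.

```python
def effective_batch_size(batch_size, iter_size, n_gpu):
    """Number of samples contributing to one parameter update."""
    return batch_size * iter_size * n_gpu


def iter_size_for(batch_size, n_gpu, target=128):
    """iter_size that makes the effective batch size equal to target."""
    per_iteration = batch_size * n_gpu
    if target % per_iteration != 0:
        raise ValueError(
            "target {} is not a multiple of batch_size * n_gpu = {}".format(
                target, per_iteration))
    return target // per_iteration


# Hypothetical values: going from 4 GPUs to 2 GPUs at batch_size 8.
print(iter_size_for(batch_size=8, n_gpu=4))
print(iter_size_for(batch_size=8, n_gpu=2))
```

Halving the number of GPUs while keeping batch_size fixed therefore calls for doubling iter_size.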
It is best to keep the effective batch size, which equals batch_size*iter_size*n_gpu, at 128.
Here, batch_size is the number in the model's prototxt, for example line 9 in models/ucf101/tsn_bn_inception_rgb_train_val.prototxt.
Please cite the following paper if you find this repository useful.
@inproceedings{TSN2016ECCV,
author = {Limin Wang and
Yuanjun Xiong and
Zhe Wang and
Yu Qiao and
Dahua Lin and
Xiaoou Tang and
Luc {Van Gool}},
title = {Temporal Segment Networks: Towards Good Practices for Deep Action Recognition},
booktitle = {ECCV},
year = {2016},
}
For any questions, please contact
Yuanjun Xiong: yjxiong@ie.cuhk.edu.hk
Limin Wang: lmwang.nju@gmail.com