HOLLYWOOD2: Actions in Context (CVPR 2009) [Paper][Homepage] 12 classes of human actions, 10 classes of scenes, 3,669 clips, 69 movies
HMDB: A Large Video Database for Human Motion Recognition (ICCV 2011) [Paper][Homepage] 51 classes, 7,000 clips
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild [Paper][Homepage] 101 classes, 13k clips
Sports-1M: Large-scale Video Classification with Convolutional Neural Networks [Paper][Homepage] 1,000,000 videos, 487 classes
ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding (CVPR 2015) [Paper][Homepage] 203 classes, 137 untrimmed videos per class, 1.41 activity instances per video
MPII-Cooking: Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data (IJCV 2015) [Paper][Homepage] 67 fine-grained activities, 59 composite activities, 14,105 clips, 273 videos
Kinetics [Kinetics-400/Kinetics-600/Kinetics-700/Kinetics-700-2020] [Homepage] 400/600/700/700 classes, at least 400/600/600/700 clips per class
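For the classic classification benchmarks above (HMDB51, UCF101, Kinetics), torchvision ships reference loaders. A minimal sketch for HMDB51 and UCF101, assuming the videos and the official split files have already been downloaded to local folders (the paths below are hypothetical) and that torchvision is installed with video support (PyAV):

```python
# Minimal sketch: loading HMDB51 and UCF101 with torchvision's built-in video datasets.
# Assumptions: videos and official split files are already on disk at the paths below.
from torchvision.datasets import HMDB51, UCF101

ucf101 = UCF101(
    root="data/UCF-101",                      # extracted .avi files (hypothetical path)
    annotation_path="data/ucfTrainTestlist",  # official train/test split lists (hypothetical path)
    frames_per_clip=16,
    step_between_clips=16,
    fold=1,
    train=True,
)

hmdb51 = HMDB51(
    root="data/hmdb51_videos",                          # hypothetical path
    annotation_path="data/testTrainMulti_7030_splits",  # hypothetical path
    frames_per_clip=16,
    fold=1,
    train=True,
)

video, audio, label = ucf101[0]   # video: (T, H, W, C) uint8 tensor
print(video.shape, label)
```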
Charades: Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding (ECCV 2016) [Paper][Homepage] 9,848 annotated videos, 267 people, 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes
Toyota Smarthome: Real-World Activities of Daily Living (ICCV 2019) [Paper][Homepage] 16,115 short RGB+D video samples, 31 activities, 3 modalities: RGB + Depth + 3D Skeleton, subjects are seniors aged 60 to 80
Charades-Ego: Actor and Observer: Joint Modeling of First and Third-Person Videos (CVPR 2018) [Paper][Homepage] 112 people, 4000 paired videos, 157 action classes
20BN-jester: The Jester Dataset: A Large-Scale Video Dataset of Human Gestures (ICCVW 2019) [Paper][Homepage] 148,092 videos, 27 classes, 1376 actors
Moments in Time Dataset: one million videos for event understanding (TPAMI 2019) [Paper][Homepage] over 1,000,000 labeled videos for 339 Moment classes, the average number of labeled videos per class is 1,757 with a median of 2,775
Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding [Paper][Homepage] 1.02 million videos, 313 action classes, 553,535 videos are annotated with more than one label and 257,491 videos are annotated with three or more labels
20BN-SOMETHING-SOMETHING: The "something something" video database for learning and evaluating visual common sense [Paper][Homepage] 100,000 videos across 174 classes
EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020) [Paper][Homepage] 100 hours, 37 participants, 20M frames, 90K action segments, 700 variable length videos, 97 verb classes, 300 noun classes, 4053 action classes
HOMAGE: Home Action Genome: Cooperative Compositional Action Understanding (CVPR 2021) [Paper][Homepage] 27 participants, 12 sensor types, 75 activities, 453 atomic actions, 1,752 synchronized sequences, 86 object classes, 29 relationship classes, 497,534 bounding boxes, 583,481 relationships
MMAct: A Large-Scale Dataset for Cross Modal Human Action Understanding (ICCV 2019) [Paper][Homepage] 36k video clips, 37 action classes, RGB + Keypoints + Acceleration + Gyroscope + Orientation + Wi-Fi + Pressure
LEMMA: A Multi-view Dataset for LEarning Multi-agent Multi-task Activities (ECCV 2020) [Paper][Homepage] RGB-D, 641 action classes, 11,781 action segments, 4.6M frames
NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis (CVPR 2016, TPAMI 2019) [Paper][Homepage] 106 distinct subjects, more than 114 thousand video samples, 8 million frames, 120 action classes
Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs (CVPR 2020) [Paper][Homepage] 10K videos, 0.4M objects, 1.7M visual relationships
TITAN: Future Forecast using Action Priors (CVPR 2020) [Paper][Homepage] 700 labeled video-clips, 50 labels including vehicle states and actions, pedestrian age groups, and targeted pedestrian action attributes
PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding (ACM Multimedia Workshop) [Paper][Homepage] 1,076 long video sequences, 51 action categories, performed by 66 subjects in three camera views, 20,000 action instances, 5.4 million frames, RGB+depth+Infrared Radiation+Skeleton
HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization [Paper][Homepage] HACS Clips: 1.5M annotated clips sampled from 504K untrimmed videos, HACS Segments: 139K action segments densely annotated in 50K untrimmed videos spanning 200 action categories
Oops!: Predicting Unintentional Action in Video (CVPR 2020) [Paper][Homepage] 20,338 videos, 7,368 annotated for training, 6,739 annotated for testing
RareAct: A video dataset of unusual interactions [Paper][Homepage] 122 different actions, 7,607 clips, 905 videos, 19 verbs, 38 nouns
FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding (CVPR 2020) [Paper][Homepage] 10 event categories, including 6 male events and 4 female events, 530 element categories
THUMOS: The THUMOS challenge on action recognition for videos “in the wild” [Paper][Homepage] 101 actions, train: 13,000 temporally trimmed videos, validation: 2100 temporally untrimmed videos with temporal annotations of actions, background: 3000 relevant videos, test: 5600 temporally untrimmed videos with withheld ground truth
MultiTHUMOS: Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos (IJCV 2017) [Paper][Homepage] 400 videos, 38,690 annotations of 65 action classes, 10.5 action classes per video
TinyVIRAT: Low-resolution Video Action Recognition [Paper][Homepage] 12,829 low-resolution videos, 26 classes, multi-label classification
UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicles [Paper][Homepage] 67,428 multi-modal video sequences and 119 subjects for action recognition, 22,476 frames for pose estimation, 41,290 frames and 1,144 identities for person re-identification, and 22,263 frames for attribute recognition
Mimetics: Towards Understanding Human Actions Out of Context [Paper][Homepage] 713 YouTube video clips of mimed actions for a subset of 50 classes from the Kinetics-400 dataset; allows evaluating methods trained on Kinetics on out-of-context human actions
HAA500: Human-Centric Atomic Action Dataset with Curated Videos (ICCV 2021) [Paper][Homepage] a manually annotated human-centric atomic action dataset for action recognition, covering 500 classes with over 591K labeled frames
CDAD: A Common Daily Action Dataset with Collected Hard Negative Samples (CVPR 2022) [Paper][Homepage] 57,824 video clips of 23 well-defined common daily actions with rich manual annotations
Hierarchical Action Search: Searching for Actions on the Hyperbole (CVPR 2020) [Paper][Homepage] Hierarchical-ActivityNet, Hierarchical-Kinetics, and Hierarchical-Moments from ActivityNet, mini-Kinetics, and Moments-in-time; provide action hierarchies and action splits for unseen action search
COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis (CVPR 2019) [Paper][Homepage] 11,827 videos, 180 tasks, 12 domains, 46,354 annotated segments
VideoLT: Large-scale Long-tailed Video Recognition [Paper][Homepage] 256,218 untrimmed videos, annotated into 1,004 classes with a long-tailed distribution
Youtube-8M: A Large-Scale Video Classification Benchmark [Paper][Homepage] 8,000,000 videos, 4000 visual entities
HVU: Large Scale Holistic Video Understanding (ECCV 2020) [Paper][Homepage] 572k videos in total with 9 million annotations across the training, validation, and test sets, spanning 3,142 labels; semantic aspects defined over categories of scenes, objects, actions, events, attributes, and concepts
VLOG: From Lifestyle Vlogs to Everyday Interactions (CVPR 2018) [Paper][Homepage] 114K video clips, 10.7K participants, Annotations: Hand/Semantic Object, Hand Contact State, Scene Classification
EEV: A Large-Scale Dataset for Studying Evoked Expressions from Video [Paper][Homepage] Each video is annotated at 6 Hz with 15 continuous evoked expression labels, 36.7 million annotations of viewer facial reactions to 23,574 videos (1,700 hours)
FERV39k: A Large-Scale Multi-Scene Dataset for Facial Expression Recognition in Videos (CVPR 2022) [Paper][Homepage] 38,935 video clips labeled with 7 classic expressions
OmniSource Web Dataset: Omni-sourced Webly-supervised Learning for Video Recognition (ECCV 2020) [Paper][Dataset] web data related to the 200 classes in the Mini-Kinetics subset: 732,855 Instagram videos, 3,654,650 Instagram images, 3,050,880 Google images
Charades-Ego: Actor and Observer: Joint Modeling of First and Third-Person Videos (CVPR 2018) [Paper][Homepage] 112 people, 4000 paired videos, 157 action classes
100DOH: Understanding Human Hands in Contact at Internet Scale (CVPR 2020) [Paper][Homepage] 131 days of footage, 100K annotated hand-contact video frames
EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020) [Paper][Homepage] 100 hours, 37 participants, 20M frames, 90K action segments, 700 variable length videos, 97 verb classes, 300 noun classes, 4053 action classes
HOMAGE: Home Action Genome: Cooperative Compositional Action Understanding (CVPR 2021) [Paper][Homepage] 27 participants, 12 sensor types, 75 activities, 453 atomic actions, 1,752 synchronized sequences, 86 object classes, 29 relationship classes, 497,534 bounding boxes, 583,481 relationships
Ego4D: Around the World in 3,000 Hours of Egocentric Video [Paper][Homepage] 3,025 hours of video collected by 855 unique participants from 74 worldwide locations in 9 different countries
DAVIS: A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation (CVPR 2016) [Paper][Homepage] 50 sequences, 3455 annotated frames
SegTrack v2: Video Segmentation by Tracking Many Figure-Ground Segments (ICCV 2013) [Paper][Homepage] 1,000 frames with pixel-level annotations
UVO: Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation [Paper][Homepage] 1200 videos, 108k frames, 12.29 objects per video
VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild (CVPR 2021) [Paper][Homepage] 3,536 videos, 251,632 pixel-level labeled frames, 124 categories, pixel-level annotations provided at 15 fps, each shot lasts 5 seconds on average
RGB-D in hand manipulation dataset: In-hand Object Scanning via RGB-D Video Segmentation (ICRA 2019) [Paper][Homepage] 13 sequences of in-hand manipulation of objects from the YCB dataset. Each sequence ranges from 300 to 700 frames in length (filmed at 30fps) and contains in-hand manipulation of the objects revealing all sides
ImageNet VID [Paper][Homepage] 30 categories, train: 3,862 video snippets, validation: 555 snippets
YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video [Paper][Homepage] 380,000 video segments about 19s long, 5.6 M bounding boxes, 23 types of objects
Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations (CVPR 2021) [Paper][Homepage] 15K annotated video clips supplemented with over 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes, manually annotated 3D bounding boxes for each object
DroneCrowd: Detection, Tracking, and Counting Meets Drones in Crowds: A Benchmark (CVPR 2021) [Paper][Homepage] 112 video clips with 33,600 HD frames in various scenarios, 20,800 people trajectories with 4.8 million heads and several video-level attributes
BOLD: Detecting Biological Locomotion in Video: A Computational Approach [Paper] 1,348 videos, objects: human, terrestrial quadruped, bird, reptile, cetacean, seal, fish, stingray, eel, sea snake, insects, spiders, scorpion, lobster, ball, car, train, motorbike, submarine, airplane, helicopter, rocket, oscillating stuff
VIL-100: A New Dataset and A Baseline Model for Video Instance Lane Detection (ICCV 2021) [Paper][Homepage] contains 100 videos with in total 10,000 frames, acquired from different real traffic scenarios
Water detection through spatio-temporal invariant descriptors [Paper][Dataset] 260 videos
Volleyball: A Hierarchical Deep Temporal Model for Group Activity Recognition [Paper][Homepage] 4,830 clips, 8 group activity classes, 9 individual action classes
NBA: Social Adaptive Module for Weakly-supervised Group Activity Recognition (ECCV 2020) [Paper][Homepage] 181 videos, 9,172 video clips, 9 activities
Collective: What are they doing?: Collective activity classification using spatio-temporal relationship among people [Paper][Homepage] 5 different collective activities, 44 clips
HOLLYWOOD2: Actions in Context (CVPR 2009) [Paper][Homepage] 12 classes of human actions, 10 classes of scenes, 3,669 clips, 69 movies
HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do [Paper][Homepage] 10 movies from Romance, Drama, Fantasy, Adventure, Comedy
MPII-MD: A Dataset for Movie Description [Paper][Homepage] 94 videos, 68,337 clips, 68,375 descriptions
MovieNet: A Holistic Dataset for Movie Understanding (ECCV 2020) [Paper][Homepage] 1,100 movies, 1.1M characters with bounding boxes and identities, 42K scene boundaries, 2.5K aligned description sentences, 65K tags of place and action, and 92K tags of cinematic style
MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions [Paper] 384,000 natural language sentences grounded in over 1,200 hours of video
MovieQA: Story Understanding Benchmark (CVPR 2016) [Paper][Homepage] 14,944 questions, 408 movies
Video Person-Clustering Dataset: Face, Body, Voice: Video Person-Clustering with Multiple Modalities [Paper][Homepage] multi-modal annotations (face, body and voice) for all primary and secondary characters from a range of diverse TV-shows and movies
MovieGraphs: Towards Understanding Human-Centric Situations from Videos (CVPR 2018) [Paper][Homepage] 7,637 movie clips, 51 movies, annotations: scene, situation, description, graph (Character, Attributes, Relationship, Interaction, Topic, Reason, Time stamp)
Condensed Movies: Story Based Retrieval with Contextual Embeddings (ACCV 2020) [Paper][Homepage] 33,976 captioned clips from 3,605 movies, 400K+ face-tracks, 8K+ labelled characters, 20K+ subtitles, densely pre-extracted features for each clip (RGB, Motion, Face, Subtitles, Scene)
VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events [Paper][Homepage] 45,826 videos and their descriptions obtained by harvesting YouTube
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language (CVPR 2016) [Paper][Homepage] 10K web video clips, 200K clip-sentence pairs
VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research (ICCV 2019) [Paper][Homepage] 41,250 videos, 825,000 captions in both English and Chinese, over 206,000 English-Chinese parallel translation pairs
ActivityNet Captions: Dense-Captioning Events in Videos (ICCV 2017) [Paper][Homepage] 20k videos, 100k sentences
ActivityNet Entities: Grounded Video Description [Paper][Homepage] 14,281 annotated videos, 52k video segments with at least one noun phrase annotated per segment, augments the ActivityNet Captions dataset with 158k bounding boxes
WebVid-2M: Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval (2021) [Paper][Homepage] over two million videos with weak captions scraped from the internet
VTW: Title Generation for User Generated Videos (ECCV 2016) [Paper][Homepage] 18,100 video clips with an average duration of 1.5 minutes per clip
TGIF: A New Dataset and Benchmark on Animated GIF Description (CVPR 2016) [Paper][Homepage] 100K animated GIFs from Tumblr and 120K natural language descriptions
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions (CVPR 2021) [Paper][Homepage] a video description dataset with over 500K different short videos depicting a broad range of different events
Charades: Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding (ECCV 2016) [Paper][Homepage] 9,848 annotated videos, 267 people, 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes
Pano2Vid: Automatic Cinematography for Watching 360° Videos (ACCV 2016) [Paper][Homepage] 20 of 86 360° videos have labels for testing; 9,171 normal videos captured by humans for training; topics: Soccer, Mountain Climbing, Parade, and Hiking
Deep 360 Pilot: Learning a Deep Agent for Piloting through 360° Sports Videos (CVPR 2017) [Paper][Homepage] 342 360° videos, topics: basketball, parkour, BMX, skateboarding, and dance
YT-ALL: Self-Supervised Generation of Spatial Audio for 360° Video (NeurIPS 2018) [Paper][Homepage] 1,146 videos, half of the videos are live music performances
YT360: Learning Representations from Audio-Visual Spatial Alignment (NeurIPS 2020) [Paper][Homepage] topics: musical performances, vlogs, sports, and others
Hollywood2Tubes: Spot On: Action Localization from Pointly-Supervised Proposals [Paper][Dataset] train: 823 videos, 1,026 action instances, 16,411 annotations; test: 884 videos, 1,086 action instances, 15,835 annotations
DALY: Human Action Localization with Sparse Spatial Supervision [Paper][Homepage] 10 actions, 3.3M frames, 8,133 clips
Action Completion: A temporal model for Moment Detection (BMVC 2018) [Paper][Homepage] completion moments of 16 actions from three datasets: HMDB, UCF101, RGBD-AC
RGBD-Action-Completion: Beyond Action Recognition: Action Completion in RGB-D Data (BMVC 2016) [Paper][Homepage] 414 complete/incomplete object interaction sequences, spanning six actions and captured using an RGB-D camera
P2A: A Dataset and Benchmark for Dense Action Detection from Table Tennis Match Broadcasting Videos (2022) [Paper] 2,721 video clips collected from broadcasting videos of professional table tennis matches in World Table Tennis Championships and Olympiads, 14 classes, 139,075 segments, recognition & localization
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions [Paper][Homepage] 80 atomic visual actions in 430 15-minute video clips, 1.58M action labels with multiple labels per person occurring frequently
AVA-Kinetics: The AVA-Kinetics Localized Human Actions Video Dataset [Paper][Homepage] 230k clips, 80 AVA action classes
TSU: Toyota Smarthome Untrimmed: Real-World Untrimmed Videos for Activity Detection (TPAMI 2022) [Paper][Homepage] 536 videos with an average duration of 21 mins
MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions (ICCV 2021) [Paper][Homepage] 4 sports classes, around 3,200 video clips, around 37,790 action instances annotated with 907k bounding boxes
HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization [Paper][Homepage] HACS Clips: 1.5M annotated clips sampled from 504K untrimmed videos, HACS Segments: 139K action segments densely annotated in 50K untrimmed videos spanning 200 action categories
CommonLocalization: Localizing the Common Action Among a Few Videos (ECCV 2020) [Paper][Homepage] few-shot common action localization, revised ActivityNet1.3 and Thumos14
CommonSpaceTime: Few-Shot Transformation of Common Actions into Time and Space (CVPR 2021) [Paper][Homepage] revised AVA and UCF101-24
FineAction: A Fine-Grained Video Dataset for Temporal Action Localization [Paper][Homepage] 139K fine-grained action instances densely annotated in almost 17K untrimmed videos spanning 106 action categories
MUSES: Multi-shot Temporal Event Localization: a Benchmark (CVPR 2021) [Paper][Homepage] 31,477 event instances, 716 video hours, an average of 19 shots per instance and 176 shots per video, 25 categories, 3,697 videos
MEVA: A Large-Scale Multiview, Multimodal Video Dataset for Activity Detection (WACV 2021) [Paper][Homepage] 144 annotated hours for 37 activity types, marking bounding boxes of actors and props, 38 RGB and thermal IR cameras
TVSeries: Online Action Detection (ECCV 2016) [Paper][Homepage] 27 episodes from 6 popular TV series, 30 action classes, 6,231 action instances
SEAL: A Large-scale Video Dataset of Multi-grained Spatio-temporally Action Localization [Paper] Tubes: 49.6k atomic actions spanning 172 action categories and 17.7k complex activities spanning 200 activity categories; Clips: 510.4k action labels with multiple labels per person
JRDB-Act: A Large-scale Dataset for Spatio-temporal Action, Social Group and Activity Detection (CVPR 2022) [Paper][Homepage] densely annotated with atomic actions, comprising over 2.8M action labels; each human bounding box is labeled with one pose-based action label and multiple (optional) interaction-based action labels; also provides social group annotations for grouping individuals based on their interactions in the scene and inferring their social activities (the common activities within each social group)
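The temporal localization benchmarks above (e.g., THUMOS, HACS Segments, FineAction, MUSES) are typically scored at fixed temporal-IoU thresholds. A generic sketch of segment tIoU (not code from any of these benchmarks), assuming segments are (start, end) pairs in seconds:

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: a prediction overlapping a ground-truth action instance.
print(temporal_iou((12.0, 20.0), (15.0, 25.0)))  # 5 / 13 ≈ 0.385
```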
Lingual OTB99 & Lingual ImageNet Videos: Tracking by Natural Language Specification (CVPR 2017) [Paper][Homepage] natural language descriptions of the target object
MPII-MD: A Dataset for Movie Description [Paper][Homepage] 94 videos, 68,337 clips, 68,375 descriptions
DiDeMo: Localizing Moments in Video with Temporal Language (EMNLP 2018) [Paper][Homepage] training, validation and test sets containing 8,395, 1,065 and 1,004 videos, each video is trimmed to a maximum of 30 seconds
Narrated Instruction Videos: Unsupervised Learning from Narrated Instruction Videos [Paper][Homepage] 150 videos, 800,000 frames, five tasks: making coffee, changing a car tire, performing cardiopulmonary resuscitation (CPR), jump-starting a car, and repotting a plant
YouCook: A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching (CVPR 2013) [Paper][Homepage] 88 YouTube cooking videos
MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions [Paper] 384,000 natural language sentences grounded in over 1,200 hours of video
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips (ICCV 2019) [Paper][Homepage] 136 million video clips sourced from 1.22M narrated instructional web videos, 23k different visual tasks
How2: A Large-scale Dataset for Multimodal Language Understanding (NeurIPS 2018) [Paper][Homepage] 80,000 clips, word-level time alignments to the ground-truth English subtitles
Breakfast: The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities [Paper][Homepage] 52 participants, 10 distinct cooking activities captured in 18 different kitchens, 48 action classes, 11,267 clips
EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020) [Paper][Homepage] 100 hours, 37 participants, 20M frames, 90K action segments, 700 variable length videos, 97 verb classes, 300 noun classes, 4053 action classes
YouCook2: YouCookII Dataset [Paper][Homepage] 2,000 long untrimmed videos, 89 cooking recipes, each recipe includes 5 to 16 steps, each step described with one sentence
WebVid-2M: Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval (2021) [Paper][Homepage] over two million videos with weak captions scraped from the internet
QuerYD: A video dataset with textual and audio narrations (ICASSP 2021) [Paper][Homepage] 1,400+ narrators, 200+ video hours, 70+ description hours
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference (CVPR 2020) [Paper][Homepage] 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video
CrossTask: weakly supervised learning from instructional videos (CVPR 2019) [Paper][Homepage] 4.7K videos, 83 tasks
HD-VILA-100M: Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions (CVPR 2022) [Paper][Homepage] 103M video clips with transcriptions
EPIC-Kitchens: Multi-Modal Domain Adaptation for Fine-Grained Action Recognition (CVPR 2020) [Paper][Homepage] 3 domains, 8 action classes
MPII-Cooking: Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data (IJCV 2015) [Paper][Homepage] 67 fine-grained activities, 59 composite activities, 14,105 clips, 273 videos
FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding (CVPR 2020) [Paper][Homepage] 10 event categories, including 6 male events and 4 female events, 530 element categories
FineAction: A Fine-Grained Video Dataset for Temporal Action Localization [Paper][Homepage] 139K fine-grained action instances densely annotated in almost 17K untrimmed videos spanning 106 action categories
SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events [Paper][Homepage] 10,080 in-the-wild videos and 62,535 annotated QA pairs
R2VQ: Recipe-to-Video Questions [Paper][Homepage]
Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence (CVPR 2019) [Paper][Homepage] 1,250 videos, 7,500 questions, 30,000 correct answers and 22,500 incorrect answers
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering (CVPR 2017) [Paper][Homepage] 165K QA pairs for the animated GIFs from the TGIF dataset
MovieQA: Story Understanding Benchmark (CVPR 2016) [Paper][Homepage] 14,944 questions, 408 movies
MarioQA: Answering Questions by Watching Gameplay Videos (ICCV 2017) [Paper][Homepage] 13 hours of gameplay, 187,757 examples with automatically generated QA pairs; 92,874 unique QA pairs, and each video clip contains 11.3 events on average
TVQA: Localized, Compositional Video Question Answering (EMNLP 2018) [Paper][Homepage] 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR 2021) [Paper][Homepage] 5,440 videos with an average length of 44s and about 52K manually annotated question-answer pairs grouped into causal (48%), temporal (29%), and descriptive (23%) questions
A2D: Can Humans Fly? Action Understanding with Multiple Classes of Actors (CVPR 2015) [Paper][Homepage] 3,782 videos, actors: adult, baby, bird, cat, dog, ball and car, actions: climbing, crawling, eating, flying, jumping, rolling, running, and walking
J-HMDB: Towards understanding action recognition (ICCV 2013) [Paper][Homepage] 31,838 annotated frames, 21 categories involving a single person in action: brush hair, catch, clap, climb stairs, golf, jump, kick ball, pick, pour, pull-up, push, run, shoot ball, shoot bow, shoot gun, sit, stand, swing baseball, throw, walk, wave
A2D Sentences & J-HMDB Sentences: Actor and Action Video Segmentation from a Sentence (CVPR 2018) [Paper][Homepage] A2D Sentences: 6,656 sentences, including 811 different nouns, 225 verbs and 189 adjectives, J-HMDB Sentences: 928 sentences, including 158 different nouns, 53 verbs and 23 adjectives
QUVA Repetition: Real-World Repetition Estimation by Div, Grad and Curl (CVPR 2018) [Paper][Homepage] 100 videos
YTSegments: Live Repetition Counting (ICCV 2015) [Paper][Homepage] 100 videos
UCFRep: Context-Aware and Scale-Insensitive Temporal Repetition Counting (CVPR 2020) [Paper][Homepage] 526 videos
Countix: Counting Out Time: Class Agnostic Video Repetition Counting in the Wild (CVPR 2020) [Paper][Homepage] 8,757 videos
Countix-AV & Extreme Countix-AV: Repetitive Activity Counting by Sight and Sound (CVPR 2021) [Paper][Homepage] 1,863 videos in Countix-AV, 214 videos in Extreme Countix-AV
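Repetition counting on the datasets above is commonly reported with mean absolute error normalized by the ground-truth count and off-by-one (OBO) accuracy; a minimal sketch of both metrics, assuming per-video integer counts:

```python
def counting_metrics(pred_counts, gt_counts):
    """Normalized MAE and off-by-one (OBO) accuracy over per-video repetition counts."""
    assert len(pred_counts) == len(gt_counts) and len(gt_counts) > 0
    mae = sum(abs(p - g) / g for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)
    obo = sum(abs(p - g) <= 1 for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)
    return mae, obo

print(counting_metrics([4, 10, 7], [5, 10, 9]))  # (≈0.141, ≈0.667)
```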
Audio Set: An ontology and human-labeled dataset for audio events (ICASSP 2017) [Paper][Homepage] 632 audio event classes, 2,084,320 human-labeled 10-second sound clips
MUSIC: The Sound of Pixels (ECCV 2018) [Paper][Homepage] 685 untrimmed videos, 11 instrument categories
AudioSet ZSL: Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zero-shot Classification and Retrieval of Videos (WACV 2020) [Paper][Homepage] 33 classes, 156,416 videos
Kinetics-Sound: Look, Listen and Learn (ICCV 2017) [Paper] 34 action classes from Kinetics
ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning (ICCV 2021) [Paper][Homepage] 100 million 10-second clips
EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020) [Paper][Homepage] 100 hours, 37 participants, 20M frames, 90K action segments, 700 variable length videos, 97 verb classes, 300 noun classes, 4053 action classes
SoundNet: Learning Sound Representations from Unlabeled Video (NIPS 2016) [Paper][Homepage] 2+ million videos
AVE: Audio-Visual Event Localization in Unconstrained Videos (ECCV 2018) [Paper][Homepage] 4,143 10-second videos, 28 audio-visual events
LLP: Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing (ECCV 2020) [Paper][Homepage] 11,849 YouTube video clips, 25 event categories
VGG-Sound: A large scale audio-visual dataset [Paper][Homepage] 200k videos, 309 audio classes
YouTube-ASMR-300K: Telling Left from Right: Learning Spatial Correspondence of Sight and Sound (CVPR 2020) [Paper][Homepage] 300K 10-second video clips with spatial audio
XD-Violence: Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision (ECCV 2020) [Paper][Homepage] 4,754 untrimmed videos
VGG-SS: Localizing Visual Sounds the Hard Way (CVPR 2021) [Paper][Homepage] 5K videos, 200 categories
VoxCeleb: Large-scale speaker verification in the wild [Paper][Homepage] a million ‘real-world’ utterances, over 7000 speakers
EmoVoxCeleb: Emotion Recognition in Speech using Cross-Modal Transfer in the Wild [Paper][Homepage] 1,251 speakers
Speech2Gesture: Learning Individual Styles of Conversational Gesture (CVPR 2019) [Paper][Homepage] 144-hour person-specific video, 10 speakers
AVSpeech: Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation [Paper][Homepage] 150,000 distinct speakers, 290k YouTube videos
LRW: Lip Reading in the Wild (ACCV 2016) [Paper][Homepage] 1000 utterances of 500 different words
LRW-1000: LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild (FG 2019) [Paper][Homepage] 718,018 video samples from 2,000+ individual speakers of 1,000 Mandarin words
LRS2: Deep Audio-Visual Speech Recognition (TPAMI 2018) [Paper][Homepage] Thousands of natural sentences from British television
LRS3-TED: a large-scale dataset for visual speech recognition [Paper][Homepage] thousands of spoken sentences from TED and TEDx videos
CMLR: A Cascade Sequence-to-Sequence Model for Chinese Mandarin Lip Reading (ACM MM Asia 2019) [Paper][Homepage] 102,072 spoken sentences of 11 speakers from a national news program in China (CCTV)
APES: Audiovisual Person Search in Untrimmed Video [Paper][Homepage] untrimmed videos whose audio (voices) and visual (faces) streams are densely annotated, over 1.9K identities labeled along 36 hours of video, dense temporal annotations that link faces to speech segments of the same identity
Countix-AV & Extreme Countix-AV: Repetitive Activity Counting by Sight and Sound (CVPR 2021) [Paper][Homepage] 1,863 videos in Countix-AV, 214 videos in Extreme Countix-AV
EPIC-Skills: Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination (CVPR 2018) [Paper][Homepage] 3 tasks, 113 videos, 1000 pairwise ranking annotations
BEST: The Pros and Cons: Rank-aware Temporal Attention for Skill Determination in Long Videos (CVPR 2019) [Paper][Homepage] 5 tasks, 500 videos, 13000 pairwise ranking annotations
AQA-7: Action Quality Assessment Across Multiple Actions (WACV 2019) [Paper][Homepage] 1,189 samples from 7 sports: 370 from single diving - 10m platform, 176 from gymnastic vault, 175 from big air skiing, 206 from big air snowboarding, 88 from synchronous diving - 3m springboard, 91 from synchronous diving - 10m platform and 83 from trampoline
AQA-MTL: What and how well you performed? A multitask approach to action quality assessment (CVPR 2019) [Paper][Homepage] 1,412 fine-grained samples collected from 16 different events with various views
Olympic Scoring Dataset: Learning to score Olympic events (CVPR 2017 Workshop) [Paper][Homepage] doubles the existing MIT diving dataset from 159 to 370 samples and adds a new gymnastic vault dataset of 176 samples
JIGSAWS: JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS): A Surgical Activity Dataset for Human Motion Modeling [Paper][Homepage] surgical activities, 3 tasks: "Suturing (S)", "Needle Passing (NP)", and "Knot Tying (KT)"; each video is annotated with multiple scores assessing different aspects of the video
FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment (CVPR 2022) [Paper][Dataset] 3000 video samples, covering 52 action types, 29 sub-action types, and 23 difficulty degree types
HowTo100M Adverbs: Action Modifiers: Learning from Adverbs in Instructional Videos (CVPR 2020) [Paper][Homepage] 5,824 clips, 72 actions, 6 adverbs, 263 pairs
VATEX Adverbs: How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs (CVPR 2022) [Paper][Homepage] 14,617 clips, 34 adverbs, 135 actions, 1,550 pairs
MSR-VTT Adverbs: How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs (CVPR 2022) [Paper][Homepage] 1,824 clips, 18 adverbs, 106 actions, 464 pairs
ActivityNet Adverbs: How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs (CVPR 2022) [Paper][Homepage] 3,099 clips, 20 adverbs, 114 actions, 643 pairs
TRECVID Challenge: TREC Video Retrieval Evaluation [Homepage] sources: YFCC100M, Flickr, etc.
Video Browser Showdown – The Video Retrieval Competition [Homepage]
TRECVID-VTT: TRECVID 2019: An Evaluation Campaign to Benchmark Video Activity Detection, Video Captioning and Matching, and Video Search & Retrieval [Paper][Homepage] 9185 videos with captions
V3C - A Research Video Collection [Paper][Homepage] 7475 Vimeo videos, 1,082,657 short video segments
IACC: Creating a web-scale video collection for research [Paper][Homepage] 4600 Internet Archive videos
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (ECCV 2020) [Paper][Homepage] 108,965 queries on 21,793 videos from 6 TV shows of diverse genres, where each query is associated with a tight temporal alignment
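Text-to-video retrieval on benchmarks such as those above is commonly summarized with Recall@K computed from a query-by-video similarity matrix; a generic sketch, assuming one relevant video per query indexed on the diagonal:

```python
import numpy as np

def recall_at_k(similarity, ks=(1, 5, 10)):
    """similarity: (num_queries, num_videos) array; ground truth assumed on the diagonal."""
    ranks = []
    for i, row in enumerate(similarity):
        order = np.argsort(-row)                        # videos sorted by descending similarity
        ranks.append(int(np.where(order == i)[0][0]))   # 0-based rank of the correct video
    ranks = np.asarray(ranks)
    return {f"R@{k}": float(np.mean(ranks < k)) for k in ks}

sim = np.random.default_rng(0).standard_normal((100, 100))  # toy similarity scores
print(recall_at_k(sim))
```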
EPIC-Kitchens: Multi-Modal Domain Adaptation for Fine-Grained Action Recognition (CVPR 2020) [Paper][Homepage] 3 domains, 8 action classes
CharadesEgo: Actor and Observer: Joint Modeling of First and Third-Person Videos (CVPR 2018) [Paper][Homepage] 2 domains (1st-person and 3rd-person views), 157 action classes, 4,000 paired videos, multi-class classification
UCF-HMDB: Temporal Attentive Alignment for Large-Scale Video Domain Adaptation (ICCV 2019) [Paper][Dataset] 3,209 videos with 12 action classes
Kinetics-NEC-Drone: Unsupervised and Semi-Supervised Domain Adaptation for Action Recognition from Drones (WACV 2020) [Paper][Homepage] Source domain: Kinetics (13 classes), target domain: NEC-Drone (7 classes, 5,250 videos)
ActorShift: Audio-Adaptive Activity Recognition Across Video Domains (CVPR 2022) [Paper][Homepage] source domain: 1,305 videos of 7 human activity classes from Kinetics-700, target domain: 200 videos with animal actors
Lingual OTB99 & Lingual ImageNet Videos: Tracking by Natural Language Specification (CVPR 2017) [Paper][Homepage] natural language descriptions of the target object
OxUvA: Long-term Tracking in the Wild: A Benchmark (ECCV 2018) [Paper][Homepage] 366 sequences spanning 14 hours of video
LaSOT: A High-quality Benchmark for Large-scale Single Object Tracking [Paper][Homepage] 1,400 sequences with more than 3.5M frames, each frame is annotated with a bounding box
TNL2K: Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark (CVPR 2021) [Paper][Homepage] 2k video sequences (1,244,340 frames and 663 words in total), split 1,300/700 for training/testing; each video is densely annotated with one English sentence and bounding boxes of the target object
TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild (ECCV 2018) [Paper][Homepage] 30K videos with more than 14 million dense bounding box annotations, a new benchmark composed of 500 novel videos
ALOV300+: Visual Tracking: An Experimental Survey (TPAMI 2014) [Paper][Homepage][Dataset] 315 videos
NUS-PRO: A New Visual Tracking Challenge (TPAMI 2015) [Paper][Homepage] 365 image sequences
UAV123: A Benchmark and Simulator for UAV Tracking (ECCV 2016) [Paper][Homepage] 123 new and fully annotated HD video sequences captured from a low-altitude aerial perspective
OTB2013: Online Object Tracking: A Benchmark (CVPR 2013) [Paper][Homepage] 50 video sequences
OTB2015: Object Tracking Benchmark (TPAMI 2015) [Paper][Homepage] 100 video sequences
VOT Challenge [Homepage]
MOT Challenge [Homepage]
VisDrone: Vision Meets Drones: A Challenge [Paper][Homepage]
TAO: A Large-Scale Benchmark for Tracking Any Object [Paper][Homepage] 2,907 videos, 833 classes, 17,287 tracks
GMOT-40: A Benchmark for Generic Multiple Object Tracking [Paper][Homepage] 40 carefully annotated sequences evenly distributed among 10 object categories
BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning (CVPR 2020) [Paper][Homepage] 100K videos and 10 tasks
DroneCrowd: Detection, Tracking, and Counting Meets Drones in Crowds: A Benchmark (CVPR 2021) [Paper][Homepage] 112 video clips with 33,600 HD frames in various scenarios, 20,800 people trajectories with 4.8 million heads and several video-level attributes
KIEV: Interactivity Proposals for Surveillance Videos [Paper][Homepage] a new task of spatio-temporal interactivity proposals
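The single-object tracking benchmarks above (OTB, LaSOT, TrackingNet, UAV123, etc.) are usually evaluated with per-frame bounding-box overlap and a success rate at an overlap threshold; a generic sketch, assuming boxes in (x, y, w, h) format:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames whose predicted box overlaps the ground truth above the threshold."""
    overlaps = [box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(o > threshold for o in overlaps) / len(overlaps)

print(success_rate([(0, 0, 10, 10), (5, 5, 10, 10)], [(0, 0, 10, 10), (20, 20, 10, 10)]))  # 0.5
```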
ImageNet-VidVRD: Video Visual Relation Detection [Paper][Homepage] 1,000 videos, 35 common subject/object categories and 132 relationships
VidOR: Annotating Objects and Relations in User-Generated Videos [Paper][Homepage] 10,000 videos selected from YFCC100M collection, 80 object categories and 50 predicate categories
Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks (CVPR 2020) [Paper][Homepage] annotations for 180,049 videos from the Something-Something dataset
Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs (CVPR 2020) [Paper][Homepage] 10K videos, 0.4M objects, 1.7M visual relationships
VidSitu: Visual Semantic Role Labeling for Video Understanding (CVPR 2021) [Paper][Homepage] 29K 10-second movie clips richly annotated with a verb and semantic-roles every 2 seconds
MOMA: Multi-Object Multi-Actor Activity Parsing [Paper][Homepage] 373 raw videos at the activity level, 2,364 trimmed videos at the sub-activity level, and 12,057 atomic action instances, covering 17 activity classes, 67 sub-activity classes, and 52 atomic action classes; at the frame level, it provides action hypergraph annotations for 37,428 frames, with 164,162 actor/object instances of 20 actor classes and 120 object classes, and 119,132 relationship instances of 75 relationship classes
XD-Violence: Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision (ECCV 2020) [Paper][Homepage] 4,754 untrimmed videos
UCF-Crime: Real-world Anomaly Detection in Surveillance Videos [Paper][Homepage] 1,900 videos
UBnormal: New Benchmark for Supervised Open-Set Video Anomaly Detection (CVPR 2022) [Paper][Homepage] abnormal events annotated at the pixel level at training time
YouTube Highlights: Ranking Domain-specific Highlights by Analyzing Edited Videos [Paper][Homepage] six domain-specific categories: surfing, skating, skiing, gymnastics, parkour, and dog. Each domain consists of around 100 videos and the total accumulated time is 1430 minutes
PHD2: Personalized Highlight Detection for Automatic GIF Creation [Paper][Homepage] the training set contains highlights from 12,972 users, the test set contains highlights from 850 users
TVSum: Summarizing web videos using titles (CVPR 2015) [Paper][Homepage] 50 videos of various genres (e.g., news, how-to, documentary, vlog, egocentric) and 1,000 annotations of shot-level importance scores obtained via crowdsourcing (20 per video)
QVHIGHLIGHTS: Detecting Moments and Highlights in Videos via Natural Language Queries (2021) [Paper][Homepage] over 10,000 YouTube videos, each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips
SumMe: Creating Summaries from User Videos (ECCV 2014) [Paper][Homepage] 25 videos, each annotated with at least 15 human summaries (390 in total)
QFVS: Query-Focused Video Summarization: Dataset, Evaluation, and A Memory Network Based Approach (CVPR 2017) [Paper][Homepage] 300 hours of videos
TVSum: Summarizing web videos using titles (CVPR 2015) [Paper][Homepage] 50 videos of various genres (e.g., news, how-to, documentary, vlog, egocentric) and 1,000 annotations of shot-level importance scores obtained via crowdsourcing (20 per video)
YouTube Pose: Personalizing Human Video Pose Estimation (CVPR 2016) [Paper][Homepage] 50 videos, 5,000 annotated frames
JHMDB: Towards understanding action recognition (ICCV 2013) [Paper][Homepage] 5,100 clips of 51 different human actions collected from movies or the Internet, 31,838 annotated frames in total
Penn Action: From Actemes to Action: A Strongly-supervised Representation for Detailed Action Understanding (ICCV 2013) [Paper][Homepage] 2,326 video sequences of 15 different actions and human joint annotations for each sequence
PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes (RSS 2018) [Paper][Homepage] accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames
Edinburgh Pig Behavior: Extracting Accurate Long-Term Behavior Changes from a Large Pig Dataset [Paper][Homepage] 23 days of daytime pig video (spread over 6 weeks) captured from a nearly overhead camera at 6 frames per second, stored in batches of 1,800 frames (5 minutes); most frames show 8 pigs
Kagu bird wildlife monitoring: Spatio-Temporal Event Segmentation and Localization for Wildlife Extended Videos [Paper][Homepage] 10 days (254 hours) of continuous wildlife monitoring data; the labels include four unique bird activities, {feeding the chick, incubation/brooding, nest building while sitting on the nest, nest building around the nest}; start and end times for each instance of these activities are provided with the annotations
Animal Kingdom: A Large and Diverse Dataset for Animal Behavior Understanding (CVPR 2022) [Paper][Homepage] 50 hours of annotated videos to localize relevant animal behavior segments in long videos for the video grounding task, 30K video sequences for the fine-grained multi-label action recognition task, and 33K frames for the pose estimation task, corresponding to a diverse range of animals with 850 species across 6 major animal classes
iLIDS-VID: Person re-identification by video ranking (ECCV 2014) [Paper][Homepage] 600 image sequences of 300 distinct individuals
PRID-2011: Person Re-identification by Descriptive and Discriminative Classification [Paper][Homepage] 400 image sequences for 200 identities from two non-overlapping cameras
MARS: A Video Benchmark for Large-Scale Person Re-Identification (ECCV 2016) [Paper][Homepage] 1,261 identities and around 18,000 video sequences
Dynamic Texture: A New Large Scale Dynamic Texture Dataset with Application to ConvNet Understanding (ECCV 2018) [Paper][Homepage] over 10,000 videos
YUVL: Spacetime Texture Representation and Recognition Based on a Spatiotemporal Orientation Analysis (TPAMI 2012) [Paper][Homepage][Dataset] 610 spacetime texture samples
UCLA: Dynamic Texture Recognition (CVPR 2001) [Paper][Dataset] 76 dynamic textures
MultiBench: Multiscale Benchmarks for Multimodal Representation Learning [Homepage] spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas
M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining (CVPR 2022) [Paper][Homepage] 6 million multi-modal samples, 5k properties with 24 million values, 5 modalities (image, text, table, video, audio), 6 million category annotations with 6k classes, wide data sources (provided by 1 million merchants)
Real-world Flag & FlagSim: Cloth in the Wind: A Case Study of Physical Measurement through Simulation (CVPR 2020) [Paper][Homepage] Real-world Flag: 2.7K train and 1.3K video clips; FlagSim: 1,000 mesh sequences, 14,000 training examples
Physics 101: Learning Physical Object Properties from Unlabeled Videos (BMVC 2016) [Paper][Homepage] over 10,000 video clips containing 101 objects of various materials and appearances (shapes, colors, and sizes)
CLEVRER: Collision Events for Video Representation and Reasoning (ICLR 2020) [Paper][Homepage] 10,000 videos for training, 5,000 for validation, and 5,000 for testing; all videos last 5 seconds; the videos are generated by a physics engine that simulates object motion plus a graphics engine that renders the frames