Grounded Video Situation Recognition (NeurIPS 2022)
Zeeshan Khan, C V Jawahar, Makarand Tapaswi

GVSR is a structured dense video understanding task. It is built on top of VidSitu, a large-scale dataset of 10-second videos from complex movie scenes. Dense video understanding requires answering several questions, such as who is doing what to whom, with what, how, and where. GVSR affords this by recognising the action verbs and their corresponding roles, and by localising them in the spatio-temporal domain in a weakly supervised setting, i.e. the supervision for grounding is provided only in the form of role captions, without any ground-truth bounding boxes.

This repository includes:

  1. Instructions to download the precomputed object and video features of the VidSitu dataset.
  2. Instructions for installing the GVSR dependencies.
  3. Code for both GVSR frameworks: (i) with ground-truth roles and (ii) end-to-end GVSR.

Download

Please see DATA_PREP.md for the instructions on downloading and setting up the dataset and the pre-extracted object and video features.

Note: Running the code does not require the raw videos. If you wish to download the videos, please refer to https://github.com/TheShadow29/VidSitu/blob/main/data/DATA_PREP.md

Installation

Please see INSTALL.md for setting up the conda environment and installing the dependencies.

Training

Framework 1: Using Ground Truth Roles

Use the option grounded_vb_srl_GT_role for the --task_type argument; this trains the model to predict verbs and semantic role captions given the ground-truth roles.
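
For example, a training run for this framework might be launched as follows. This is only a sketch modelled on the evaluation command in the Pretrained Model section below; the experiment name gt_roles_expt is a placeholder and additional configuration flags may be required:

    CUDA_VISIBLE_DEVICES=0 python main_dist.py gt_roles_expt --task_type=grounded_vb_srl_GT_role --train.bs=16 --train.bsv=16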

After each epoch, evaluation is performed for the 3 tasks: 1) verb prediction, 2) SRL (caption generation), and 3) grounded SRL.

Framework 2: End-to-end Situation Recognition

Use the option grounded_end-to-end for the --task_type argument; this trains the model to predict verbs, roles, and semantic role captions without using any intermediate ground-truth data. This framework allows for end-to-end situation recognition.
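
Analogously, a sketch of an end-to-end training launch (the experiment name e2e_expt is a placeholder and additional configuration flags may be required):

    CUDA_VISIBLE_DEVICES=0 python main_dist.py e2e_expt --task_type=grounded_end-to-end --train.bs=16 --train.bsv=16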

After each epoch, evaluation is performed for the 3 tasks: 1) verb prediction, 2) SRL (caption generation), and 3) grounded SRL.

Note: Evaluation for grounded SRL is coming soon!

Logging

Logs are stored inside the tmp/ directory. When you run the code with an experiment name $exp_name, the corresponding logs are stored under that name.

Storing grounding results requires extra space and time during evaluation. To enable it, use the argument --train.visualise_bboxes.
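
For example, a sketch that appends this flag to the evaluation command from the Pretrained Model section below (check whether the flag expects an explicit value in your setup):

    CUDA_VISIBLE_DEVICES=0 python main_dist.py experiment1 --task_type=grounded_vb_srl_GT_role --only_val --train.resume --train.resume_path=model_weights/mdl_ep_11.pth --train.bs=16 --train.bsv=16 --train.visualise_bboxes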

Logs are also stored using MLflow. These can be uploaded to other experiment trackers such as neptune.ai or wandb for better visualization of results.
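
For a quick local view, the MLflow UI can be pointed at the stored runs; the path below is a placeholder, so substitute the directory inside tmp/ where the MLflow runs actually land:

    mlflow ui --backend-store-uri tmp/<mlflow-run-dir>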

Pretrained Model (Framework 1)

Download the pretrained model from here: Pretrained Model

Place it in model_weights/.

To evaluate the pretrained model, run:

    CUDA_VISIBLE_DEVICES=0 python main_dist.py experiment1 --task_type=grounded_vb_srl_GT_role --only_val --train.resume --train.resume_path=model_weights/mdl_ep_11.pth --train.bs=16 --train.bsv=16

Prediction Format

  1. The output format for the prediction files is as follows:

    1. Verb Prediction:

      List[Dict]
      Dict:
          # Both lists of length 5. Outer list denotes Events 1-5, inner list denotes Top-5 VerbID predictions
          pred_vbs_ev: List[List[str]]
          # Both lists of length 5. Outer list denotes Events 1-5, inner list denotes the scores for the Top-5 VerbID predictions
          pred_scores_ev: List[List[float]]
          # The index of the video segment used. Corresponds to the number in {valid|test}_split_file.json
          ann_idx: int
    2. Semantic Role Labeling Prediction:

      List[Dict]
      Dict:
          # same as above
          ann_idx: int
          # The main output used for evaluation. Outer Dict is for Events 1-5.
          vb_output: Dict[Dict]
          # The inner dict has the following keys:
              # VerbID of the event
              vb_id: str
              ArgX: str
              ArgY: str
              ...

      Note that ArgX, ArgY depend on the specific VerbID

    3. Grounded SRL:

      Folder_videoID
          frame 1: [box1, box2, ..] (for event 1 role 1)
          frame 2: [box1, box2, ..] (for event 1 role 2)
          .
          .
          frame t: [box1, box2, ..] (for event n role m)
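
To make the verb-prediction schema above concrete, here is a minimal, self-contained Python sketch. The entry below is a made-up placeholder shaped like the documented format (not real model output), and no assumption is made about the on-disk serialization of the prediction files:

    # Illustrative placeholder shaped like one verb-prediction entry.
    sample_entry = {
        "ann_idx": 0,  # index into {valid|test}_split_file.json
        # Events 1-5, each with Top-5 VerbID predictions (placeholder strings)
        "pred_vbs_ev": [["run", "walk", "jump", "stand", "turn"] for _ in range(5)],
        # Matching Top-5 scores per event (placeholder values)
        "pred_scores_ev": [[0.4, 0.2, 0.2, 0.1, 0.1] for _ in range(5)],
    }

    # Print the Top-1 VerbID and its score for each of the 5 events.
    for ev, (vbs, scores) in enumerate(
        zip(sample_entry["pred_vbs_ev"], sample_entry["pred_scores_ev"]), start=1
    ):
        print(f"Event {ev}: top-1 verb = {vbs[0]} (score {scores[0]:.2f})")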

Citation

@inproceedings{khan2022grounded,
        title={Grounded Video Situation Recognition},
        author={Zeeshan Khan and C.V. Jawahar and Makarand Tapaswi},
        booktitle={Advances in Neural Information Processing Systems},
        year={2022},
        url={https://openreview.net/forum?id=yRhbHp_Vh8e}
}

@InProceedings{Sadhu_2021_CVPR,
        author = {Sadhu, Arka and Gupta, Tanmay and Yatskar, Mark and Nevatia, Ram and Kembhavi, Aniruddha},
        title = {Visual Semantic Role Labeling for Video Understanding},
        booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
        month = {June},
        year = {2021}
}