yeliudev / ConsNet

🚴‍♂️ ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection (MM 2020)
https://arxiv.org/abs/2008.06254
GNU General Public License v3.0

the build_dataset function in build_dataset.py file #6

Closed: zhangzhuoran997997 closed this issue 3 years ago

zhangzhuoran997997 commented 3 years ago

Thanks for your work! I have some questions:

yeliudev commented 3 years ago

@zhangzhuoran997997 Thanks for your interest in our work!

  1. In the function build_dataset, h_blob and o_blob store the bounding boxes, object detection scores, and other information about human (h) and object (o) instances. In HICO-DET, aside from human-object interactions, there are also human-human interactions; in this case, the second human should be treated as the 'object'. That is why we concatenate objects with humans to obtain pair proposals for these interactions.
  2. max_h_as_o is the maximum number of humans to be considered as objects in each image. If max_h_as_o > 0, only the top max_h_as_o humans with the highest object detection scores are concatenated with the objects. This argument is set to -1 (no limit) for the training set and to 3 for the test set to reduce the number of pair proposals.
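As a rough illustration of the two points above, a pair-proposal builder could look as follows. Note that build_pairs and its signature are hypothetical stand-ins for this discussion, not the actual code in build_dataset.py:

```python
# Hypothetical sketch of pair-proposal construction, assuming each
# detection is a (bbox, score) tuple. `build_pairs` and `max_h_as_o`
# follow the discussion above but differ from the repository code.
def build_pairs(humans, objects, max_h_as_o=-1):
    """Pair every human with every object; optionally treat the
    top-scoring humans as extra 'objects' so that human-human
    interactions are also covered."""
    if max_h_as_o != 0:
        # Rank humans by object detection score (descending) and keep
        # the top max_h_as_o of them (-1 means keep all).
        ranked = sorted(humans, key=lambda det: det[1], reverse=True)
        extra = ranked if max_h_as_o < 0 else ranked[:max_h_as_o]
        objects = objects + extra  # humans acting as objects
    # A human is never paired with itself.
    return [(h, o) for h in humans for o in objects if h is not o]
```

With two detected humans and one object, max_h_as_o=-1 yields four proposals (each human paired with the object and with the other human), while max_h_as_o=0 yields only the two human-object pairs.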
zhangzhuoran997997 commented 3 years ago

Really appreciate your answer!

I'm a beginner, so my questions may be rather basic.

I have some other questions:

yeliudev commented 3 years ago
  1. dt_blobs stores the information about humans and objects detected by the object detector, while gt_blobs stores the features of the ground-truth bboxes from HICO-DET. The other information (bboxes, classes) about these instances is loaded from the annotation file (anno_bbox.mat).
  2. The object detector we used inherits from mmdetection's TwoStageDetector. We made several modifications so that hidden features can be extracted from it. For details of the detector, you may refer to this.
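The modification described in point 2 can be pictured with a deliberately tiny, framework-free sketch: the detector returns its intermediate features alongside the final predictions instead of discarding them. All class and method names below are illustrative stand-ins, not the actual mmdetection or ConsNet code:

```python
# Illustrative stand-in for a two-stage detector that also exposes its
# hidden features; the real code subclasses mmdetection's
# TwoStageDetector, and these names/internals are hypothetical.
class FeatureExtractingDetector:
    def extract_feat(self, image):
        # Stand-in for the backbone + neck: produce "hidden features".
        return [pixel * 0.5 for pixel in image]

    def predict(self, feats):
        # Stand-in for the RoI head: keep only confident responses.
        return [f for f in feats if f > 0.2]

    def forward_with_feats(self, image):
        # The key change: hand the hidden features back to the caller
        # instead of keeping them internal to the detector.
        feats = self.extract_feat(image)
        return self.predict(feats), feats
```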

Please feel free to ask me again if you have any other questions :)

zhangzhuoran997997 commented 3 years ago

Thank you very much for the timely answers. Now, I have a comprehensive understanding of the code and each module, but there are still some questions:

yeliudev commented 3 years ago
  1. self._use_cache is False in training mode and becomes True in test mode (after you explicitly invoke model.eval()). This is because the inputs of SemanticBlock (the GATs) are fixed (word embeddings from pretrained ELMo), so its output can be cached for more efficient inference.
  2. All the parameters (including the Mapper block, the Fusion block, and the GATs) except those belonging to the object detector are updated during training.
  3. convert_annotation is used to convert the annotations of HICO-DET to COCO format. We indeed used these converted annotations to finetune the detector with mmdet.
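The caching pattern in point 1 can be sketched in miniature. SemanticBlock here is a stand-in with made-up internals, not the actual ConsNet module:

```python
# Minimal sketch of test-time caching for a module whose inputs
# (fixed word embeddings) never change; names are illustrative.
class SemanticBlock:
    def __init__(self):
        self._use_cache = False  # enabled once eval() is called
        self._cache = None
        self.calls = 0           # counts real forward passes

    def eval(self):
        self._use_cache = True
        return self

    def _compute(self, embeddings):
        self.calls += 1
        # Stand-in for the GAT forward pass over word embeddings.
        return [e * 2 for e in embeddings]

    def forward(self, embeddings):
        if self._use_cache:
            # Compute once, then reuse the cached output.
            if self._cache is None:
                self._cache = self._compute(embeddings)
            return self._cache
        return self._compute(embeddings)
```

In training mode every call recomputes the output; after eval(), only the first call does any work.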
zhangzhuoran997997 commented 3 years ago

Really appreciate your answer!

I have a question: do I need to change much of the code if I want to train with multiple GPUs?

yeliudev commented 3 years ago

Sorry for my late reply. Currently, NNDataParallel only supports single-GPU training. If you would like to train on multiple GPUs, you may use NNDistributedDataParallel, which is similar to PyTorch's DistributedDataParallel. Some minor changes to the code would also be needed. However, there is no need to use multiple GPUs for this model: if you want to increase the batch size, you can simply change the value of batch_size in the config files.
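For example, assuming an mmcv-style Python config (the surrounding key layout here is hypothetical; only batch_size itself is the option mentioned above):

```python
# Hypothetical config excerpt; the actual key layout in this
# repository may differ, but batch_size is the value to change.
data = dict(
    batch_size=4,   # increase this instead of adding GPUs
    num_workers=4,  # dataloader workers (illustrative)
)
```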

yeliudev commented 3 years ago

I'm closing this issue. Please feel free to re-open it if you have any further questions.