wei-tim / YOWO

You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization

Frame mAP on UCF24 #75

Closed alphadadajuju closed 3 years ago

alphadadajuju commented 3 years ago

Thank you for sharing your amazing work (I see that you even verified YOWO's performance on AVA recently)!

I do have some questions about how you evaluated UCF-24's frame mAP, and I was hoping you could help clarify them.

  1. It appears that YOWO's frame mAP is evaluated using "testlist.txt", provided along with the other annotations. However, testlist.txt is temporally trimmed (it only contains frames that have action ground truths). Does this mean that, when calculating frame mAP, you do not consider frames without action ground truths?
  2. If non-ground-truth frames are left out of the evaluation, wouldn't the computed frame mAP be higher, since the chance of false positives is greatly reduced? (Other studies such as ACT seem to take these false positives into account when evaluating frame mAP.)

(It may be the case that I didn't fully grasp your code ... please don't hesitate to correct me if I made a mistake!) Thank you again.

okankop commented 3 years ago

@alphadadajuju yes, you are right. To calculate f-mAP you need ground truths, so that you can compute the IoU between predictions and ground truths. Therefore, we calculated f-mAP only on frames where we have annotations. However, for video-mAP we used the untrimmed videos.
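
For reference, here is a minimal sketch of the per-class frame-AP matching described above (this is not the repo's actual evaluation code; the input formats, helper names, and the 0.5 IoU threshold are illustrative assumptions):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-10)

def frame_ap(dets, gts, iou_thresh=0.5):
    """Frame-AP for one class.

    dets: list of (frame_id, box, score) detections across all frames.
    gts:  dict mapping frame_id -> list of ground-truth boxes.
    """
    n_gt = sum(len(v) for v in gts.values())
    dets = sorted(dets, key=lambda d: -d[2])  # highest confidence first
    matched = {f: [False] * len(v) for f, v in gts.items()}
    tp = np.zeros(len(dets))
    fp = np.zeros(len(dets))
    for i, (frame, box, _) in enumerate(dets):
        # Detections on frames with no annotations count as false
        # positives here; if such frames are excluded from `dets`
        # up front, they never hurt precision.
        cand = gts.get(frame, [])
        best_iou, best_j = 0.0, -1
        for j, g in enumerate(cand):
            o = iou(box, g)
            if o > best_iou:
                best_iou, best_j = o, j
        if best_iou >= iou_thresh and not matched[frame][best_j]:
            matched[frame][best_j] = True  # each GT matched at most once
            tp[i] = 1
        else:
            fp[i] = 1
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(n_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-10)
    # 11-point interpolated average precision over the PR curve.
    ap = 0.0
    for t in np.arange(0.0, 1.01, 0.1):
        mask = recall >= t
        ap += (precision[mask].max() if mask.any() else 0.0) / 11
    return ap
```

f-mAP is then the mean of `frame_ap` over all classes. The comment in the matching loop is exactly the point raised in the question: restricting evaluation to annotated frames removes one source of false positives.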

For the AVA dataset, only frame-mAP is used for evaluation, and annotations are provided at a sampling rate of 1 Hz.

alphadadajuju commented 3 years ago

Thank you for your clarifications!