Thanks again for this useful repo. I'm trying to understand how evaluation is done on RICH. It seems the current code is incomplete here https://github.com/yohanshin/WHAM/blob/2b54f7797391c94876848b905ed875b154c4a295/lib/data_utils/rich_eval_utils.py#L61
I'm specifically wondering how the target person is identified, since other people are sometimes visible in the frame (usually farther in the background). If the model predicts all of them, one prediction must be picked for evaluation. Is the predicted bounding box selected from the detection results based on its similarity to the GT-annotated person's bbox? Or is the GT bbox fed directly to the model as input?
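To make the first option concrete, here is a minimal sketch of what I imagine such a selection step could look like: pick the detection with the highest IoU against the GT box. This is purely illustrative of my question, not code from the repo; the function names and box format (`[x1, y1, x2, y2]`) are my own assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def pick_target_detection(det_boxes, gt_box):
    """Return the index of the detection best matching the GT bbox."""
    ious = [iou(b, gt_box) for b in det_boxes]
    best = max(range(len(ious)), key=lambda i: ious[i])
    return best, ious[best]
```

Is the evaluation doing something along these lines, or does it bypass detection entirely by cropping with the GT bbox?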