lumiaomiao opened this issue 3 years ago
Sorry for confusing you.
The object detection results provide both object category information and bounding boxes. Here, we only use the bounding boxes for inferring the HOI category; the training phase is the same as in the previous setting. In fact, * means we use the same model as ATL but do not use the object category information during inference.
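Concretely, the difference between the two inference settings can be sketched as below. This is only an illustrative sketch, not the ATL code itself: `infer_hoi`, `hoi_net`, and `obj_to_hoi_mask` are hypothetical names.

```python
def infer_hoi(hoi_net, image, human_box, object_box,
              obj_category=None, obj_to_hoi_mask=None):
    """Score all HOI categories for one human-object box pair.

    With `obj_category` given (the default setting), HOI scores are
    restricted to categories compatible with the detected object class.
    With `obj_category=None` (the * setting), only the boxes are used
    and every HOI category is scored directly.
    """
    scores = hoi_net(image, human_box, object_box)  # shape: (num_hoi_classes,)
    if obj_category is not None:
        # Zero out HOI categories whose object class differs from the
        # detected one, using a precomputed {object -> HOI} mask.
        scores = scores * obj_to_hoi_mask[obj_category]
    return scores
```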
Feel free to contact me if you have further questions.
Regards,
Thank you for your reply.
@zhihou7 Hi, I have another question about the code. The function get_new_Trainval_N in lib/ult/ult.py is defined as:
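(The original snippet was not preserved in this thread; the following is a reconstruction of the pattern being asked about, with assumed argument names. `Trainval_N` maps an HOI class id k to its list of negative samples, and `unseen_idx` holds the unseen class ids for the zero-shot setting.)

```python
def get_new_Trainval_N(Trainval_N, is_zero_shot, unseen_idx):
    if is_zero_shot > 0:
        new_Trainval_N = {}
        for k in Trainval_N.keys():
            if k not in unseen_idx:
                # The line in question: this copies the negatives of class 4
                # for every seen class k, instead of reading Trainval_N[k].
                new_Trainval_N[k] = list(Trainval_N[4])
        Trainval_N = new_Trainval_N
    return Trainval_N
```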
Why use "Trainval_N[4]" instead of "Trainval_N[k]"?
Thanks for your comment. It should be Trainval_N[k]. It is a bug carried over from the VCL code that I forgot to update. After fixing this bug, the performance improves a bit. The bug also fails to add the seen classes for the zero-shot setting, so it only affects the performance slightly.
I have updated the code.
Thanks.
Thank you for your quick reply.
@zhihou7 In the following code, if an image contains two pairs <h1, v1, o1> and <h1, v2, o1>, and the first one is in the unseen composition list, then you delete both pairs from the training data. Why don't you delete only the first one? In my view, deleting only the first one is closer to the description in your paper.
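(The referenced code is not shown in the thread; this is a sketch of the filtering behaviour being discussed, with hypothetical names. Each GT entry is one annotated human-object sample, and GT[1] is its list of HOI labels, as explained in the reply below.)

```python
def filter_unseen(gt_list, unseen_hois):
    """Drop every sample that carries at least one unseen HOI label."""
    kept = []
    for GT in gt_list:
        # Remove the whole sample if ANY of its HOI labels is unseen,
        # rather than deleting only the unseen label itself.
        if any(label in unseen_hois for label in GT[1]):
            continue
        kept.append(GT)
    return kept
```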
Here, GT[1] is the HOI label list of an HOI sample, e.g., [eat apple, hold apple]. If "eat apple" is an unseen category, I think it is fair to remove the whole HOI sample rather than only the annotation "eat apple". Otherwise, a sample of "eat apple" would still exist in the training data but be unlabeled, which I think is different from the zero-shot setting.
I get it, thank you.
Hi, could you explain the * in Table 3 of ATL? You described it as "* means we only use the boxes of the detection results", but how do you use the category of the detection results in the training phase and the inference phase?