zhihou7 / HOI-CL

Series of work (ECCV2020, CVPR2021, CVPR2021, ECCV2022) about Compositional Learning for Human-Object Interaction Exploration
https://sites.google.com/view/hoi-cl
MIT License

Confusion about Affordance Features #1

Closed ASMIftekhar closed 3 years ago

ASMIftekhar commented 3 years ago

Hello, thanks for your nice works. After reading the ATL paper I am confused about the affordance features. You said in the paper

We first extract the human, object, and affordance features via the ROI-Pooling from the feature pyramids.

What are these affordance features, exactly? I mean, where are these features pooled from?

zhihou7 commented 3 years ago

Hi, thanks for your interest. In our experiments, the affordance features are pooled from the union box of the human and the object, i.e., the same as the verb features in VCL and FCL. We think a verb describes an existing interaction from the person's perspective, while an affordance illustrates the interaction possibilities (or action possibilities) of an object.
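For readers following along: the union box is just the tightest box enclosing both the human and the object detections, and the affordance/verb feature is ROI-pooled from that region. A minimal sketch, assuming `(x1, y1, x2, y2)` box coordinates (`union_box` is an illustrative helper, not a function from the released code):

```python
def union_box(human_box, object_box):
    """Tightest box covering both a human box and an object box.

    Boxes are (x1, y1, x2, y2) tuples; the affordance/verb feature
    would then be ROI-pooled from this union region of the feature map.
    """
    hx1, hy1, hx2, hy2 = human_box
    ox1, oy1, ox2, oy2 = object_box
    return (min(hx1, ox1), min(hy1, oy1), max(hx2, ox2), max(hy2, oy2))
```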

Regards,

ASMIftekhar commented 3 years ago

Ok, got it. In that case, I am curious what would happen if we just used the human features as the affordance features. Affordance features are basically concatenated with object features to compose new HOIs. It would be interesting to see the results. I am not sure if you have tested it already. Anyway, thanks for the reply.
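The composition step being discussed can be sketched in one line: a new HOI representation is formed by concatenating an affordance (verb) feature with an object feature before classification. A hedged sketch (`compose_hoi` and the 512-d feature size are illustrative assumptions, not from the paper's code):

```python
import numpy as np

def compose_hoi(affordance_feat, object_feat):
    # Compose a (possibly novel) HOI representation by concatenating
    # an affordance/verb feature with an object feature, as in the
    # VCL/FCL-style composition discussed above.
    return np.concatenate([affordance_feat, object_feat], axis=-1)

# Example: two 512-d features give a 1024-d composite HOI feature.
hoi = compose_hoi(np.zeros(512), np.ones(512))
```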

zhihou7 commented 3 years ago

Hi, with the human box feature as the affordance, HOI detection performance decreases noticeably compared to the union box (see Table 3 in VCL). However, the human box with the compositional approach still effectively improves over the baseline. We also evaluated the human box in FCL, where we observed a similar trend (human box baseline: 22.91 / 16.66 / 24.77; human box FCL: 23.83 / 18.62 / 25.39). Thus, the human box does not affect the effectiveness of the compositional approach.

However, we did not evaluate the human box on affordance recognition, since we found the union box achieves consistent improvements over the human box. We think the human box would not affect the effectiveness of the visual compositional approach on affordance recognition compared to the baseline. As for comparing the human box and the union box on affordance recognition, we intuitively think the union box might be better because it achieves a better verb representation, but we are not sure. We have removed the model weights of the human box model, so we cannot evaluate this right now. While considering your question, we found a set of experiments on the verb auxiliary loss, which achieves a better verb representation and HOI detection result:

| method | val2017 | object365_coco | gthico | object365 | HOI detection |
| --- | --- | --- | --- | --- | --- |
| baseline with verb auxiliary loss | 19.71 | 17.86 | 23.18 | 6.80 | 23.44 |
| baseline without verb auxiliary loss | 19.77 | 17.85 | 27.23 | 6.90 | 22.83 |

The table (reported in mAP; we first evaluated ATL in F1, but found F1 might not be as robust as mAP while preparing the camera-ready) corresponds to Tab. 5 in the Appendix. However, the auxiliary loss does not seem to always improve affordance recognition. Thanks for your comments; we hadn't noticed this before.

ASMIftekhar commented 3 years ago

Thanks a lot for your clarification. I am just a bit skeptical about using union boxes as affordance features, since union boxes contain the old object features.

zhihou7 commented 3 years ago

You are welcome. Your question is valuable. I think the compositional approach (composing verbs and objects among different images) also enforces the verb representation to be more discriminative (see the t-SNE figure in VCL). This might alleviate the effect of the old object features; otherwise, VCL would not improve over the corresponding union box baseline.
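The cross-image composition mentioned here can be sketched as pairing every verb feature with every object feature, including objects from other images, so novel verb-object combinations are seen during training. A minimal sketch with NumPy (`compose_across_images` is an illustrative helper; the actual training code also filters infeasible pairs):

```python
import numpy as np

def compose_across_images(verb_feats, obj_feats):
    # verb_feats: (N, d_v) verb features, e.g. from one set of images
    # obj_feats:  (M, d_o) object features, possibly from other images
    # Pair every verb with every object, yielding N*M composite HOI
    # features; off-diagonal pairs are the cross-image compositions.
    n, m = verb_feats.shape[0], obj_feats.shape[0]
    v = np.repeat(verb_feats, m, axis=0)   # (N*M, d_v)
    o = np.tile(obj_feats, (n, 1))         # (N*M, d_o)
    return np.concatenate([v, o], axis=1)  # (N*M, d_v + d_o)
```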

When the affordance recognition evaluation of the human box model finishes, I'll post the result.

zhihou7 commented 3 years ago

Well, the mAP of the ATL (HICO) model on the HICO test set is 46.32, which is much worse than the result (59.44) of the corresponding union box model in Tab. 12 in the Appendix. I'll check the result again after the model converges.

zhihou7 commented 3 years ago

The HOI detection performance of the human box model (ATL (HICO)) is 22.99% mAP. The mAP on COCO val2017 is 39.40%, which is also much worse than the 52.01 of the corresponding union box model in Tab. 12 in the Appendix. All the results are worse than I expected.

ASMIftekhar commented 3 years ago

I really appreciate how you take the time to run experiments to answer my questions. You might want to add this experiment to the supplementary material of the paper.

zhihou7 commented 3 years ago

Thanks. It is just because I have benefited a lot from taking questions seriously (especially comments from peer review). The first two works (VCL & ATL) were rejected on their first submission, but the reviewers' comments made the papers better and sometimes inspired me a lot. I'll consider adding this experiment to the Appendix. It doesn't take much time to run these small experiments; I just submit a job and wait for the results.

zhihou7 commented 3 years ago

Hi, I have added more experiments and updated the preprint on arXiv: https://arxiv.org/abs/2104.02867. Interestingly, I find that with the human box verb representation, the performance of the baseline increases, while the performance of ATL drops.