Hi! Good question:)
I discussed this with Ji (the author of AG and "Detecting Human-Object Relationships in Videos") when AG was first released. All "person" boxes were annotated with their Faster R-CNN detections (I don't know whether they have since updated them with human-labeled boxes). Object detection in their works is therefore more accurate, while we trained the object detector ourselves. The SGCLS/SGDET numbers reported in AG should be higher than ours (same baseline).
Moreover, in the paper "Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs", only the predicate with the highest score is picked for each subject-object pair, without considering the relationship type. For example, if the ground-truth annotations are person-look at-book and person-hold-book, they output only hold or look at. In our work, the attention type and the contact type are handled separately, so our PredCLS numbers are higher than the numbers in AG.
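Here is a minimal sketch of why this matters (the predicate names and scores below are hypothetical, not the actual evaluation code of either repo): when a pair carries both an attention and a contact annotation, picking only the single top-scoring predicate can match at most one of them, whereas picking the best predicate per type can match both.

```python
# Hypothetical predicate sets and scores for one person-book pair.
ATTENTION_PREDICATES = {"looking_at", "not_looking_at", "unsure"}
CONTACT_PREDICATES = {"holding", "touching", "not_contacting"}

scores = {
    "looking_at": 0.55,
    "holding": 0.80,
    "touching": 0.10,
    "not_looking_at": 0.05,
}

# Ground truth: the pair has BOTH an attention and a contact relation.
gt = {"looking_at", "holding"}

# AG-style: keep only the single highest-scoring predicate for the pair,
# regardless of relationship type -> at most one GT relation can be matched.
ag_pick = {max(scores, key=scores.get)}          # {'holding'}

# Per-type: pick the best attention predicate and the best contact
# predicate separately -> both GT relations can be matched.
best_attention = max((p for p in scores if p in ATTENTION_PREDICATES), key=scores.get)
best_contact = max((p for p in scores if p in CONTACT_PREDICATES), key=scores.get)
per_type_pick = {best_attention, best_contact}   # {'looking_at', 'holding'}

print("AG-style matches:", len(ag_pick & gt), "of", len(gt))        # 1 of 2
print("Per-type matches:", len(per_type_pick & gt), "of", len(gt))  # 2 of 2
```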
I have no idea about the numbers in "Detecting Human-Object Relationships in Videos"... As you can see, Ji reported quite different numbers for the same baselines in different papers; I think they used different evaluation settings. You'd better ask Ji for more details.
Thanks for sharing this nice work!
However, I find that the performance reported in the paper is very different from that of the methods in "Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs" and "Detecting Human-Object Relationships in Videos". What causes this difference?