yrcong / STTran

Spatial-Temporal Transformer for Dynamic Scene Graph Generation, ICCV2021

Performance very different to Action Genome baselines #17

Closed zyong812 closed 2 years ago

zyong812 commented 2 years ago

Thanks for sharing the nice work!

But I find that the performance presented in the paper is very different from the baselines reported in "Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs" and "Detecting Human-Object Relationships in Videos". What causes this difference?

yrcong commented 2 years ago

Hi! Good question:)

  1. I discussed this with Ji (the author of AG and "Detecting Human-Object Relationships in Videos") when AG was first released. All "person" boxes were annotated with their Faster R-CNN (I don't know whether they have replaced them with human-labeled boxes by now). Object detection in their works is therefore more accurate, while we trained the object detector ourselves. So the SGCLS/SGDET numbers released in AG should be higher than ours for the same baseline.

  2. Moreover, in the paper "Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs", the predicate with the highest score for a subject-object pair is picked without considering the relationship category. For example, the ground-truth annotations could be person-look at-book and person-hold-book, but they only output either hold or look at. In our work, the attention type and the contact type are predicted separately, so our PredCLS numbers are higher than the numbers in AG (see the sketch after this list).

  3. I have no idea about the numbers in "Detecting Human-Object Relationships in Videos"... As you can see, Ji released quite different numbers for the same baselines in different papers; I think they used different evaluation settings. You'd better ask Ji for more details.
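
To make the difference in point 2 concrete, here is a minimal sketch (not the evaluation code from either paper; the category split, scores, and function names are hypothetical) of picking a single top-scoring predicate per subject-object pair versus picking the top predicate within each relationship category:

```python
# Sketch only: contrasts the two predicate-selection strategies discussed above.
# The predicate indices, category split, and toy scores are made up for illustration.
import numpy as np

# Hypothetical split of predicate indices into relationship categories.
ATTENTION = [0, 1, 2]       # e.g. look_at, not_look_at, unsure
CONTACTING = [3, 4, 5, 6]   # e.g. hold, touch, sit_on, ...

def single_top_predicate(scores):
    """One prediction per pair: the highest-scoring predicate over all classes,
    ignoring the relationship category (AG-baseline style)."""
    return [int(np.argmax(s)) for s in scores]

def per_category_predicates(scores):
    """Best predicate within each category, so one pair can receive both an
    attention predicate and a contacting predicate."""
    preds = []
    for s in scores:
        preds.append([ATTENTION[int(np.argmax(s[ATTENTION]))],
                      CONTACTING[int(np.argmax(s[CONTACTING]))]])
    return preds

# Toy example: one person-book pair annotated with both "look at" (0) and "hold" (3).
pair_scores = [np.array([0.60, 0.10, 0.10, 0.55, 0.20, 0.10, 0.05])]
gt = {0, 3}

single = set(single_top_predicate(pair_scores))        # {0}: only one GT triplet recalled
multi = set(per_category_predicates(pair_scores)[0])   # {0, 3}: both GT triplets recalled
print(f"single-predicate recall:  {len(single & gt)}/{len(gt)}")
print(f"per-category recall:      {len(multi & gt)}/{len(gt)}")
```

With a single prediction per pair, at most one of the two ground-truth triplets for the person-book pair can be recalled, which depresses the PredCLS recall relative to the per-category setting.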