Detection score of the segment

zhouzhou-zheng commented 4 months ago

When testing, I input a 150s video into the model.

Test Scenario 1: The input video is of a woman dancing, and the query text is "a woman is dancing." The model correctly detects the corresponding segment, which meets expectations.

Test Scenario 2: The input video does not contain any clips of a woman dancing; it is just a video of a woman sitting on a chair. The query text is "a woman is dancing," yet the model still detects a corresponding segment, which does not meet expectations.

Test Scenario 3: The input is a combination of videos from Scenario 1 and Scenario 2. The query text is "a man is playing basketball." There are no men or basketball scenes in the video, but among the top 10 results, there are still segments with high scores.

My question is: for any test video and query text, does the model always detect a high-scoring positive segment? What causes this behavior? Is it because, during training, every video has at least one segment that matches the query text as a positive example?

wjun0830 commented 4 months ago

Great point! The problem you describe isn't addressed in this work. You may want to try our new model at https://github.com/wjun0830/CGDETR, which may partially address it.

wjun0830 commented 4 months ago

We also agree with your explanation: the problem arises because the input videos seen in training always include a segment that corresponds to the text query.
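
For readers hitting the same behavior, here is a minimal sketch of why a ranked top-k list is never empty, and of what an absolute score cutoff would look like. It is not QD-DETR code: the `Segment` tuple layout, the function names, the score values, and the `min_score` cutoff are all assumptions made for illustration. Note that a cutoff only helps if scores are actually low for unmatched queries, which Scenario 3 above suggests is not guaranteed.

```python
from typing import List, Tuple

# Hypothetical layout: (start_sec, end_sec, score). Not the model's real output format.
Segment = Tuple[float, float, float]


def top_k(segments: List[Segment], k: int = 10) -> List[Segment]:
    """Rank by score and keep the best k. Returns results whenever any segment exists."""
    return sorted(segments, key=lambda s: s[2], reverse=True)[:k]


def filter_by_score(segments: List[Segment], min_score: float) -> List[Segment]:
    """Keep only segments whose score reaches an absolute cutoff. May return nothing."""
    return [s for s in segments if s[2] >= min_score]


# Made-up predictions for a query that matches nothing in the video.
predictions: List[Segment] = [
    (0.0, 12.0, 0.31),
    (40.0, 58.0, 0.27),
    (90.0, 104.0, 0.12),
]

ranked = top_k(predictions, k=2)                    # always yields the 2 best segments
kept = filter_by_score(predictions, min_score=0.5)  # empty when no score reaches the cutoff
```

Ranking is relative, so it cannot express "no match" by itself. A cutoff can, but its value would need to be tuned on videos paired with queries that have no matching segment.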

wjun0830 / QD-DETR

Detection score of the segment #42