wjun0830 / QD-DETR

Official PyTorch repository for "QD-DETR: Query-Dependent Video Representation for Moment Retrieval and Highlight Detection" (CVPR 2023 paper)
https://arxiv.org/abs/2303.13874

Question Regarding Feature Extraction Discrepancy Between Training & Inference #7

Closed · rsomani95 closed this issue 1 year ago

rsomani95 commented 1 year ago

Hello, congratulations and thank you for sharing this awesome project!

I am cross-posting a question from the Moment-DETR repo: https://github.com/jayleicn/moment_detr/issues/26

In the paper and training code, it seems that both SlowFast and CLIP video features are used. But at inference time, it looks like only CLIP features are used (based on the run.py file).

Am I understanding this correctly? If so, what is the reason for this discrepancy?
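For reference, this is the kind of per-clip CLIP feature extraction I have in mind. It is only a rough sketch, not the repo's extractor, and the ~2-second clip length and mean-pooling of frame embeddings are my assumptions:

```python
# Rough sketch of per-clip CLIP video features (not the repo's extractor).
# Assumes frames are sampled from each ~2-second clip and mean-pooled.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def encode_clip_segment(frames: "list[Image.Image]") -> torch.Tensor:
    """Encode the frames of one video clip into a single 512-d CLIP feature."""
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    feats = model.encode_image(batch)   # (num_frames, 512) for ViT-B/32
    return feats.mean(dim=0)            # mean-pool to one feature per clip
```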

wjun0830 commented 1 year ago

If you are talking about the file in the run video directory, that file is for custom datasets. We are sorry, but since we haven't used that file for inference ourselves, you might get a better answer from the authors of Moment-DETR.

Also, for the benchmarking experiments, you do not need to run run.py.

rsomani95 commented 1 year ago

@wjun0830 thanks for your response.

My question was more general -- for running inference on any file "in the wild", do you have to extract features using both SlowFast and CLIP? Based on what you said above, it seems like you do, but since the run.py file only extracts CLIP features, I wanted to confirm. I will also wait for the authors of Moment-DETR to respond and will let you know once they do.
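Just to make sure I'm reading the training setup correctly, this is roughly the input I have in mind; the file names are placeholders, and the 2304-d SlowFast / 512-d CLIP dimensions are my assumption based on the Moment-DETR feature release:

```python
# Sketch of the per-clip video features the training pipeline appears to expect:
# SlowFast (2304-d) and CLIP (512-d) features concatenated into 2816-d vectors.
import numpy as np

slowfast_feats = np.load("my_video_slowfast.npy")  # placeholder path, shape (num_clips, 2304)
clip_feats = np.load("my_video_clip.npy")          # placeholder path, shape (num_clips, 512)

assert slowfast_feats.shape[0] == clip_feats.shape[0], "clip counts must match"
video_feats = np.concatenate([slowfast_feats, clip_feats], axis=1)  # (num_clips, 2816)
```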

Do you plan on adding a script to run inference on any file in the wild? I think it would be really helpful for someone (like me :D) trying to see if this can be used in production.

I'm also curious whether you ever considered training a model that doesn't use SlowFast features at all and works only with CLIP features. Do you think that could work? It would make usage on edge devices much more feasible!

wjun0830 commented 1 year ago

Sorry, but for in-the-wild videos we are not currently planning any further work.

As for using only CLIP features, I believe it will definitely work, but there is a chance of performance degradation.
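If you want to try it, here is a rough sketch of how the video-feature settings could differ; the argument names follow the Moment-DETR-style training options and the directory names are placeholders, so please verify them against our config before relying on this:

```python
# Rough sketch of CLIP-only vs. CLIP+SlowFast video-feature settings,
# assuming Moment-DETR-style arguments; directory names are placeholders.
clip_plus_slowfast = {
    "v_feat_dirs": ["features/slowfast_features", "features/clip_features"],
    "v_feat_dim": 2304 + 512,  # SlowFast + CLIP per-clip dims = 2816
}
clip_only = {
    "v_feat_dirs": ["features/clip_features"],
    "v_feat_dim": 512,         # CLIP ViT-B/32 per-clip dim only
}
```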

rsomani95 commented 1 year ago

Got it. The author of Moment-DETR also responded regarding using only CLIP features:

I don't have the exact number as well, as far as I remember, the CLIP only model achieves at least 90-95% of the CLIP+SlowFast model performance, so it is also a very decent model.

Source


for in-the-wild videos we are not currently planning any further work

Would you be open to reviewing a PR if I try to put together some inference code?

wjun0830 commented 1 year ago

How about linking your work in this repo once you open your own repo? Since we haven't tried in-the-wild datasets, that may be a better choice.

rsomani95 commented 1 year ago

Sure, will keep you posted. Closing this for now. Thanks again for your responses.

pribadihcr commented 2 months ago

Hi @rsomani95, do you have a repo for the inference code already? Thanks.