Closed by rsomani95 1 year ago
If you are talking about the file in the run video directory, that is the file for the custom dataset. We're sorry, but since we haven't used that file for inference, you might get a better answer from the authors of Moment-DETR.
Also, for the benchmarking experiments, you do not need to run run.py.
@wjun0830 thanks for your response.
My question was more general -- for running inference on any file "in the wild", do you have to extract features using both SlowFast and CLIP? Based on what you said above, it seems like you do, but since the run.py file only extracts CLIP features, I wanted to confirm. I will also wait for the authors of Moment-DETR to respond and let you know once they do.
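(For context on what "extracting features" for a wild video involves: both extractors produce one feature vector per fixed-length clip. A minimal sketch of the clip-to-frame bookkeeping, assuming 2-second clips as in Moment-DETR's setup; the function name and the middle-frame heuristic are illustrative, not from the repo.)

```python
def clip_frame_indices(num_frames: int, fps: float, clip_len_s: float = 2.0) -> list[int]:
    """Return one representative frame index per fixed-length clip.

    clip_len_s=2.0 mirrors the 2-second clips used for moment-retrieval
    features (an assumption here; check the repo's extraction config).
    """
    frames_per_clip = fps * clip_len_s
    num_clips = int(num_frames / frames_per_clip)  # drop any trailing partial clip
    # Take the middle frame of each clip as its representative.
    return [int((i + 0.5) * frames_per_clip) for i in range(num_clips)]

# e.g. a 10-second video at 30 fps yields 5 clips:
indices = clip_frame_indices(num_frames=300, fps=30)
```

Each selected frame (or the whole clip, for SlowFast) would then be passed through the respective encoder to build the per-clip feature sequence the model consumes.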
Do you plan on adding a script to run inference on any file in the wild? I think it would be really helpful for someone (like me :D) trying to see if this can be used in production.
I'm also curious to know if you ever considered training a model that doesn't use SlowFast features at all and only works with CLIP features? Do you think that could work? It would make usage on edge devices way more feasible!
Sorry, but for wild videos, we are not currently planning any further work.
As for using only CLIP features, I believe it will definitely work, but there is a chance of performance degradation.
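(Concretely, dropping SlowFast mainly changes the width of the model's video input: per-clip features from the two extractors are concatenated before the input projection. A sketch of the dimension arithmetic, assuming the commonly used 512-d CLIP ViT-B/32 features and 2304-d SlowFast features; verify both against the repo's feature extractor before relying on them.)

```python
# Assumed feature widths (check the actual extraction config):
CLIP_DIM = 512      # CLIP ViT-B/32 visual features
SLOWFAST_DIM = 2304  # typical SlowFast per-clip features

def video_input_dim(use_slowfast: bool) -> int:
    """Width of the concatenated per-clip video feature the model's
    input projection layer must accept."""
    return CLIP_DIM + (SLOWFAST_DIM if use_slowfast else 0)

full_dim = video_input_dim(use_slowfast=True)   # CLIP + SlowFast
clip_only_dim = video_input_dim(use_slowfast=False)  # CLIP only
```

A CLIP-only variant would therefore need its input projection retrained for the narrower feature, which is why simply zeroing out the SlowFast half of a pretrained checkpoint isn't expected to work.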
Got it. The author of Moment-DETR also responded re. using only CLIP features:
I don't have the exact number either; as far as I remember, the CLIP-only model achieves at least 90-95% of the CLIP+SlowFast model's performance, so it is also a very decent model.
for the wild videos, currently, we are not planning any further works
Would you be open to reviewing a PR if I try to put together some inference code?
How about linking your work in this repo once you open your repo? Since we haven't tried using wild datasets, that may be a better choice.
Sure, will keep you posted. Closing this for now. Thanks again for your responses.
Hi @rsomani95, do you have a repo for the inference code already? Thanks
Hello, congratulations and thank you for sharing this awesome project!
I am cross posting a question from the Moment-DETR repo: https://github.com/jayleicn/moment_detr/issues/26
In the paper and training code, it seems that both SlowFast and CLIP video features are used. But at inference time, it looks like only CLIP features are being used (based on the run.py file). Am I understanding this correctly? If yes, what is the reason for this discrepancy?