zjr2000 / GVL

Official implementation of the paper "Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos"
https://arxiv.org/abs/2303.06378
MIT License
25 stars, 6 forks