ylsung / VL_adapter

PyTorch code for "VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks" (CVPR2022)
MIT License

About boxes in video #9

Closed by czy-orange 2 years ago

czy-orange commented 2 years ago

Hi @ylsung, thanks for sharing the code. I would like to know why the boxes for video are all set to [0, 0, 0, 0] rather than [0, 0, 1, 1]. Is there a motivation for this setting? In my view, [0, 0, 1, 1] seems a more reasonable choice, since the whole frame is taken from the video.

ylsung commented 2 years ago

The box information is used in the original implementation of VL-T5, where the inputs are objects extracted by Faster R-CNN. That box information indicates the position of each object. However, in this repo I use CLIP to extract the representation of an image/video, so there is no such box concept for the representation (because there are no objects). Here we only use the position embedding to indicate the position of patches/videos. I set all the boxes to zero just to disable the object position embeddings.
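
To illustrate why all-zero boxes can act as "disabled" position information, here is a minimal sketch. It assumes the box embedding is a plain affine map `y = W x + b` over `[x1, y1, x2, y2]`; that layer shape, the helper names `make_affine` and `embed_boxes`, and all the numbers are hypothetical and not taken from this repo. Under that assumption, every zero box maps to the same vector (the bias), so the box embedding no longer distinguishes one token from another.

```python
import random


def make_affine(in_dim, out_dim, seed=0):
    """Build a hypothetical affine layer (weight W, bias b) with random values."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [rng.uniform(-1, 1) for _ in range(out_dim)]
    return weight, bias


def embed_boxes(boxes, weight, bias):
    """Apply y = W x + b to each box [x1, y1, x2, y2]."""
    return [
        [sum(w * x for w, x in zip(row, box)) + b for row, b in zip(weight, bias)]
        for box in boxes
    ]


weight, bias = make_affine(in_dim=4, out_dim=3)

# Object-style input: each region has its own box, so the embeddings differ.
object_boxes = [[0.1, 0.2, 0.4, 0.6], [0.5, 0.5, 0.9, 1.0]]
object_emb = embed_boxes(object_boxes, weight, bias)

# Patch/video-style input as described in the thread: every box is [0, 0, 0, 0].
zero_boxes = [[0.0, 0.0, 0.0, 0.0] for _ in range(5)]
zero_emb = embed_boxes(zero_boxes, weight, bias)

# With zero input, W x vanishes and only the bias remains for every token.
all_equal_bias = all(row == bias for row in zero_emb)
print("zero boxes collapse to the bias:", all_equal_bias)
```

In this sketch any constant box, including [0, 0, 1, 1], would also yield one shared vector for all tokens; zeros simply remove the weight term entirely, leaving only a constant offset.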

czy-orange commented 2 years ago

Thanks for helping me understand this part; the point is clear now.