ylsung / VL_adapter

PyTorch code for "VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks" (CVPR2022)
MIT License

About boxes in video #9

Closed (czy-orange closed this issue 1 year ago)

czy-orange commented 1 year ago

Hi @ylsung, thanks for sharing the code. I would like to know why the boxes for video inputs are all set to [0, 0, 0, 0] rather than [0, 0, 1, 1]. Is there any motivation for this setting? In my view, [0, 0, 1, 1] seems a more reasonable choice, since the whole frame is taken from the video.

ylsung commented 1 year ago

The box information is used in the original implementation of VL-T5, where the inputs are objects extracted by Faster R-CNN. There, the box information indicates the position of each object. However, in this repo I used CLIP to extract the representation of an image/video, so there is no box concept for the representation (because there are no objects). Here we only use the position embedding to indicate the position of patches/videos. I set all the boxes to zero just to disable the object position embeddings.
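To illustrate why all-zero boxes carry no positional signal, here is a minimal sketch in plain Python. It is not the repo's code: it assumes the box embedding is a linear projection of `[x1, y1, x2, y2, area]`, and the helpers `box_features` and `linear`, the dimensions, and the random weights are all hypothetical. Under that assumption, a zero box contributes only the bias term, which is the same constant for every input.

```python
import random

def box_features(box):
    """Hypothetical helper: return [x1, y1, x2, y2, area] for a normalized box."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2, y2, (x2 - x1) * (y2 - y1)]

def linear(features, weight, bias):
    """Plain linear projection; weight is (out_dim x in_dim), bias is (out_dim)."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]

# Random weights stand in for a learned box-embedding layer (assumption).
random.seed(0)
in_dim, out_dim = 5, 4
weight = [[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
bias = [random.uniform(-1, 1) for _ in range(out_dim)]

zero_box = [0.0, 0.0, 0.0, 0.0]        # setting discussed in this issue
full_frame_box = [0.0, 0.0, 1.0, 1.0]  # alternative proposed in the question

zero_embedding = linear(box_features(zero_box), weight, bias)
full_embedding = linear(box_features(full_frame_box), weight, bias)

# The zero box yields only the bias: a constant with no positional content.
print(zero_embedding == bias)
# The full-frame box also yields a constant, but one that depends on the weights.
print(full_embedding == bias)
```

In this sketch, [0, 0, 1, 1] would also give every frame the same vector, so neither choice distinguishes frames; the zero box simply keeps the projection weights out of the result.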

czy-orange commented 1 year ago

Thanks for helping me understand this part. The point is clear now.