About boxes in video

czy-orange commented 2 years ago

Hi @ylsung, thanks for sharing the code. I would like to know why the boxes for video are all set to [0, 0, 0, 0] rather than [0, 0, 1, 1]. Is there a motivation for this setting? In my view, [0, 0, 1, 1] seems the more reasonable choice, since the whole frame is taken from the video.

ylsung commented 2 years ago

The box information is used in the original implementation of VLT5, where the inputs are objects extracted by Faster R-CNN. There, the box information indicates the position of each object. In this repo, however, I use CLIP to extract the representation of an image/video, so there is no box concept for the representation (because there are no objects). Here we only use the position embedding to indicate the position of patches/videos. I set all the boxes to zero just to disable the object position embeddings.
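To illustrate the point above, here is a minimal sketch in plain Python. It is not the code from this repo. It assumes the box embedding is a single linear projection `W x + b` of the four box coordinates. The names `linear`, `weight`, `bias`, and `DIM` are hypothetical, and the weights are random. Under that assumption, all-zero boxes collapse to the same constant vector (the bias) for every input, so the box embedding carries no positional signal.

```python
import random


def linear(x, weight, bias):
    """Compute y = W x + b for one input vector (stand-in for a linear layer)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


random.seed(0)
DIM = 4  # hypothetical embedding size, chosen only for illustration

# Hypothetical box-embedding layer: 4 box coordinates -> DIM features.
weight = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(DIM)]
bias = [random.uniform(-1, 1) for _ in range(DIM)]

# Object-style inputs: each region has its own box, so embeddings differ.
object_boxes = [[0.1, 0.2, 0.5, 0.6], [0.4, 0.1, 0.9, 0.8]]
object_emb = [linear(box, weight, bias) for box in object_boxes]

# Video/patch-style inputs: every box is zero, so W x vanishes
# and each embedding is just the bias term.
zero_boxes = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
zero_emb = [linear(box, weight, bias) for box in zero_boxes]

print(object_emb[0] != object_emb[1])       # distinct boxes give distinct embeddings
print(all(emb == bias for emb in zero_emb))  # zero boxes give a constant embedding
```

Note that in this sketch a fixed box such as [0, 0, 1, 1] would also give the same vector for every frame. The difference is that zero boxes leave only the bias, so the learned box weights contribute nothing.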

czy-orange commented 2 years ago

Thanks for helping me understand this part. The point is clear now.

ylsung / VL_adapter

About boxes in video #9