Closed by czy-orange 2 years ago
The box information is used in the original implementation of VLT5, where the inputs are objects extracted by Faster R-CNN. There, the boxes indicate the position of each object. In this repo, however, I use CLIP to extract the representation of an image/video, so there is no notion of a box for the representation (because there are no objects). Here we only use the position embedding to indicate the position of patches/videos. I set all the boxes to zero just to disable the object position embeddings.
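As a rough illustration of the point above (not the repo's actual code), the sketch below builds all-zero boxes for a few frames and passes them through a hypothetical linear box embedding written in plain Python. The names `make_zero_boxes` and `linear_box_embedding`, the embedding size, and the random weights are all assumptions made for this example.

```python
import random


def make_zero_boxes(num_frames):
    """Return one [x1, y1, x2, y2] placeholder box per frame, all zeros."""
    return [[0.0, 0.0, 0.0, 0.0] for _ in range(num_frames)]


def linear_box_embedding(boxes, weight, bias):
    """Hypothetical box embedding: out[j] = dot(box, weight[j]) + bias[j].

    `weight` holds one length-4 vector per output dimension.
    """
    return [
        [sum(b * w for b, w in zip(box, row)) + c for row, c in zip(weight, bias)]
        for box in boxes
    ]


# Assumed toy parameters, only for illustration.
random.seed(0)
dim = 3
weight = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(dim)]
bias = [random.uniform(-1, 1) for _ in range(dim)]

boxes = make_zero_boxes(num_frames=5)
emb = linear_box_embedding(boxes, weight, bias)

# Every frame receives the same vector, and that vector is just the bias,
# so the box branch carries no per-frame position information.
all_same = all(row == emb[0] for row in emb)
equals_bias = emb[0] == bias
print(all_same, equals_bias)
```

Under these assumptions, any constant box (including [0, 0, 1, 1]) would give a frame-independent embedding; zeros additionally reduce a linear layer to its bias term, which is one way to read "disabling" the object position embedding.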
Thanks for helping me understand this part. The point is clear now.
Hi @ylsung, thanks for sharing the code. I would like to know why the boxes for videos are all set to [0, 0, 0, 0] rather than [0, 0, 1, 1]. Is there any motivation for this setting? In my view, [0, 0, 1, 1] seems a more reasonable choice, since the whole frame is taken from the video.