kimihailv closed this issue 1 year ago
Hello. Could you please explain how the patch features from the ViT are aggregated into a single feature for a specific region? This point is confusing, because a region's boundary doesn't necessarily line up with patch boundaries, so a region may cover some patches only partially.
Hi, sorry for the late reply. I use image_atts to indicate a region: https://github.com/zengyan-97/X-VLM/blob/master/models/model_pretrain.py#L14
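For anyone else reading, here is a rough sketch of one way a patch-level mask like this could handle boxes that don't align with the patch grid. It is not taken from the X-VLM repository. The function names (region_patch_mask, masked_mean), the (x, y, w, h) pixel box format, the rule that any overlap counts, and the mean pooling step are all assumptions made for illustration. A model could just as well pass the mask to attention layers instead of pooling.

```python
import math


def region_patch_mask(bbox, image_size, patch_size):
    """Return a flat 0/1 mask over ViT patches (row-major) for a pixel-space box.

    bbox is (x, y, w, h) in pixels. Assumption: a patch is marked 1 if the box
    overlaps it at all, so partially covered patches are included.
    """
    x, y, w, h = bbox
    n = image_size // patch_size  # patches per side (square image assumed)

    # First and one-past-last patch index touched by the box along each axis.
    col_start = max(0, int(x // patch_size))
    row_start = max(0, int(y // patch_size))
    col_end = min(n, math.ceil((x + w) / patch_size))
    row_end = min(n, math.ceil((y + h) / patch_size))

    mask = [0] * (n * n)
    for r in range(row_start, row_end):
        for c in range(col_start, col_end):
            mask[r * n + c] = 1
    return mask


def masked_mean(patch_features, mask):
    """Average the feature vectors of the patches whose mask entry is 1.

    Hypothetical pooling step, shown only to make the mask's effect concrete.
    """
    selected = [f for f, m in zip(patch_features, mask) if m]
    if not selected:
        raise ValueError("mask selects no patches")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


# A 64x64 image with 16x16 patches gives a 4x4 grid of patches.
mask = region_patch_mask((10, 20, 30, 20), image_size=64, patch_size=16)
features = [[float(i), 1.0] for i in range(16)]  # toy 2-d patch features
region_feature = masked_mean(features, mask)
```

With this rule the box above touches columns 0 to 2 and rows 1 to 2 of the grid, so six patches are selected even though the box fully covers none of them.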