Hi! Thanks for your great work!
Take the visual encoder as an example: after the block named Self-CATT, we get Z_hat and X_hat, both of shape [batch_size, num_patches, feat_dim]. Since there is still a dimension for num_patches, I wonder how the concatenation result (shape: [batch_size, num_patches, feat_dim]) is used for the final prediction when the flag 'task_obj_predict' is set to True in /src/lxrt/modeling.py. It seems that the num_patches dimension still exists after the forward pass of the class BertPredictionHeadTransform.
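If I understand the question correctly, the key point may simply be that `nn.Linear` (and hence a BERT-style prediction head built from it) acts only on the last dimension of its input, so the num_patches dimension is expected to survive: the head produces one prediction per patch/region, as in LXMERT's per-object prediction loss. Below is a minimal sketch under that assumption; the layer names, shapes, and class count are illustrative, not taken from the repo.

```python
# Hypothetical sketch: a BERT-style prediction head applied per patch.
# nn.Linear operates on the last dimension only, so the num_patches
# dimension passes through unchanged and each patch gets its own logits.
import torch
import torch.nn as nn

batch_size, num_patches, feat_dim, num_classes = 2, 36, 768, 1600  # illustrative values

z_hat = torch.randn(batch_size, num_patches, feat_dim)
x_hat = torch.randn(batch_size, num_patches, feat_dim)

# Concatenate along the feature dimension (assumption about the concat axis).
fused = torch.cat([z_hat, x_hat], dim=-1)  # [batch_size, num_patches, 2 * feat_dim]

head = nn.Sequential(
    nn.Linear(2 * feat_dim, feat_dim),  # transform step (analogous to BertPredictionHeadTransform)
    nn.GELU(),
    nn.LayerNorm(feat_dim),
    nn.Linear(feat_dim, num_classes),   # decoder to per-object label logits
)

logits = head(fused)
print(logits.shape)  # torch.Size([2, 36, 1600]) -- one prediction per patch
```

The per-patch logits can then be compared against per-region targets (e.g. detected object labels), typically with a cross-entropy loss averaged over the num_patches dimension.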