Hi! Thanks for your great work!
Take the visual encoder as an example: after the block named Self-CATT, we get Z_hat and X_hat, both of shape [batch_size, num_patches, feat_dim]. Since there is still a dimension for num_patches, I wonder how the concatenation result (shape: [batch_size, num_patches, feat_dim]) is used for the final prediction when the flag 'task_obj_predict' is set to True in /src/lxrt/modeling.py. It seems that the num_patches dimension still exists after the forward pass of the class BertPredictionHeadTransform.
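If I understand the question correctly, the key point may simply be that `nn.Linear` (and hence a BERT-style prediction head built from it) acts only on the last dimension of its input, so the num_patches dimension is expected to survive: the head produces one prediction per patch/region, as in LXMERT's per-object prediction loss. Below is a minimal sketch under that assumption; the layer names, shapes, and class count are illustrative, not taken from the repo.

```python
# Hypothetical sketch: a BERT-style prediction head applied per patch.
# nn.Linear operates on the last dimension only, so the num_patches
# dimension passes through unchanged and each patch gets its own logits.
import torch
import torch.nn as nn

batch_size, num_patches, feat_dim, num_classes = 2, 36, 768, 1600  # illustrative values

z_hat = torch.randn(batch_size, num_patches, feat_dim)
x_hat = torch.randn(batch_size, num_patches, feat_dim)

# Concatenate along the feature dimension (assumption about the concat axis).
fused = torch.cat([z_hat, x_hat], dim=-1)  # [batch_size, num_patches, 2 * feat_dim]

head = nn.Sequential(
    nn.Linear(2 * feat_dim, feat_dim),  # transform step (analogous to BertPredictionHeadTransform)
    nn.GELU(),
    nn.LayerNorm(feat_dim),
    nn.Linear(feat_dim, num_classes),   # decoder to per-object label logits
)

logits = head(fused)
print(logits.shape)  # torch.Size([2, 36, 1600]) -- one prediction per patch
```

The per-patch logits can then be compared against per-region targets (e.g. detected object labels), typically with a cross-entropy loss averaged over the num_patches dimension.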