Hi,
Thanks for sharing your paper and code.
I am trying to run Retrieval.py with the provided checkpoint "xvlm_beit_1b_large_stage2_coco_rerun.th", but it fails to load; the model code seems inconsistent with the checkpoint. I get the error below.
Can you please check it?
Thanks.
Ofer
Error(s) in loading state_dict for XVLMForRetrieval:
	size mismatch for vision_encoder.blocks.0.attn.relative_position_bias_table: copying a param with shape torch.Size([2212, 16]) from checkpoint, the shape in current model is torch.Size([732, 16]).
	size mismatch for vision_encoder.blocks.0.attn.relative_position_index: copying a param with shape torch.Size([577, 577]) from checkpoint, the shape in current model is torch.Size([197, 197]).
	[the same two size mismatches are reported for every block, vision_encoder.blocks.0 through vision_encoder.blocks.23]
Traceback (most recent call last):
  File "/datasets1/ofer/X2-VLM/test.py", line 388, in <module>
    main(args, config, test_type)
  File "/datasets1/ofer/X2-VLM/test.py", line 303, in main
    model.load_pretrained(args.checkpoint, config, is_eval=args.evaluate)
  File "/datasets1/ofer/X2-VLM/models/xvlm.py", line 612, in load_pretrained
    msg = self.load_state_dict(state_dict, strict=False)
RuntimeError: Error(s) in loading state_dict for XVLMForRetrieval:
	size mismatch for vision_encoder.blocks.0.attn.relative_position_bias_table: copying a param with shape torch.Size([2212, 16]) from checkpoint, the shape in current model is torch.Size([732, 16]).
	size mismatch for vision_encoder.blocks.0.attn.relative_position_index: copying a param with shape torch.Size([577, 577]) from checkpoint, the shape in current model is torch.Size([197, 197]).
	[the same two size mismatches are reported for every block, vision_encoder.blocks.0 through vision_encoder.blocks.23]
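For what it's worth, the mismatched shapes look consistent with an image-resolution mismatch rather than corrupted weights. Assuming a BEiT-style encoder with a 16-pixel patch size and a relative-position bias table of (2*side - 1)^2 + 3 entries (the +3 covering the [CLS]-token cases), a 384x384 checkpoint would have 24x24 + 1 = 577 tokens and 2212 bias entries, while a model built for 224x224 would have 14x14 + 1 = 197 tokens and 732 entries. A quick sanity check of that arithmetic (the function name and conventions here are my own assumption, not from the repo):

```python
# Hypothetical sanity check: do the reported shapes match a 384-vs-224
# image-size difference? Assumes a ViT/BEiT-style encoder with patch size 16
# and a relative-position bias table of (2*side - 1)**2 + 3 entries.

def vit_shapes(image_size: int, patch_size: int = 16):
    side = image_size // patch_size          # patches per image side
    tokens = side * side + 1                 # patch tokens + [CLS]
    bias_entries = (2 * side - 1) ** 2 + 3   # BEiT relative-position bias rows
    return tokens, bias_entries

print(vit_shapes(384))  # (577, 2212) -- matches the checkpoint shapes
print(vit_shapes(224))  # (197, 732)  -- matches the current model shapes
```

If that reading is right, setting the config's image resolution to 384 (or interpolating the bias tables before loading) might resolve the mismatch, but I have not verified which config key the repo uses for this.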
size mismatch for vision_encoder.blocks.11.attn.relative_position_index: copying a param with shape torch.Size([577, 577]) from checkpoint, the shape in current model is torch.Size([197, 197]).
size mismatch for vision_encoder.blocks.12.attn.relative_position_bias_table: copying a param with shape torch.Size([2212, 16]) from checkpoint, the shape in current model is torch.Size([732, 16]).
size mismatch for vision_encoder.blocks.12.attn.relative_position_index: copying a param with shape torch.Size([577, 577]) from checkpoint, the shape in current model is torch.Size([197, 197]).
size mismatch for vision_encoder.blocks.13.attn.relative_position_bias_table: copying a param with shape torch.Size([2212, 16]) from checkpoint, the shape in current model is torch.Size([732, 16]).
size mismatch for vision_encoder.blocks.13.attn.relative_position_index: copying a param with shape torch.Size([577, 577]) from checkpoint, the shape in current model is torch.Size([197, 197]).
size mismatch for vision_encoder.blocks.14.attn.relative_position_bias_table: copying a param with shape torch.Size([2212, 16]) from checkpoint, the shape in current model is torch.Size([732, 16]).
size mismatch for vision_encoder.blocks.14.attn.relative_position_index: copying a param with shape torch.Size([577, 577]) from checkpoint, the shape in current model is torch.Size([197, 197]).
size mismatch for vision_encoder.blocks.15.attn.relative_position_bias_table: copying a param with shape torch.Size([2212, 16]) from checkpoint, the shape in current model is torch.Size([732, 16]).
size mismatch for vision_encoder.blocks.15.attn.relative_position_index: copying a param with shape torch.Size([577, 577]) from checkpoint, the shape in current model is torch.Size([197, 197]).
size mismatch for vision_encoder.blocks.16.attn.relative_position_bias_table: copying a param with shape torch.Size([2212, 16]) from checkpoint, the shape in current model is torch.Size([732, 16]).
size mismatch for vision_encoder.blocks.16.attn.relative_position_index: copying a param with shape torch.Size([577, 577]) from checkpoint, the shape in current model is torch.Size([197, 197]).
size mismatch for vision_encoder.blocks.17.attn.relative_position_bias_table: copying a param with shape torch.Size([2212, 16]) from checkpoint, the shape in current model is torch.Size([732, 16]).
size mismatch for vision_encoder.blocks.17.attn.relative_position_index: copying a param with shape torch.Size([577, 577]) from checkpoint, the shape in current model is torch.Size([197, 197]).
size mismatch for vision_encoder.blocks.18.attn.relative_position_bias_table: copying a param with shape torch.Size([2212, 16]) from checkpoint, the shape in current model is torch.Size([732, 16]).
size mismatch for vision_encoder.blocks.18.attn.relative_position_index: copying a param with shape torch.Size([577, 577]) from checkpoint, the shape in current model is torch.Size([197, 197]).
size mismatch for vision_encoder.blocks.19.attn.relative_position_bias_table: copying a param with shape torch.Size([2212, 16]) from checkpoint, the shape in current model is torch.Size([732, 16]).
size mismatch for vision_encoder.blocks.19.attn.relative_position_index: copying a param with shape torch.Size([577, 577]) from checkpoint, the shape in current model is torch.Size([197, 197]).
size mismatch for vision_encoder.blocks.20.attn.relative_position_bias_table: copying a param with shape torch.Size([2212, 16]) from checkpoint, the shape in current model is torch.Size([732, 16]).
size mismatch for vision_encoder.blocks.20.attn.relative_position_index: copying a param with shape torch.Size([577, 577]) from checkpoint, the shape in current model is torch.Size([197, 197]).
size mismatch for vision_encoder.blocks.21.attn.relative_position_bias_table: copying a param with shape torch.Size([2212, 16]) from checkpoint, the shape in current model is torch.Size([732, 16]).
size mismatch for vision_encoder.blocks.21.attn.relative_position_index: copying a param with shape torch.Size([577, 577]) from checkpoint, the shape in current model is torch.Size([197, 197]).
size mismatch for vision_encoder.blocks.22.attn.relative_position_bias_table: copying a param with shape torch.Size([2212, 16]) from checkpoint, the shape in current model is torch.Size([732, 16]).
size mismatch for vision_encoder.blocks.22.attn.relative_position_index: copying a param with shape torch.Size([577, 577]) from checkpoint, the shape in current model is torch.Size([197, 197]).
size mismatch for vision_encoder.blocks.23.attn.relative_position_bias_table: copying a param with shape torch.Size([2212, 16]) from checkpoint, the shape in current model is torch.Size([732, 16]).
size mismatch for vision_encoder.blocks.23.attn.relative_position_index: copying a param with shape torch.Size([577, 577]) from checkpoint, the shape in current model is torch.Size([197, 197]).
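For what it's worth, the mismatched shapes look consistent with a resolution difference rather than a corrupted file: with a patch size of 16 (an assumption on my part), 577 tokens correspond to a 384×384 input (24×24 patches + CLS) and 197 tokens to a 224×224 input (14×14 + CLS), and the BEiT-style bias tables work out to (2·24−1)²+3 = 2212 vs. (2·14−1)²+3 = 732. A quick sanity check of that arithmetic:

```python
# Sanity check: derive the token count and BEiT-style relative-position-bias
# table size implied by the error message, for the two candidate resolutions.
# Assumes patch size 16 and BEiT's "+3" extra entries for the CLS token.

def beit_shapes(image_res, patch_size=16):
    side = image_res // patch_size            # patches per side
    num_tokens = side * side + 1              # +1 for the CLS token
    bias_table = (2 * side - 1) ** 2 + 3      # relative offsets + 3 CLS entries
    return num_tokens, bias_table

print(beit_shapes(384))  # → (577, 2212), matching the checkpoint
print(beit_shapes(224))  # → (197, 732), matching the current model
```

So it may just be that the retrieval config builds the vision encoder at 224×224 while this checkpoint was fine-tuned at 384×384 — if so, setting the config's image resolution to 384 (or interpolating the bias tables when loading) might be enough, but I'd appreciate confirmation of the intended config.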