zdou0830 / METER

METER: A Multimodal End-to-end TransformER Framework
https://arxiv.org/abs/2111.02387
MIT License

Inference with Fine-tuned SNLI Model #29

Closed: sramshetty closed 2 years ago

sramshetty commented 2 years ago

Hi,

Thank you for the great work and the fine-tuned models. How should I go about running inference with the fine-tuned SNLI model? Currently, I run into this error in my notebook:

1 model = METERTransformerSS(cfg)
----> 2 model.load_state_dict(torch.load("/content/meter_clip16_288_roberta_snli.ckpt")['state_dict'])

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
   1050         if len(error_msgs) > 0:
   1051             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
-> 1052                                self.__class__.__name__, "\n\t".join(error_msgs)))
   1053         return _IncompatibleKeys(missing_keys, unexpected_keys)
   1054 

RuntimeError: Error(s) in loading state_dict for METERTransformerSS:
    Unexpected key(s) in state_dict: "vit_model.token_embedding.weight". 
    size mismatch for vit_model.visual.positional_embedding: copying a param with shape torch.Size([577, 768]) from checkpoint, the shape in current model is torch.Size([197, 768]).

Is this caused by how I configure the model? Is there a specific way I should create the config for inference? Thank you in advance.
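The error above mixes two kinds of problems: a key the model does not have, and a parameter whose shape differs. A minimal sketch of how the two could be separated before calling load_state_dict, assuming the checkpoint and the model have each been reduced to a plain name-to-shape mapping. The helper name diff_state_dict is hypothetical, and the token-embedding shape in the toy data is made up for illustration:

```python
def diff_state_dict(ckpt_shapes, model_shapes):
    """Compare two {parameter_name: shape} mappings and group the differences."""
    # Keys present in the checkpoint but absent from the model.
    unexpected = sorted(k for k in ckpt_shapes if k not in model_shapes)
    # Keys the model expects but the checkpoint does not provide.
    missing = sorted(k for k in model_shapes if k not in ckpt_shapes)
    # Shared keys whose shapes disagree, stored as (checkpoint, model).
    mismatched = {
        k: (tuple(ckpt_shapes[k]), tuple(model_shapes[k]))
        for k in ckpt_shapes
        if k in model_shapes and tuple(ckpt_shapes[k]) != tuple(model_shapes[k])
    }
    return unexpected, missing, mismatched


# Toy shapes mirroring the error message. With PyTorch, such a mapping could
# be built as {k: tuple(v.shape) for k, v in state_dict.items()}.
ckpt = {
    "vit_model.token_embedding.weight": (49408, 512),  # hypothetical shape
    "vit_model.visual.positional_embedding": (577, 768),
}
model = {"vit_model.visual.positional_embedding": (197, 768)}

unexpected, missing, mismatched = diff_state_dict(ckpt, model)
print("unexpected:", unexpected)
print("missing:", missing)
print("mismatched:", mismatched)
```

A shape mismatch like the one on the positional embedding points at the model configuration rather than at the checkpoint file itself.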

zdou0830 commented 2 years ago

Hi, this error occurs because the image_size in your config differs from the image_size the checkpoint was trained with. You can check the README for details.
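The positional-embedding shapes in the error message hint at which image_size each side expects. A rough sketch of that arithmetic, assuming a square image, a patch size of 16 (suggested by "clip16" in the checkpoint name, but still an assumption) and one extra class token. The function name is hypothetical and not part of METER:

```python
import math


def image_size_from_positions(num_positions, patch_size=16, num_extra_tokens=1):
    """Infer the square input resolution from a ViT positional-embedding length."""
    grid_area = num_positions - num_extra_tokens  # number of image patches
    grid_side = math.isqrt(grid_area)             # patches per side
    if grid_side * grid_side != grid_area:
        raise ValueError(f"{grid_area} patches do not form a square grid")
    return grid_side * patch_size


# Lengths taken from the size-mismatch line in the error message.
checkpoint_size = image_size_from_positions(577)
model_size = image_size_from_positions(197)
print(checkpoint_size, model_size)
```

Under these assumptions the checkpoint corresponds to 384-pixel inputs and the freshly built model to 224-pixel inputs, so the config would need an image_size that matches the checkpoint. The README remains the authoritative source for the value to use.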

sramshetty commented 2 years ago

Hi,

Yeah, I was able to get inference working by correcting my model initialization, thank you though. However, I do have a question about the fine-tuning of the model. Was it done with the image-hypothesis pair only, or was the given caption also used? I know OFA concatenates the caption and the hypothesis, but it doesn't look like that is done here; is that correct? I may be misunderstanding, but hopefully my question is still clear. Thank you!
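To make the question concrete, here is a minimal sketch of the two text-input schemes being contrasted. Everything in it is hypothetical (function name, field names, separator, file name, example sentences); it reflects neither METER's nor OFA's actual preprocessing code:

```python
def build_entailment_example(image_path, hypothesis, caption=None, use_caption=False):
    """Build one visual-entailment sample as a plain dict (illustrative only)."""
    if use_caption and caption is not None:
        # Scheme B: caption and hypothesis concatenated into one text input.
        text = f"{caption} {hypothesis}"
    else:
        # Scheme A: the image is paired with the hypothesis alone.
        text = hypothesis
    return {"image": image_path, "text": text}


caption = "A man is riding a bike."        # made-up premise caption
hypothesis = "A person is outdoors."       # made-up hypothesis

pair_only = build_entailment_example("example.jpg", hypothesis, caption)
with_caption = build_entailment_example("example.jpg", hypothesis, caption, use_caption=True)
print(pair_only["text"])
print(with_caption["text"])
```

In scheme A the model has to judge the hypothesis from the image alone, whereas scheme B also exposes the textual premise.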

zdou0830 commented 2 years ago

Hello, the captions should NOT be used in visual entailment. Here is a table from the CoCa paper.

[image: table from the CoCa paper]

sramshetty commented 2 years ago

Awesome, thank you for the prompt response.