wtangdev / UniRel

released code for our EMNLP22 paper: UniRel: Unified Representation and Interaction for Joint Relational Triple Extraction
Apache License 2.0

Error while training model in modify_bert #8

Closed anushkasw closed 1 year ago

anushkasw commented 1 year ago

Hello, I am trying to train the model on the NYT dataset. I am getting the following error:

File "/UniRel/model/modify_bert.py", line 311, in forward embeddings = inputs_embeds + token_type_embeddings RuntimeError: The size of tensor a (138) must match the size of tensor b (131) at non-singleton dimension 1

Can anybody help me out with why this might be happening?

wtangdev commented 1 year ago

Hi! It looks like the dimensions of inputs_embeds and token_type_embeddings do not match. Have you checked the lengths of input_ids and token_type_ids after tokenization in dataset.py?
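
A minimal way to check that (illustrative snippet only, not code from this repo; it assumes the processed samples expose input_ids and token_type_ids as lists) would be something like:

```python
# Illustrative check: report every sample whose input_ids and token_type_ids
# lengths disagree -- these are the two sequences that get summed at
# modify_bert.py line 311 and must have the same length.
def find_length_mismatches(dataset):
    mismatches = []
    for idx, sample in enumerate(dataset):
        n_ids = len(sample["input_ids"])
        n_types = len(sample["token_type_ids"])
        if n_ids != n_types:
            mismatches.append((idx, n_ids, n_types))
    return mismatches

# Usage (the dataset name is hypothetical):
# print(find_length_mismatches(train_dataset))
```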

anushkasw commented 1 year ago

Yes, one of the relation representation strings I was using was being word-piece tokenized by the tokenizer, which caused the dimension mismatch.
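
For reference, here is a quick check that shows which relation strings get split (illustrative only; it assumes a standard BERT tokenizer, and the relation names below are placeholders for the ones in your config):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# Hypothetical relation verbalizations -- substitute the ones you actually use.
relations = ["contains", "place_lived", "nationality"]

for rel in relations:
    pieces = tokenizer.tokenize(rel)
    if len(pieces) > 1:
        # Each extra word piece shifts input_ids relative to token_type_ids,
        # which produces exactly the size mismatch in the traceback above.
        print(f"{rel!r} splits into {pieces}")
```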

So is there any way to avoid this? We can't know beforehand if the tokenizer will split a particular word. Also, is there any way to use multi-word relation representations?

wtangdev commented 1 year ago

Indeed, multi-word relations are somewhat challenging, but there are a couple of straightforward ways to handle them. One is to constrain all relations to a fixed length during tokenization and then apply a pooling method (e.g., average pooling) to standardize the length of the relation input embeddings. Alternatively, you can devise a decoding strategy that does not require pooling.
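
As a rough illustration of the average-pooling idea (just a sketch, not the code in this repo; it assumes BERT's input embedding table and an assumed cap on word pieces per relation):

```python
import torch
from transformers import BertModel, BertTokenizerFast

MAX_REL_PIECES = 4  # assumed cap on word pieces per relation

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
bert = BertModel.from_pretrained("bert-base-cased")
word_embeddings = bert.get_input_embeddings()  # BERT's token embedding table

def relation_embedding(relation_text: str) -> torch.Tensor:
    # Tokenize the relation on its own, cap the number of word pieces, embed
    # them, and mean-pool so the relation still occupies a single position in
    # the unified input sequence.
    ids = tokenizer(relation_text, add_special_tokens=False)["input_ids"][:MAX_REL_PIECES]
    vectors = word_embeddings(torch.tensor(ids))  # (n_pieces, hidden_size)
    return vectors.mean(dim=0)                    # (hidden_size,)

# A multi-word relation such as "place of birth" now maps to a single vector.
print(relation_embedding("place of birth").shape)  # torch.Size([768])
```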

You can try these approaches first; feel free to ask any further questions.

anushkasw commented 1 year ago

Oh ok. That makes sense. Thanks for the idea!

skyWalker1997 commented 1 year ago

I hit the same error (the dimension of inputs_embeds does not match that of token_type_embeddings) on the NYT dataset from the Google Drive link provided by the author.

By the way, I also ran into a "list index out of range" error while processing the data at the line "self.tokenizer.decode(input_ids[t_e])"; I deleted all of the problematic samples to get past it.

@anushkasw @wtangdev Could you please tell me how to solve it?