Hi! It looks like the dimensions of `inputs_embeds` and `token_type_embeddings` do not match. Did you check the lengths of `input_ids` and `token_type_ids` after tokenization in `dataset.py`?
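(For concreteness, the mismatch typically arises as in the schematic sketch below; this is not the repo's actual code, and `tokenizer`, `text`, and `relations` are stand-ins.)

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint
text = "Steve Jobs founded Apple in Cupertino."               # placeholder sentence
relations = ["founder", "birthplace"]                          # placeholder relation strings

# input_ids cover the text plus the tokenized relation strings...
text_ids = tokenizer(text)["input_ids"]
rel_ids = tokenizer(" ".join(relations), add_special_tokens=False)["input_ids"]
input_ids = text_ids + rel_ids

# ...but token_type_ids are built assuming one token per relation, so any
# relation that gets split into several word pieces breaks the alignment.
token_type_ids = [0] * len(text_ids) + [1] * len(relations)

if len(input_ids) != len(token_type_ids):
    print(f"mismatch: {len(input_ids)} input ids vs {len(token_type_ids)} token type ids")
```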
Yes, one of the relation representation strings I was using was being word-piece tokenized by the tokenizer, which caused the dimension mismatch.
So is there any way to avoid this? We can't know beforehand if the tokenizer will split a particular word. Also, is there any way to use multi-word relation representations?
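(You can in fact check ahead of time which relation words the tokenizer will split; a minimal illustration follows, with the checkpoint name and word list as assumptions.)

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint

# A word missing from the vocabulary comes back as multiple word
# pieces; those are the relations that break the length alignment.
for word in ["founder", "birthplace", "nationality"]:  # placeholder relation words
    pieces = tokenizer.tokenize(word)
    print(f"{word!r} -> {pieces} ({len(pieces)} piece(s))")
```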
Multi-word relations can indeed be a bit challenging, but there are a few straightforward ways to handle them. One is to constrain all relations to a fixed length during tokenization and then apply a pooling method (e.g., average pooling) to standardize the length of the relation input embeddings. Alternatively, you can devise a decoding strategy that does not require pooling at all.
You can try these approaches first; feel free to ask any further questions.
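(A minimal sketch of the average-pooling idea; this is not the repo's code, and the names and shapes are assumptions.)

```python
import torch

def pool_relation_embeddings(piece_embeds, piece_counts):
    """Average word-piece embeddings so that every relation contributes
    exactly one vector, however many pieces the tokenizer produced.

    piece_embeds: (total_pieces, hidden) embeddings of all relation pieces
    piece_counts: number of word pieces per relation, in order
    """
    pooled, start = [], 0
    for n in piece_counts:
        pooled.append(piece_embeds[start:start + n].mean(dim=0))
        start += n
    return torch.stack(pooled)  # (num_relations, hidden)

# Example: three relations split into 1, 3, and 2 word pieces.
pooled = pool_relation_embeddings(torch.randn(6, 768), [1, 3, 2])
print(pooled.shape)  # torch.Size([3, 768])
```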
Oh ok. That makes sense. Thanks for the idea!
I ran into the same error (the `inputs_embeds` dimension does not match `token_type_embeddings`) on the NYT dataset from the Google Drive link provided by the author.
By the way, I also hit a "list index out of range" error when processing the data at the line `self.tokenizer.decode(input_ids[t_e])`; I deleted all of the offending samples to get through preprocessing.
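(Rather than deleting samples, a bounds check before the decode call may be enough. The helper below is hypothetical, since the surrounding `dataset.py` code may differ.)

```python
def safe_decode(tokenizer, input_ids, t_e):
    """Return the decoded token at index t_e, or None when the entity
    index falls outside the (possibly truncated) sequence, so the
    sample can be skipped instead of raising IndexError."""
    if t_e >= len(input_ids):
        return None  # entity was cut off by truncation
    return tokenizer.decode(input_ids[t_e])
```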
@anushkasw @wtangdev Could you please tell me how to solve it?
Hello, I am trying to train the model on the NYT dataset. I am getting the following error:
File "/UniRel/model/modify_bert.py", line 311, in forward embeddings = inputs_embeds + token_type_embeddings RuntimeError: The size of tensor a (138) must match the size of tensor b (131) at non-singleton dimension 1
Can anybody help me out with why this might be happening?
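(The size gap here, 138 vs 131, fits the pattern diagnosed earlier in this thread: relation strings split into extra word pieces lengthen `inputs_embeds` relative to the manually built token type sequence. A couple of hypothetical debug lines just above the failing addition can confirm which side is off.)

```python
# Hypothetical debug lines above modify_bert.py line 311; the
# sequence lengths at dimension 1 must match for the addition.
print("inputs_embeds:", tuple(inputs_embeds.shape))                  # e.g. (batch, 138, hidden)
print("token_type_embeddings:", tuple(token_type_embeddings.shape))  # e.g. (batch, 131, hidden)
embeddings = inputs_embeds + token_type_embeddings
```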