Closed: attardi closed this issue 3 years ago
Sorry, it's my fault. The issue has been fixed in the dev branch:
batch_size, seq_len, fix_len = subwords.shape
mask = subwords.ne(self.pad_index)
lens = mask.sum((1, 2))
# [batch_size, n_subwords]
subwords = pad_sequence(subwords[mask].split(lens.tolist()), True)
bert_mask = pad_sequence(mask[mask].split(lens.tolist()), True)
if self.max_len and subwords.shape[1] > self.max_len:
    raise RuntimeError(f"Token indices sequence length is longer than the specified max length "
                       f"({subwords.shape[1]} > {self.max_len})")
# return the hidden states of all layers
bert = self.bert(subwords, attention_mask=bert_mask.float())[-1]
I don't think that raising an exception is a solution: the parser should parse all sentences. I can't run the IWPT benchmarks if it fails. We already discussed this issue, and I thought it could be solved in some other way (e.g. by splitting the long sentences).
With transformers 2.11.0 it works. The sentence that fails is 349 tokens long, and the tokenizers of both 2.11.0 and 3.1.0 produce the same subword sequence.
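The splitting idea could be sketched as follows; `split_long` and its chunking strategy are hypothetical illustrations, not the parser's actual code:

```python
def split_long(subword_ids, max_len=512):
    # Hypothetical helper: chunk a too-long subword sequence into pieces
    # of at most max_len, so each piece fits the model's position table.
    # Each chunk would be encoded separately and the states concatenated.
    return [subword_ids[i:i + max_len]
            for i in range(0, len(subword_ids), max_len)]
```

A sliding window with overlap between chunks would give better context at the boundaries; plain chunking is just the simplest variant.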
So the difference is in the code of modeling_bert.py, which in version 2.11.0 reads:
if position_ids is None:
    position_ids = torch.arange(seq_length, dtype=torch.long, device=device)
    position_ids = position_ids.unsqueeze(0).expand(input_shape)
Porting this code into version 3.1.0 makes it work, although I am not sure whether the resulting position embeddings are correct.
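The behavioural difference can be seen with a plain-Python stand-in (lists in place of tensors; 512 and 583 are the numbers reported in this issue):

```python
max_position_embeddings = 512   # size of the registered position_ids buffer in v3.1.0
seq_length = 583                # subword length of the failing sentence

# v3.1.0 style: slice a fixed buffer, which silently caps the result at 512 entries
buffer_ids = list(range(max_position_embeddings))
truncated = buffer_ids[:seq_length]

# v2.11.0 style: build the ids on the fly, so the length always matches seq_length
on_the_fly = list(range(seq_length))

print(len(truncated), len(on_the_fly))  # 512 583
```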
Actually max_len should be taken from the model config:
if subwords.shape[1] > self.bert.config.max_position_embeddings:
Yeah, it was recently added in v3.1.0.
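A sketch of that config-based check; `check_length` is a hypothetical helper, and any object exposing a `max_position_embeddings` attribute (such as the loaded model's config) would work:

```python
def check_length(n_subwords, config):
    # Hypothetical guard: compare against the model's own limit instead of
    # a hard-coded max_len, so the check always follows the loaded checkpoint.
    if n_subwords > config.max_position_embeddings:
        raise RuntimeError(
            f"Token indices sequence length is longer than the specified "
            f"max length ({n_subwords} > {config.max_position_embeddings})")
```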
@attardi Done, with the recipe borrowed from google-research/bert#27.
A problem arises with transformers 3.1.0 for certain languages (nl and ta).
In method BertEmbeddings.forward() from file:
position_ids is set as:
But it happens that seq_length = 583 while self.position_ids.shape = [1, 512], so position_ids gets truncated and then the following fails:
since inputs_embeds.shape = [1, 593].
This is due to the fact that self.position_embeddings is initialised with config.max_position_embeddings, which is set to 512.
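Modelled with plain lists as a stand-in for the tensors (583 and 512 are the values reported above), the failing addition in BertEmbeddings.forward() reduces to a length mismatch:

```python
seq_length = 583   # subwords in the failing sentence
max_pos = 512      # rows in the position embedding table

inputs_embeds = [0.0] * seq_length                      # one entry per subword
position_embeddings = [0.0] * min(seq_length, max_pos)  # capped by the table size

# The element-wise addition of the two embeddings cannot go through,
# because the lengths no longer agree:
shapes = (len(inputs_embeds), len(position_embeddings))
print(shapes)  # (583, 512)
```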