Closed: attardi closed this issue 3 years ago
Sorry, it's my fault. The issue has been fixed in the dev branch:
batch_size, seq_len, fix_len = subwords.shape
mask = subwords.ne(self.pad_index)
lens = mask.sum((1, 2))
# [batch_size, n_subwords]
subwords = pad_sequence(subwords[mask].split(lens.tolist()), True)
bert_mask = pad_sequence(mask[mask].split(lens.tolist()), True)
if self.max_len and subwords.shape[1] > self.max_len:
    raise RuntimeError(f"Token indices sequence length is longer than the specified max length "
                       f"({subwords.shape[1]} > {self.max_len})")
# return the hidden states of all layers
bert = self.bert(subwords, attention_mask=bert_mask.float())[-1]
I don't think that raising an exception is a solution: the parser should parse all sentences. I can't run the IWPT benchmarks if it fails. We already discussed this issue, and I thought it could be solved in some other way (e.g. by splitting the long sentences).
With transformers 2.11.0 it works. The sentence that fails is 349 tokens long, and the tokenizers of both 2.11.0 and 3.1.0 produce the same subword sequence.
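The splitting idea could be sketched as follows; `split_long` and its chunking strategy are hypothetical illustrations, not the parser's actual code:

```python
def split_long(subword_ids, max_len=512):
    # Hypothetical helper: chunk a too-long subword sequence into pieces
    # of at most max_len, so each piece fits the model's position table.
    # Each chunk would be encoded separately and the states concatenated.
    return [subword_ids[i:i + max_len]
            for i in range(0, len(subword_ids), max_len)]
```

A sliding window with overlap between chunks would give better context at the boundaries; plain chunking is just the simplest variant.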
So the difference is in the code of modeling_bert.py, which in version 2.11.0 reads:
if position_ids is None:
    position_ids = torch.arange(seq_length, dtype=torch.long, device=device)
    position_ids = position_ids.unsqueeze(0).expand(input_shape)
Porting this code into version 3.1.0 makes it work, although I am not sure whether the resulting position embeddings are correct.
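The behavioural difference can be seen with a plain-Python stand-in (lists in place of tensors; 512 and 583 are the numbers reported in this issue):

```python
max_position_embeddings = 512   # size of the registered position_ids buffer in v3.1.0
seq_length = 583                # subword length of the failing sentence

# v3.1.0 style: slice a fixed buffer, which silently caps the result at 512 entries
buffer_ids = list(range(max_position_embeddings))
truncated = buffer_ids[:seq_length]

# v2.11.0 style: build the ids on the fly, so the length always matches seq_length
on_the_fly = list(range(seq_length))

print(len(truncated), len(on_the_fly))  # 512 583
```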
Actually max_len should be taken from the model config:
if subwords.shape[1] > self.bert.config.max_position_embeddings:
Yeah, it was recently added in v3.1.0.
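A sketch of that config-based check; `check_length` is a hypothetical helper, and any object exposing a `max_position_embeddings` attribute (such as the loaded model's config) would work:

```python
def check_length(n_subwords, config):
    # Hypothetical guard: compare against the model's own limit instead of
    # a hard-coded max_len, so the check always follows the loaded checkpoint.
    if n_subwords > config.max_position_embeddings:
        raise RuntimeError(
            f"Token indices sequence length is longer than the specified "
            f"max length ({n_subwords} > {config.max_position_embeddings})")
```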
@attardi Done, with the recipe borrowed from google-research/bert#27.
A problem arises with transformers 3.1.0 for certain languages (nl and ta).
In method BertEmbeddings.forward() from file:
position_ids is set as:
But it happens that seq_length = 583 while self.position_ids.shape = [1, 512], so position_ids gets truncated and then the following fails:
since inputs_embeds.shape = [1, 593].
This is due to the fact that self.position_embeddings is initialised with config.max_position_embeddings, which is set to 512.
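Modelled with plain lists as a stand-in for the tensors (583 and 512 are the values reported above), the failing addition in BertEmbeddings.forward() reduces to a length mismatch:

```python
seq_length = 583   # subwords in the failing sentence
max_pos = 512      # rows in the position embedding table

inputs_embeds = [0.0] * seq_length                      # one entry per subword
position_embeddings = [0.0] * min(seq_length, max_pos)  # capped by the table size

# The element-wise addition of the two embeddings cannot go through,
# because the lengths no longer agree:
shapes = (len(inputs_embeds), len(position_embeddings))
print(shapes)  # (583, 512)
```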