yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/
MIT License

token #33

Closed attardi closed 4 years ago

attardi commented 4 years ago

We need to use a defaultdict in biaffine_parser.py

            if hasattr(tokenizer, 'vocab'):
                FEAT.vocab = tokenizer.vocab
            else:
                from collections import defaultdict
                FEAT.vocab = defaultdict(lambda: tokenizer.unk_token_id,
                                         {tokenizer._convert_id_to_token(i): i for i in range(len(tokenizer))})

because I ran into a token which is split like this: ['▁http', '://', 'www', '.', 'ib', 'rae', '.', 'ac', '.', 'ru', '/', 'ib', 'rae', '/', 'eng', '/', 'cher', 'no', 'by', 'l', '/', 'nat', '', 're', 'p', '/', 'nat', '', 're', 'pe', '.', 'ht', 'm', '#', '24']

and '▁' is not present in the vocabulary.
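
To illustrate the failure mode (a minimal sketch with made-up ids, not the real tokenizer vocab):

    from collections import defaultdict

    vocab = {'▁http': 1, 'www': 2}  # plain dict built from the tokenizer
    unk_token_id = 0

    # vocab['▁']  # plain dict: raises KeyError, '▁' was never assigned an id

    safe_vocab = defaultdict(lambda: unk_token_id, vocab)
    assert safe_vocab['▁'] == unk_token_id  # unknown pieces map to unk instead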

yzhangcs commented 4 years ago

Yes, you are right. I didn't run into this because the code has only been tested on BERT.

attardi commented 4 years ago

A defaultdict built with a lambda factory cannot be pickled, so a more elaborate solution is needed.
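
The defaultdict itself is picklable; it is the locally defined lambda factory that pickle rejects. A minimal reproduction, independent of the parser code:

    import pickle
    from collections import defaultdict

    # a defaultdict pickles fine when its factory is a module-level callable
    pickle.dumps(defaultdict(int))

    # but not when the factory is a lambda created inside a function
    def build():
        return defaultdict(lambda: 0)

    try:
        pickle.dumps(build())
    except AttributeError as e:
        print(e)  # Can't pickle local object 'build.<locals>.<lambda>'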

Define a nested class within BiaffineParser:

class FieldVocab(dict):
    """Surrogate for missing vocab in certain Transformers tokenizers."""
    def __init__(self, items, unk_token_id):
        super(BiaffineParser.FieldVocab, self).__init__(items)
        self.unk_token_id = unk_token_id

    def __getitem__(self, tok):
        return super(BiaffineParser.FieldVocab, self).get(tok, self.unk_token_id)

and then use it:

    if hasattr(tokenizer, 'vocab'):
        FEAT.vocab = tokenizer.vocab
    else:
        FEAT.vocab = BiaffineParser.FieldVocab(
            {tokenizer._convert_id_to_token(i): i
             for i in range(len(tokenizer))},
            tokenizer.unk_token_id)
yzhangcs commented 4 years ago

The utils.vocab.Vocab class works with pickle.

attardi commented 4 years ago

Yes, maybe because it is not local? I was getting this error

AttributeError: Can't pickle local object 'BiaffineParser.build.<locals>.defaultdict'

attardi commented 4 years ago

Maybe it is cleaner to add the class to utils/vocab.py:

class FieldVocab(dict):
    """Surrogate for missing vocab in certain Transformers tokenizers."""
    def __init__(self, unk_token_id, items):
        super(FieldVocab, self).__init__(items)
        self.unk_token_id = unk_token_id

    def __getitem__(self, tok):
        # unknown tokens fall back to unk_token_id instead of raising KeyError
        return super(FieldVocab, self).get(tok, self.unk_token_id)

and import it in biaffine_parser.py:

from supar.utils.vocab import FieldVocab

and use it like this:

                FEAT.vocab = FieldVocab(tokenizer.unk_token_id,
                                        {tokenizer._convert_id_to_token(i): i
                                         for i in range(len(tokenizer))})
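
A quick sanity check of the fallback behaviour (token ids are illustrative):

    import pickle

    vocab = FieldVocab(0, {'▁http': 1, 'www': 2})
    assert vocab['www'] == 2  # known piece: plain dict lookup
    assert vocab['▁'] == 0    # unknown piece: falls back to unk_token_id
    # unlike the defaultdict version, this survives a pickle round trip
    assert pickle.loads(pickle.dumps(vocab))['▁'] == 0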
yzhangcs commented 4 years ago

I think it may be more elegant to integrate it into Vocab.
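
Something like this, perhaps (a sketch of the idea, not the actual supar.utils.vocab.Vocab code):

    class Vocab(object):
        """Sketch: a token-to-index mapping with a built-in unk fallback."""

        def __init__(self, tokens, unk_index=0):
            self.itos = list(tokens)
            self.stoi = {tok: i for i, tok in enumerate(self.itos)}
            self.unk_index = unk_index

        def __getitem__(self, token):
            # unknown tokens map to unk_index instead of raising KeyError
            return self.stoi.get(token, self.unk_index)

        def __len__(self):
            return len(self.itos)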

attardi commented 4 years ago

OK.

BTW, there are problems with the use of the param n_bert_layer = 0.

It should be:

    # fall back to the pretrained config values when 0 is passed
    self.n_layers = n_layers if n_layers != 0 else self.bert.config.num_hidden_layers
    self.hidden_size = self.bert.config.hidden_size
    self.n_out = n_out if n_out != 0 else self.hidden_size
    self.pad_index = pad_index
    self.requires_grad = requires_grad

    # mix the hidden states of the selected layers into one representation
    self.scalar_mix = ScalarMix(self.n_layers, dropout)
    # project to n_out only when the sizes actually differ
    if self.hidden_size != self.n_out:
        self.projection = nn.Linear(self.hidden_size, self.n_out, False)

in BertEmbedding.__init__().

attardi commented 4 years ago

OMG.

In biaffine_parser.py, method build(), the test

    if args.bert.startswith('bert'):
        tokenizer.bos_token = tokenizer.cls_token
        tokenizer.eos_token = tokenizer.sep_token

does not work: there are many BERT models whose names do not start with 'bert', for example TurkuNLP/bert-base-finnish-cased-v1, dbmdz/bert-base-italian-xxl-cased, etc. Please revert to

    FEAT = SubwordField('bert',
                        pad=tokenizer.pad_token,
                        unk=tokenizer.unk_token,
                        bos=tokenizer.cls_token,  # bos_token is None in many transformers
                        fix_len=args.fix_len,
                        tokenize=tokenizer.tokenize)

Since WORD is created with bos set to the constant '<bos>', this caused the vectors of words and feats to have different lengths.

Alternatively you may define:

            bos_token = tokenizer.bos_token or tokenizer.cls_token  

and use that as bos=bos_token, without modifying the fields of the tokenizer object. I spent all afternoon chasing this bug.
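
Put together, the alternative would read (a sketch, assuming the same tokenizer and args as above):

    bos_token = tokenizer.bos_token or tokenizer.cls_token  # fall back to [CLS]
    eos_token = tokenizer.eos_token or tokenizer.sep_token  # fall back to [SEP]

    FEAT = SubwordField('bert',
                        pad=tokenizer.pad_token,
                        unk=tokenizer.unk_token,
                        bos=bos_token,
                        fix_len=args.fix_len,
                        tokenize=tokenizer.tokenize)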

yzhangcs commented 4 years ago

Please pull the update. The cls_token has been replaced.


yzhangcs commented 4 years ago

Hi, the bug is fixed now. I have tested it on BERT, XLNet and RoBERTa; maybe you can try it out. BTW, why did you add this line

args.feat_pad_index = FEAT.pad_index

in your pull request?