Yes, you are right. I didn't run into this since the code has only been tested on BERT.
A defaultdict whose default factory is a lambda cannot be pickled, therefore a more complex solution is needed.
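For instance, the obvious fix of a defaultdict with a lambda as default factory breaks as soon as the field is pickled (toy values, just to illustrate the failure):
import pickle
from collections import defaultdict

unk_token_id = 100                                      # toy value
vocab = defaultdict(lambda: unk_token_id, {'[PAD]': 0, '[UNK]': 100})
try:
    pickle.dumps(vocab)
except Exception as e:
    print(type(e).__name__, e)                          # the lambda cannot be pickled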
Define a nested class within BiaffineParser:
class FieldVocab(dict):
    """Surrogate for missing vocab in certain Transformers tokenizers."""
    def __init__(self, items, unk_token_id):
        super(BiaffineParser.FieldVocab, self).__init__(items)
        self.unk_token_id = unk_token_id
    def __getitem__(self, tok):
        return super(BiaffineParser.FieldVocab, self).get(tok, self.unk_token_id)
and then use it:
if hasattr(tokenizer, 'vocab'):
    FEAT.vocab = tokenizer.vocab
else:
    FEAT.vocab = BiaffineParser.FieldVocab(
        {tokenizer._convert_id_to_token(i): i
         for i in range(len(tokenizer))},
        tokenizer.unk_token_id)
The utils.vocab.Vocab class works with pickle.
Yes, maybe because it is not local? I was getting this error:
AttributeError: Can't pickle local object 'BiaffineParser.build.
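That error suggests the class was first defined inside build(): classes local to a function cannot be pickled, while classes defined at module level can. A minimal illustration:
import pickle

class TopLevel(dict):
    pass

def make_local():
    class Local(dict):
        pass
    return Local()

pickle.dumps(TopLevel())        # works
try:
    pickle.dumps(make_local())
except Exception as e:
    print(e)                    # Can't pickle local object 'make_local.<locals>.Local'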
Maybe it is cleaner to add the class to utils/vocab.py:
class FieldVocab(dict):
    """Surrogate for missing vocab in certain Transformers tokenizers."""
    def __init__(self, unk_token_id, items):
        super(FieldVocab, self).__init__(items)
        self.unk_token_id = unk_token_id
    def __getitem__(self, tok):
        return super(FieldVocab, self).get(tok, self.unk_token_id)
and import it in biaffine_parser.py:
from supar.utils.vocab import FieldVocab
and use it like this:
FEAT.vocab = FieldVocab(tokenizer.unk_token_id,
                        {tokenizer._convert_id_to_token(i): i
                         for i in range(len(tokenizer))})
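A quick sanity check, assuming the FieldVocab above has been added to supar/utils/vocab.py: unknown pieces fall back to unk_token_id, and since the class is defined at module level it also survives a pickle round trip (toy ids below):
import pickle
from supar.utils.vocab import FieldVocab

vocab = FieldVocab(3, {'▁http': 17, '://': 42})
assert vocab['▁http'] == 17
assert vocab['never-seen-piece'] == 3                  # unk_token_id fallback
assert pickle.loads(pickle.dumps(vocab))['://'] == 42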
I think it may be more elegant to integrate it into Vocab.
OK.
BTW, there are problems with the use of the param n_bert_layer = 0.
It should be:
self.n_layers = n_layers if n_layers != 0 else self.bert.config.num_hidden_layers
self.hidden_size = self.bert.config.hidden_size
self.n_out = n_out if n_out != 0 else self.hidden_size
self.pad_index = pad_index
self.requires_grad = requires_grad
self.scalar_mix = ScalarMix(self.n_layers, dropout)
if self.hidden_size != self.n_out:
    self.projection = nn.Linear(self.hidden_size, self.n_out, False)
in BertEmbedding.__init__().
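For reference, this is where the defaults come from when the params are 0 (a minimal sketch, model name assumed):
from transformers import AutoConfig

config = AutoConfig.from_pretrained('bert-base-cased')  # any model name works here
print(config.num_hidden_layers)                         # e.g. 12, used when n_layers == 0
print(config.hidden_size)                               # e.g. 768, used when n_out == 0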
OMG.
In biaffine_parser.py, method build(), the test
if args.bert.startswith('bert'):
    tokenizer.bos_token = tokenizer.cls_token
    tokenizer.eos_token = tokenizer.sep_token
does not work. There are many BERT models whose names do not start with 'bert', for example TurkuNLP/bert-base-finnish-cased-v1, dbmdz/bert-base-italian-xxl-case, etc. Please revert to:
FEAT = SubwordField('bert',
                    pad=tokenizer.pad_token,
                    unk=tokenizer.unk_token,
                    bos=tokenizer.cls_token,  # bos_token is None in many transformers
                    fix_len=args.fix_len,
                    tokenize=tokenizer.tokenize)
Since WORD is created with bos set to the constant '<bos>', it caused the vectors of words and feats to have different lengths.
Alternatively you may define:
bos_token = tokenizer.bos_token or tokenizer.cls_token
and use that in bos=bos_token, without touching the fields of the tokenizer object. I spent all afternoon chasing this bug.
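In context, the alternative would look like this (a sketch, not a standalone script: tokenizer and args are assumed to be in scope inside build(), as in the snippet above):
bos_token = tokenizer.bos_token or tokenizer.cls_token  # fall back to cls_token when bos_token is None
FEAT = SubwordField('bert',
                    pad=tokenizer.pad_token,
                    unk=tokenizer.unk_token,
                    bos=bos_token,
                    fix_len=args.fix_len,
                    tokenize=tokenizer.tokenize)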
Please pull the update. The cls_token has been replaced.
Hi, the bug is fixed right now. I have tested it on BERT, XLNet and RoBERTa. Maybe you can try it out.
BTW, why did you add this line
args.feat_pad_index = FEAT.pad_index
in your pull request?
We need to use a defaultdict in biaffine_parser.py
because I ran into a token which is split like this:
['▁http', '://', 'www', '.', 'ib', 'rae', '.', 'ac', '.', 'ru', '/', 'ib', 'rae', '/', 'eng', '/', 'cher', 'no', 'by', 'l', '/', 'nat', '_', 're', 'p', '/', 'nat', '_', 're', 'pe', '.', 'ht', 'm', '#', '24']
and '_' is not present in the vocabulary.
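If a defaultdict is used for this, its default factory has to be picklable too. A sketch of one way to do it with a module-level function instead of a lambda (toy ids; not necessarily what biaffine_parser.py ends up doing):
import pickle
from collections import defaultdict
from functools import partial

def _return_unk(unk_token_id):
    return unk_token_id

vocab = defaultdict(partial(_return_unk, 3), {'▁http': 17, '://': 42})
print(vocab['_'])     # 3, the unk id; note the lookup also inserts the key
pickle.dumps(vocab)   # works, unlike a lambda-based default factory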