yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/
MIT License

Doubts about the input format when training an SDP model #85

Closed MinionAttack closed 2 years ago

MinionAttack commented 2 years ago

Hi,

I'm trying to train an SDP model, following the Usage section of the README.md:

>>> sdp = Parser.load('biaffine-sdp-en')
>>> sdp.predict([[('I','I','PRP'), ('saw','see','VBD'), ('Sarah','Sarah','NNP'), ('with','with','IN'),
                  ('a','a','DT'), ('telescope','telescope','NN'), ('.','_','.')]],
                verbose=False)[0]
1       I       I       PRP     _       _       _       _       2:ARG1  _
2       saw     see     VBD     _       _       _       _       0:root|4:ARG1   _
3       Sarah   Sarah   NNP     _       _       _       _       2:ARG2  _
4       with    with    IN      _       _       _       _       _       _
5       a       a       DT      _       _       _       _       _       _
6       telescope       telescope       NN      _       _       _       _       4:ARG2|5:BV     _
7       .       _       .       _       _       _       _       _       _

The 9th column is 0:root|4:ARG1. I'm using the UD CoNLL-U files for English (EWT), where the 9th column looks like 21:nmod:near, so when I try to train an SDP model I get this error:

  File "/home/iago/SuPar/supar/utils/field.py", line 359, in <genexpr>
    for row in self.preprocess(chart)
  File "/home/iago/SuPar/supar/utils/field.py", line 171, in preprocess
    sequence = self.fn(sequence)
  File "/home/iago/SuPar/supar/utils/transform.py", line 177, in get_labels
    edge, label = pair.split(':')
ValueError: too many values to unpack (expected 2)

I think this happens because the parser expects something like 0:root|4:ARG1 rather than 21:nmod:near. Does SuPar have a function to transform the UD CoNLL-U files into that format? I'm training the model from the command line, not through code, with:

python -m supar.cmds.biaffine_sdp train --build --device 0 --conf config/biaffine.sdp.ini \
    --n-embed 300 --encoder bert --unk '' \
    --embed data/Embeddings/English/cc.en.300.vec \
    --train data/Corpus/English-EWT/en_ewt-ud-train.conllu \
    --dev data/Corpus/English-EWT/en_ewt-ud-dev.conllu \
    --test data/Corpus/English-EWT/en_ewt-ud-test.conllu \
    --path models/English-EWT/Model_1
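
For illustration, here is a minimal sketch of the parsing behaviour I would expect: each head:label pair in the DEPS column is split only on the first colon, so enhanced labels such as nmod:near stay intact. The helper name parse_deps is hypothetical and not part of SuPar.

def parse_deps(deps):
    """Parse a CoNLL-U DEPS value like '0:root|4:ARG1' or '21:nmod:near'
    into (head, label) pairs, splitting only on the first colon so that
    labels containing ':' are preserved."""
    if deps in ('_', ''):
        return []
    pairs = []
    for pair in deps.split('|'):
        head, label = pair.split(':', 1)  # maxsplit=1 keeps 'nmod:near' whole
        pairs.append((head, label))
    return pairs

print(parse_deps('21:nmod:near'))   # [('21', 'nmod:near')]
print(parse_deps('0:root|4:ARG1'))  # [('0', 'root'), ('4', 'ARG1')]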

Regards.

yzhangcs commented 2 years ago

@MinionAttack Hi, thank you for reporting this bug. I've fixed it in the latest commit, please check it out.

MinionAttack commented 2 years ago

Hi @yzhangcs, there is still a bug.

For sentences like this:

# sent_id = weblog-blogspot.com_healingiraq_20040409053012_ENG_20040409_053012-0022
# text = Over 300 Iraqis are reported dead and 500 wounded in Fallujah alone.
1   Over    over    ADV RB  _   2   advmod  2:advmod    _
2   300 300 NUM CD  NumType=Card    3   nummod  3:nummod    _
3   Iraqis  Iraqi   PROPN   NNPS    Number=Plur 5   nsubj:pass  5:nsubj:pass|6:nsubj:xsubj|8:nsubj:pass _
4   are be  AUX VBP Mood=Ind|Tense=Pres|VerbForm=Fin    5   aux:pass    5:aux:pass  _
5   reported    report  VERB    VBN Tense=Past|VerbForm=Part|Voice=Pass 0   root    0:root  _
6   dead    dead    ADJ JJ  Degree=Pos  5   xcomp   5:xcomp _
7   and and CCONJ   CC  _   8   cc  8:cc|8.1:cc _
8   500 500 NUM CD  NumType=Card    5   conj    5:conj:and|8.1:nsubj:pass|9:nsubj:xsubj _
8.1 reported    report  VERB    VBN Tense=Past|VerbForm=Part|Voice=Pass _   _   5:conj:and  CopyOf=5
9   wounded wounded ADJ JJ  Degree=Pos  8   orphan  8.1:xcomp   _
10  in  in  ADP IN  _   11  case    11:case _
11  Fallujah    Fallujah    PROPN   NNP Number=Sing 5   obl 5:obl:in    _
12  alone   alone   ADV RB  _   11  advmod  11:advmod   SpaceAfter=No
13  .   .   PUNCT   .   _   5   punct   5:punct _

When you cast to int() here:

for pair in s.split('|'):
    edge, label = pair.split(':', 1)
    labels[i][int(edge)] = label

It throws a ValueError: invalid literal for int() with base 10: '8.1', because the pair is 8.1:cc and the edge index 8.1 is an empty-node ID, not an integer.
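
As a minimal sketch of one possible guard (only an illustration, not the fix actually applied in SuPar), pairs whose head is an empty node, i.e. a decimal ID such as 8.1, could be skipped before the int() cast:

labels = {}
for pair in '8:cc|8.1:cc'.split('|'):
    edge, label = pair.split(':', 1)
    if '.' in edge:       # empty node (e.g. '8.1'): not a regular token index
        continue          # skip it instead of calling int() on it
    labels[int(edge)] = label
print(labels)             # {8: 'cc'}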

Regards.

EDIT: I'm trying with the sentence provided in the README.md (using the same file for train, dev and test):

1       I       I       PRP     _       _       _       _       2:ARG1  _
2       saw     see     VBD     _       _       _       _       0:root|4:ARG1   _
3       Sarah   Sarah   NNP     _       _       _       _       2:ARG2  _
4       with    with    IN      _       _       _       _       _       _
5       a       a       DT      _       _       _       _       _       _
6       telescope       telescope       NN      _       _       _       _       4:ARG2|5:BV     _
7       .       _       .       _       _       _       _       _       _

And I get:


2021-10-25 11:23:06 INFO Loading the data
Traceback (most recent call last):
  File "/home/iago/.local/share/JetBrains/IntelliJIdea2021.2/python/helpers/pydev/pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/iago/.local/share/JetBrains/IntelliJIdea2021.2/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/iago/Escritorio/SuPar-SharedTask/supar/cmds/biaffine_sdp.py", line 43, in <module>
    main()
  File "/home/iago/Escritorio/SuPar-SharedTask/supar/cmds/biaffine_sdp.py", line 39, in main
    parse(parser)
  File "/home/iago/Escritorio/SuPar-SharedTask/supar/cmds/cmd.py", line 29, in parse
    parser.train(**args)
  File "/home/iago/Escritorio/SuPar-SharedTask/supar/parsers/sdp.py", line 52, in train
    return super().train(**Config().update(locals()))
  File "/home/iago/Escritorio/SuPar-SharedTask/supar/parsers/parser.py", line 41, in train
    train = Dataset(self.transform, args.train, **args).build(batch_size, buckets, True, dist.is_initialized())
  File "/home/iago/Escritorio/SuPar-SharedTask/supar/utils/data.py", line 81, in build
    self.buckets = dict(zip(*kmeans([len(s.transformed[fields[0].name]) for s in self], n_buckets)))
  File "/home/iago/Escritorio/SuPar-SharedTask/supar/utils/alg.py", line 45, in kmeans
    dists, y = torch.abs_(x.unsqueeze(-1) - c).min(-1)
IndexError: min(): Expected reduction dim 1 to have non-zero size.

EDIT 2:

I've got access to a SemEval 2015 English corpus, and after converting the SDP file to CoNLL-U with semstr, I can train a model.

yzhangcs commented 2 years ago

@MinionAttack Hi, I have encountered this problem before. I would suggest simply deleting those lines (the empty nodes with decimal IDs such as 8.1), as keeping them may displace the arc annotations.
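
A minimal sketch of that preprocessing, assuming a tab-separated plain-text CoNLL-U file; the function and file names are placeholders. Note that references to removed IDs can remain in other tokens' DEPS fields (e.g. 8.1:cc on token 7 in the sentence quoted earlier) and may need separate handling.

def strip_empty_nodes(src, dst):
    """Copy a CoNLL-U file, dropping empty-node lines (token IDs like '8.1')."""
    with open(src, encoding='utf-8') as fin, open(dst, 'w', encoding='utf-8') as fout:
        for line in fin:
            if line.strip() and not line.startswith('#'):
                token_id = line.split('\t', 1)[0]
                if '.' in token_id:  # empty node, e.g. 8.1
                    continue
            fout.write(line)

strip_empty_nodes('en_ewt-ud-train.conllu', 'en_ewt-ud-train.clean.conllu')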

MinionAttack commented 2 years ago

Hi @yzhangcs, I didn't know I could remove those lines without causing problems. Thanks for the tip.

I'm doing some tests with a corpus and I'm running into an error similar to the one described in EDIT 1 of my previous comment:

2021-10-25 15:38:41 INFO Epoch 1 / 10:
Traceback (most recent call last):
  File "/home/iago/Escritorio/SuPar-SharedTask/supar/cmds/biaffine_sdp.py", line 43, in <module>
    main()
  File "/home/iago/Escritorio/SuPar-SharedTask/supar/cmds/biaffine_sdp.py", line 39, in main
    parse(parser)
  File "/home/iago/Escritorio/SuPar-SharedTask/supar/cmds/cmd.py", line 29, in parse
    parser.train(**args)
  File "/home/iago/Escritorio/SuPar-SharedTask/supar/parsers/sdp.py", line 52, in train
    return super().train(**Config().update(locals()))
  File "/home/iago/Escritorio/SuPar-SharedTask/supar/parsers/parser.py", line 74, in train
    self._train(train.loader)
  File "/home/iago/Escritorio/SuPar-SharedTask/supar/parsers/sdp.py", line 151, in _train
    label_preds = self.model.decode(s_edge, s_label)
  File "/home/iago/Escritorio/SuPar-SharedTask/supar/models/sdp.py", line 220, in decode
    return s_label.argmax(-1).masked_fill_(s_edge.argmax(-1).lt(1), -1)
IndexError: argmax(): Expected reduction dim -1 to have non-zero size.

The train and dev files I am using contain a mix of sentences that have semantic parents annotated and sentences with none at all (their annotation column is entirely '_'), for example:

# sent_id = University_of_Phoenix_Online_151_08-04-2005-7
# text = The team is as good as you make it .
1   The the DET _   _   2   det _   _   _
2   team    team    NOUN    _   _   5   nsubj   _   _   9:targ
3   is  be  AUX _   _   5   cop _   _   _
4   as  as  ADV _   _   5   advmod  _   _   9:exp-Neutral
5   good    good    ADJ _   _   0   root    _   _   9:exp-Neutral
6   as  as  SCONJ   _   _   8   mark    _   _   9:exp-Neutral
7   you you PRON    _   _   8   nsubj   _   _   9:exp-Neutral
8   make    make    VERB    _   _   5   advcl   _   _   9:exp-Neutral
9   it  it  PRON    _   _   8   obj _   _   0:exp-Neutral
10  .   .   PUNCT   _   _   5   punct   _   _   _
# sent_id = University_of_Phoenix_Online_151_08-04-2005-8
# text = I have been in both good and bad .
1   I   I   PRON    _   _   6   nsubj   _   _   _
2   have    have    AUX _   _   6   aux _   _   _
3   been    be  AUX _   _   6   cop _   _   _
4   in  in  ADP _   _   6   case    _   _   _
5   both    both    CCONJ   _   _   6   cc:preconj  _   _   _
6   good    good    ADJ _   _   0   root    _   _   _
7   and and CCONJ   _   _   8   cc  _   _   _
8   bad bad ADJ _   _   6   conj    _   _   _
9   .   .   PUNCT   _   _   6   punct   _   _   _

Could that be the problem?
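
For reference, a minimal sketch of how such sentences could be counted, assuming a tab-separated CoNLL-U file and that the semantic edges live in the last column, as in the excerpt above; the function name and the default column index are assumptions about the file format:

def count_edgeless_sentences(path, deps_col=-1):
    """Count sentences whose chosen column (default: the last one) is '_'
    for every token, i.e. sentences with no semantic edges at all."""
    edgeless = total = 0
    sentence = []
    with open(path, encoding='utf-8') as f:
        for line in list(f) + ['']:            # trailing '' flushes the last sentence
            line = line.rstrip('\n')
            if not line:                       # blank line ends a sentence
                if sentence:
                    total += 1
                    if all(cols[deps_col] == '_' for cols in sentence):
                        edgeless += 1
                sentence = []
            elif not line.startswith('#'):
                sentence.append(line.split('\t'))
    return edgeless, total

print(count_edgeless_sentences('train.conllu'))  # e.g. (1, 2) for the two sentences above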

yzhangcs commented 2 years ago

@MinionAttack It's OK when testing.

MinionAttack commented 2 years ago

I am closing the issue because I think the errors I am getting are due to a misunderstanding on my part and are not related to the initial question.