yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/
MIT License

Unable to train an SDP model with some UD Treebanks #90

Closed MinionAttack closed 2 years ago

MinionAttack commented 2 years ago

Hi, I'm trying to train some SDP models but I'm facing an issue.

If I try to train an SDP model with Basque-BDT or Norwegian-Bokmaal I get:

2022-01-21 17:24:00 INFO Epoch 1 / 200:
Traceback (most recent call last):           
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/iago/SuPar_Pre-finetuning/supar/cmds/biaffine_sdp.py", line 43, in <module>
    main()
  File "/home/iago/SuPar_Pre-finetuning/supar/cmds/biaffine_sdp.py", line 39, in main
    parse(parser)
  File "/home/iago/SuPar_Pre-finetuning/supar/cmds/cmd.py", line 29, in parse
    parser.train(**args)
  File "/home/iago/SuPar_Pre-finetuning/supar/parsers/sdp.py", line 52, in train
    return super().train(**Config().update(locals()))
  File "/home/iago/SuPar_Pre-finetuning/supar/parsers/parser.py", line 74, in train
    self._train(train.loader)
  File "/home/iago/SuPar_Pre-finetuning/supar/parsers/sdp.py", line 151, in _train
    label_preds = self.model.decode(s_edge, s_label)
  File "/home/iago/SuPar_Pre-finetuning/supar/models/sdp.py", line 220, in decode
    return s_label.argmax(-1).masked_fill_(s_edge.argmax(-1).lt(1), -1)
IndexError: argmax(): Expected reduction dim 3 to have non-zero size.

But if I train an SDP model with Catalan-AnCora or English-EWT it works fine. (For English-EWT I had to remove by hand all the sentences containing tokens with XX.1 IDs.)
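For anyone who would rather not do that removal by hand, here is a minimal sketch of how it could be scripted. It assumes a tab-separated CoNLL-U file with sentences separated by blank lines, and it treats any token ID containing a dot (such as 8.1) as one of the XX.1 rows. The function name is made up for illustration and is not part of SuPar.

```python
def drop_sentences_with_empty_nodes(text):
    """Return CoNLL-U text without the sentences that contain IDs like 8.1."""
    kept = []
    for block in text.strip("\n").split("\n\n"):
        if not block.strip():
            continue
        # A sentence is dropped if any token row has a dot in its ID column.
        has_empty_node = any(
            "." in line.split("\t", 1)[0]
            for line in block.split("\n")
            if line and not line.startswith("#")
        )
        if not has_empty_node:
            kept.append(block)
    # CoNLL-U sentences are each terminated by a blank line.
    return "\n\n".join(kept) + "\n\n" if kept else ""
```

Dropping whole sentences mirrors what was done by hand here; removing only the XX.1 rows would also require rewriting any references to them in the last-but-one column.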

I use a command like this:

python -m supar.cmds.biaffine_sdp train --build --device 0 --conf config/Basque/xlm-roberta-large.ini --encoder bert --bert xlm-roberta-large --unk '' --train data/Corpus/Universal_Dependencies/Basque/BDT/train.conllu --dev data/Corpus/Universal_Dependencies/Basque/BDT/dev.conllu --test data/Corpus/Universal_Dependencies/Basque/BDT/test.conllu --path models/Universal_Dependencies/Basque/BDT/Model_xlm-roberta-large_1_baseline

Why is this error occurring in some languages and not in others? I'm using the latest version from master.

Regards.

yzhangcs commented 2 years ago

@MinionAttack Hi, please also make sure you have preprocessed the treebank to make it conform to Wang's format:

#20001001
1   Pierre  Pierre  _   NNP _   2   nn  _   _
2   Vinken  _generic_proper_ne_ _   NNP _   9   nsubj   1:compound|6:ARG1|9:ARG1    _
3   ,   _   _   ,   _   2   punct   _   _
4   61  _generic_card_ne_   _   CD  _   5   num _   _
5   years   year    _   NNS _   6   npadvmod    4:ARG1  _
6   old old _   JJ  _   2   amod    5:measure   _
7   ,   _   _   ,   _   2   punct   _   _
8   will    will    _   MD  _   9   aux _   _
9   join    join    _   VB  _   0   root    0:root|12:ARG1|17:loc   _
10  the the _   DT  _   11  det _   _
11  board   board   _   NN  _   9   dobj    9:ARG2|10:BV    _
12  as  as  _   IN  _   9   prep    _   _
13  a   a   _   DT  _   15  det _   _
14  nonexecutive    _generic_jj_    _   JJ  _   15  amod    _   _
15  director    director    _   NN  _   12  pobj    12:ARG2|13:BV|14:ARG1   _
16  Nov.    Nov.    _   NNP _   9   tmod    _   _
17  29  _generic_dom_card_ne_   _   CD  _   16  num 16:of   _
18  .   _   _   .   _   9   punct   _   _

IndexError: argmax(): Expected reduction dim 3 to have non-zero size.

Does this mean the labels were not handled properly? You might want to check how the ROLE field and the Biaffine layers are initialized.
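One way to check this before training is to count the labels that actually appear in the last-but-one column of the training file. The sketch below assumes tab-separated, 10-column rows with `head:label` pairs joined by `|`, as in the example above; `count_deps_labels` is a hypothetical helper, not a SuPar function. If it returns an empty counter, the label vocabulary is probably empty as well, which would be consistent with the zero-size dimension in the error, although that link is an assumption rather than something confirmed in this thread.

```python
from collections import Counter


def count_deps_labels(lines):
    """Count the labels found in column index 8 of CoNLL-style rows."""
    labels = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or cols[8] == "_":
            continue
        # Each entry is `head:label`; the label itself may contain colons.
        for pair in cols[8].split("|"):
            _, _, label = pair.partition(":")
            labels[label] += 1
    return labels


# Two rows taken from the example above, with tabs restored.
sample = [
    "1\tPierre\tPierre\t_\tNNP\t_\t2\tnn\t_\t_",
    "2\tVinken\t_generic_proper_ne_\t_\tNNP\t_\t9\tnsubj\t1:compound|6:ARG1|9:ARG1\t_",
]
print(count_deps_labels(sample))
```

To run it on a real treebank, pass an open file handle instead of the sample list.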

MinionAttack commented 2 years ago

Hi, the problem was that some UD treebanks don't fill in the EUD (Enhanced UD) column, DEPS (column 8 counting from zero, i.e. the 9th CoNLL-U column), so training failed on those languages.

In case someone reads this in the future: the solution is to replace the content of column 8 (DEPS) with the contents of columns 6 (HEAD) and 7 (DEPREL), counting from zero, joined by a colon (e.g. 9:nsubj).
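A minimal sketch of that conversion, assuming tab-separated CoNLL-U input and zero-based column indices (6 = HEAD, 7 = DEPREL, 8 = DEPS). The function name `fill_deps` and the toy rows are made up for illustration. Unlike the description above, this version only fills DEPS where it is empty (`_`), so existing enhanced dependencies are kept; that is a design choice of the sketch, not something SuPar requires.

```python
def fill_deps(lines):
    """Yield CoNLL-U lines with an empty DEPS column set to HEAD:DEPREL."""
    for line in lines:
        stripped = line.rstrip("\n")
        cols = stripped.split("\t")
        # Leave comments, blank lines and malformed rows untouched.
        if not stripped or stripped.startswith("#") or len(cols) != 10:
            yield stripped
            continue
        # Multiword-token ranges (1-2) and empty nodes (8.1) carry no
        # basic HEAD/DEPREL, so they are passed through unchanged.
        if "-" in cols[0] or "." in cols[0]:
            yield stripped
            continue
        if cols[8] == "_" and cols[6] != "_":
            cols[8] = f"{cols[6]}:{cols[7]}"
        yield "\t".join(cols)


# Toy rows (not taken from any treebank).
rows = [
    "# text = w1 w2",
    "1\tw1\tw1\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tw2\tw2\tVERB\t_\t_\t0\troot\t_\t_",
    "",
]
converted = list(fill_deps(rows))
```

For a file on disk, the same generator can be fed an open file handle and its output written to a new .conllu file, one line at a time.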