yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/
MIT License

Is the Off-the-shelf Semantic Dependency Parser available? #61

Closed Fantabulous-J closed 3 years ago

Fantabulous-J commented 3 years ago

Hi, thanks for sharing such amazing work. May I ask if the off-the-shelf semantic dependency parser is available? If it's not, how can I train one from scratch myself?

yzhangcs commented 3 years ago

@Fantabulous-J Sorry, not available yet. Here is an example command for training the model from scratch.

nohup python -u -m supar.cmds.biaffine_semantic_dependency train -b -d 0 -p model > model.train.log.verbose 2>&1 &
Fantabulous-J commented 3 years ago

Thanks for your reply. Is the model trained on the SemEval 2015 Task 18 English datasets?

yzhangcs commented 3 years ago

@Fantabulous-J Yeah, the same dataset as used in Dozat et al. 2018 and Wang et al. 2019.

Fantabulous-J commented 3 years ago

Could you please tell me how to get the training data? I can't find any information about the datasets on the official website, and I'm not sure whether they are publicly accessible.

yzhangcs commented 3 years ago

@Fantabulous-J My data comes from LDC. Not sure if it is publicly available on the Internet.

Fantabulous-J commented 3 years ago

@yzhangcs I have got the dataset from LDC. Is the file "en.dm.sdp" the one used for training the model? How should I convert it to train.conllu etc.?

yzhangcs commented 3 years ago

@Fantabulous-J This is a toy example:

#20001001
1   Pierre  Pierre  _   NNP _   2   nn  _   _
2   Vinken  _generic_proper_ne_ _   NNP _   9   nsubj   1:compound|6:ARG1|9:ARG1    _
3   ,   _   _   ,   _   2   punct   _   _
4   61  _generic_card_ne_   _   CD  _   5   num _   _
5   years   year    _   NNS _   6   npadvmod    4:ARG1  _
6   old old _   JJ  _   2   amod    5:measure   _
7   ,   _   _   ,   _   2   punct   _   _
8   will    will    _   MD  _   9   aux _   _
9   join    join    _   VB  _   0   root    0:root|12:ARG1|17:loc   _
10  the the _   DT  _   11  det _   _
11  board   board   _   NN  _   9   dobj    9:ARG2|10:BV    _
12  as  as  _   IN  _   9   prep    _   _
13  a   a   _   DT  _   15  det _   _
14  nonexecutive    _generic_jj_    _   JJ  _   15  amod    _   _
15  director    director    _   NN  _   12  pobj    12:ARG2|13:BV|14:ARG1   _
16  Nov.    Nov.    _   NNP _   9   tmod    _   _
17  29  _generic_dom_card_ne_   _   CD  _   16  num 16:of   _
18  .   _   _   .   _   9   punct   _   _

#20001002
1   Mr. Mr. _   NNP _   2   nn  _   _
2   Vinken  _generic_proper_ne_ _   NNP _   4   nsubj   1:compound|3:ARG1   _
3   is  is  _   VBZ _   4   cop 0:root  _
4   chairman    chairman    _   NN  _   0   root    3:ARG2|5:ARG1   _
5   of  of  _   IN  _   4   prep    _   _
6   Elsevier    _generic_proper_ne_ _   NNP _   7   nn  5:ARG2|7:compound|12:appos  _
7   N.V.    N.V.    _   NNP _   5   pobj    _   _
8   ,   _   _   ,   _   7   punct   _   _
9   the the _   DT  _   12  det _   _
10  Dutch   Dutch   _   JJ  _   12  nn  _   _
11  publishing  publish _   NN  _   12  amod    _   _
12  group   group   _   NN  _   7   appos   9:BV|10:ARG1|11:compound    _
13  .   _   _   .   _   4   punct   _   _

You should merge lines with the same ID into a single one, in which the answers are separated by '|'.
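For illustration only, a minimal Python sketch of that merging step (not the repo's actual preprocessing script). It assumes a tab-separated, 10-column intermediate file in which a token with several semantic arcs appears on several lines, one head:label pair per line in the 9th column (DEPS), with all other columns identical; the file names and column index are assumptions.

# Hypothetical sketch: merge lines that share a token ID, joining the DEPS
# column (index 8 in a 10-column CoNLL-style line) with '|'.
def merge_arcs(lines):
    merged, order = {}, []
    for line in lines:
        cols = line.rstrip('\n').split('\t')
        tid = cols[0]
        if tid not in merged:
            merged[tid] = cols
            order.append(tid)
        elif cols[8] != '_':
            # append this head:label pair to the arcs already collected
            if merged[tid][8] == '_':
                merged[tid][8] = cols[8]
            else:
                merged[tid][8] += '|' + cols[8]
    return ['\t'.join(merged[tid]) for tid in order]

def convert(src, tgt):
    # process one sentence at a time; sentences are separated by blank lines,
    # comment lines (e.g. "#20001001") are copied through unchanged
    with open(src) as f, open(tgt, 'w') as out:
        sent = []
        for line in f:
            if line.strip() and not line.startswith('#'):
                sent.append(line)
            else:
                if sent:
                    out.write('\n'.join(merge_arcs(sent)) + '\n')
                    sent = []
                out.write(line)
        if sent:
            out.write('\n'.join(merge_arcs(sent)) + '\n')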

Fantabulous-J commented 3 years ago

@yzhangcs Thanks for answering my question. Do you have any recommended preprocessing code? Did you use the code in https://github.com/tdozat/Parser-v3?

yzhangcs commented 3 years ago

@Fantabulous-J The scripts can be found here.