yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/
MIT License

Which UD treebanks were used for training? #104

Closed zerogerc closed 2 years ago

zerogerc commented 2 years ago

Hi, I've read the following in the README:

The multilingual dependency parsing model, named biaffine-dep-xlmr, is trained by finetuning [xlm-roberta-large](https://huggingface.co/xlm-roberta-large) on 12 merged treebanks selected from the Universal Dependencies (UD) v2.3 dataset. The following table lists the results for each treebank. Languages are represented by [ISO 639-1 Language Codes](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).

How could I know which treebanks were used for training and evaluation?

yzhangcs commented 2 years ago

@zerogerc Hi, the training data is a combination of the training sets for 12 languages selected from UD v2.3. The model is then evaluated on their respective dev/test sets.
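For illustration, merging per-language UD training sets into one file amounts to concatenating their CoNLL-U files. This is only a hedged sketch of that step: the `merge_conllu` helper and the treebank filenames below are hypothetical, not the code or the exact treebanks used by this repo.

```python
# Sketch: combine several UD training files into a single training set.
# The helper and the demo treebank names are illustrative only.
import tempfile
from pathlib import Path

def merge_conllu(train_files, out_path):
    """Concatenate CoNLL-U files; sentences stay separated by blank lines."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in train_files:
            out.write(Path(path).read_text(encoding="utf-8").strip() + "\n\n")

# Tiny demo with fake one-token sentences (real files come from UD v2.3).
tmp = Path(tempfile.mkdtemp())
(tmp / "en_ewt-ud-train.conllu").write_text("1\thello\t_\t_\t_\t_\t0\troot\t_\t_\n")
(tmp / "de_gsd-ud-train.conllu").write_text("1\thallo\t_\t_\t_\t_\t0\troot\t_\t_\n")
merge_conllu(sorted(tmp.glob("*-ud-train.conllu")), tmp / "merged-train.conllu")
merged = (tmp / "merged-train.conllu").read_text()
```

Evaluation would still use each treebank's own dev/test split, so only the train files are concatenated here.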

zerogerc commented 2 years ago

@yzhangcs Hi, but there are several treebanks per language, like EWT, GUM, or ATIS for English.

yzhangcs commented 2 years ago

@zerogerc I just followed this paper for data preprocessing, and here is their released data. Note that the UD v2.2 treebanks were adopted by the CoNLL 2018 shared task; here I use UD v2.3 instead.

zerogerc commented 2 years ago

I see, thanks for the help!