yhcc / BARTNER


Error loading the datasets #2

Open Dimiftb opened 3 years ago

Dimiftb commented 3 years ago

Hi,

Thank you very much for your paper and your models. I'm attempting to replicate the experimental results in your paper on conll2003 and en-ontonotes. I'm currently faced with an error for both datasets, which I'm not sure how to solve. You can see the output of running `python train.py` below:

```
2021-07-08 14:43:47.895031: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "BARTNER/train.py", line 131, in <module>
    data_bundle, tokenizer, mapping2id = get_data()
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/core/utils.py", line 357, in wrapper
    results = func(*args, **kwargs)
  File "BARTNER/train.py", line 123, in get_data
    data_bundle = pipe.process_from_file(paths, demo=demo)
  File "/content/BARTNER/data/pipe.py", line 206, in process_from_file
    data_bundle = Conll2003NERLoader(demo=demo).load(paths)
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/io/loader/loader.py", line 69, in load
    datasets = {name: self._load(path) for name, path in paths.items()}
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/io/loader/loader.py", line 69, in <dictcomp>
    datasets = {name: self._load(path) for name, path in paths.items()}
  File "/content/BARTNER/data/pipe.py", line 271, in _load
    target = iob2(ins['target'])
  File "/usr/local/lib/python3.7/dist-packages/fastNLP/io/pipe/utils.py", line 30, in iob2
    raise TypeError("The encoding schema is not a valid IOB type.")
TypeError: The encoding schema is not a valid IOB type.
```
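For context, the check that raises this error validates each tag while converting IOB1 to IOB2: anything that is neither `O` nor prefixed with `B-`/`I-` is rejected, which is exactly what happens when token text ends up in the label column. A minimal re-implementation sketch of such a validator (assumed behavior, not fastNLP's exact code):

```python
def iob2(tags):
    """Validate a tag sequence and convert IOB1 to IOB2 (hypothetical
    re-implementation of the check in fastNLP/io/pipe/utils.py)."""
    tags = list(tags)
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        parts = tag.split("-")
        # Any tag that is not O, B-X, or I-X fails the schema check.
        if len(parts) != 2 or parts[0] not in ("I", "B"):
            raise TypeError("The encoding schema is not a valid IOB type.")
        if parts[0] == "B":
            continue
        # An I- tag starts a new entity if it follows O, the sequence start,
        # or an entity of a different type; promote it to B-.
        if i == 0 or tags[i - 1] == "O" or tags[i - 1][2:] != parts[1]:
            tags[i] = "B-" + parts[1]
    return tags
```

So if the loader reads a token like `Israel` where it expects a label, this check fires immediately.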

I'm running on colab.

As for conll2003, I've simply extracted the original files for English and have put them in a folder data/conll2003 as per your instructions.

As for ontonotes, to generate BIO tags I've followed this repo: https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO and put the files in data/en-ontonotes/english/ as per the instructions.

The folder currently contains onto.development.ner, onto.train.ner, and onto.test.ner, as you can see in the image below:

[screenshot of the folder contents]

Could you please advise what I am doing wrong? Thanks.

yhcc commented 3 years ago

You should make sure the first and second columns of your data are tokens and labels, respectively. Based on the sample at https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO/blob/master/onto.test.ner.sample , that script puts the label in the last column. Alternatively, if you'd rather not change your data files, you can change the following line https://github.com/yhcc/BARTNER/blob/5d562fde9ff4dfe5cd8df9e2b30a3d0fb7ae5917/data/pipe.py#L249

to `super().__init__(headers=headers, indexes=[0, -1])`. This tells the loader to treat the last column as the label column.
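To illustrate what `indexes` does, here is a sketch of how a CoNLL-style loader picks columns out of a whitespace-separated row (`read_conll_line` is a made-up helper for illustration; the real loader is fastNLP's `ConllLoader`):

```python
def read_conll_line(line, indexes=(0, 1)):
    """Select (token, label) columns from one whitespace-separated CoNLL row.

    indexes=(0, 1) reads the second column as the label (BARTNER's default);
    indexes=(0, -1) reads the last column instead, matching the
    OntoNotes-5.0-NER-BIO output where the label comes last.
    """
    cols = line.split()
    return tuple(cols[i] for i in indexes)

# A row shaped like the OntoNotes BIO files, with the label in the last column:
row = "Israel NNP (TOP(S(NP* B-GPE"
token, label = read_conll_line(row, indexes=(0, -1))
```

With the default `indexes=(0, 1)`, the same row would yield the POS tag `NNP` as the "label", which is why the IOB check rejects it.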

Dimiftb commented 3 years ago

Hi @yhcc,

Thank you very much for your reply; that fixed the issue. I managed to train the model. However, I was wondering how I can display metrics (F1, recall, precision) on the test set?

This is the current output that I have once execution has finished:

[screenshot of the training output]

yhcc commented 3 years ago

We follow previous papers and merge the dev and train sets into a single training set. Therefore, for the conll2003 dataset, the reported dev metric is the final test metric.

Dimiftb commented 3 years ago

Hi @yhcc,

Thanks for your reply. How can I go about merging the train and dev sets? Is there already functionality for this? Also, how do I get the metrics to display?

Thank you very much for helping me thus far

yhcc commented 3 years ago

The merging happens in https://github.com/yhcc/BARTNER/blob/a42c3bb84f2bec09e02b30f26beae9a2b4d0b868/train.py#L220
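Conceptually, the merge at that line just appends the dev examples to the training examples; a hypothetical sketch (`merge_splits` is a made-up name — the real code operates on fastNLP `DataSet` objects):

```python
def merge_splits(train_examples, dev_examples):
    """Concatenate dev examples onto train, mirroring the dev-into-train
    merge done in BARTNER's train.py (conceptual sketch only)."""
    return list(train_examples) + list(dev_examples)

merged = merge_splits(
    [("EU", "B-ORG"), ("rejects", "O")],   # train split
    [("Peter", "B-PER"), ("Blackburn", "I-PER")],  # dev split
)
```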

The metric will display once you have trained for several epochs (15 epochs for conll2003). We set this because, in our experiments, the best performance only occurs after that epoch; to save evaluation time, the code only evaluates from that epoch onward. You can change this behavior by changing https://github.com/yhcc/BARTNER/blob/a42c3bb84f2bec09e02b30f26beae9a2b4d0b868/train.py#L49 to 1
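The gating amounts to a simple epoch threshold; a sketch of the idea (`should_evaluate` and `eval_start_epoch` are made-up names for illustration — the actual constant lives at train.py#L49):

```python
def should_evaluate(epoch, eval_start_epoch=15):
    """Return True once training has reached the first evaluation epoch.
    Setting eval_start_epoch=1 evaluates (and prints metrics) every epoch."""
    return epoch >= eval_start_epoch
```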