
xlxwalex / FCGEC

The Corpus & Code for EMNLP 2022 paper "FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction" | FCGEC Chinese grammatical error correction corpus and STG model

https://aclanthology.org/2022.findings-emnlp.137

Apache License 2.0


Data #20

Closed Zari222 closed 1 year ago

Zari222 commented 1 year ago

Hi, I'm actually new to GEC and have some questions; I'd appreciate your help. I saw that for GEC tasks the dataset is usually stored as source/target files, and that evaluation uses files in M2 format. But I think you used JSON for both training and evaluation. What is the difference between them? Also, I saw that you used ChERRANT to compute the metrics. Is it the same as ERRANT? I mean, does it take data in M2 format and calculate the metrics?

xlxwalex commented 1 year ago

Hi,

Exactly, the JSON file is only used to store the data in FCGEC. The data is converted to a .csv file (with sentence / label columns) for training and testing via preprocess_data.py.

Unlike other GEC works, we use operations (rather than a parallel corpus of source / target pairs) to correct the sentences. More details can be found in our paper and in the data folder of this repo.

As for evaluation, we use ChERRANT, borrowed from MuCGEC, to calculate precision, recall and the F0.5 score. As far as I know, it can be seen as a Chinese version of ERRANT. Therefore, ChERRANT also needs the parallel data to be converted to M2 format before it computes the metrics.
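To make the M2 format and the F0.5 score concrete, here is a minimal sketch. It is not ChERRANT's code: the function names `parse_m2_block` and `precision_recall_f` are made up for illustration, the toy sentence is English, and the edit type labels are only examples. It assumes a single annotator and scores one sentence, whereas real scorers aggregate counts over the whole corpus.

```python
from typing import List, Set, Tuple

# (start token index, end token index, correction string)
Edit = Tuple[int, int, str]


def parse_m2_block(block: str) -> Tuple[List[str], Set[Edit]]:
    """Parse one M2 block: an 'S' line followed by zero or more 'A' lines."""
    tokens: List[str] = []
    edits: Set[Edit] = set()
    for line in block.strip().splitlines():
        if line.startswith("S "):
            tokens = line[2:].split()
        elif line.startswith("A "):
            fields = line[2:].split("|||")
            start, end = map(int, fields[0].split())
            if fields[1] == "noop":  # marks a sentence that needs no edit
                continue
            edits.add((start, end, fields[2]))
    return tokens, edits


def precision_recall_f(hyp: Set[Edit], ref: Set[Edit], beta: float = 0.5):
    """Compare hypothesis edits with reference edits and return (P, R, F_beta)."""
    tp = len(hyp & ref)  # edits proposed and also in the reference
    fp = len(hyp - ref)  # edits proposed but not in the reference
    fn = len(ref - hyp)  # reference edits that were missed
    # Convention chosen for this sketch: an empty denominator scores 1.0.
    p = tp / (tp + fp) if tp + fp else 1.0
    r = tp / (tp + fn) if tp + fn else 1.0
    denom = beta ** 2 * p + r
    f = (1 + beta ** 2) * p * r / denom if denom else 0.0
    return p, r, f


REFERENCE_M2 = """S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
"""

HYPOTHESIS_M2 = """S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
A 3 4|||R:NOUN:NUM|||sentences|||REQUIRED|||-NONE-|||0
"""

tokens, ref_edits = parse_m2_block(REFERENCE_M2)
_, hyp_edits = parse_m2_block(HYPOTHESIS_M2)
p, r, f = precision_recall_f(hyp_edits, ref_edits)
print(f"P={p:.4f} R={r:.4f} F0.5={f:.4f}")
```

With beta = 0.5, precision is weighted more heavily than recall, so the one spurious edit in the hypothesis lowers the score more than one missed edit would.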

Moreover, there is a similar issue, ISSUE17, which may help you.

If you have any questions, feel free to add comments here!

Zari222 commented 1 year ago

So, because you use operations (rather than a parallel corpus of source / target pairs) to correct the sentences, your format is JSON so that you can have the operation and error flag keys? Also, other works use the parallel format, where there is no way to tell which sentences are correct (they have no flags); won't that cause problems? Did you collect the data yourself and put it into JSON format? Is it a difficult task? I ask because I want to create my own dataset.

xlxwalex commented 1 year ago

No, actually FCGEC does not focus only on grammatical error correction; we split the task into three subtasks: grammatical error detection (error_flag), error type identification (error_type) and GEC. That is why our corpus has these flags, unlike other GEC datasets.

In FCGEC, we collected the data from two sources, which are described in our paper. We then used our internal Annotation Tool (you can find descriptions on pages 9-12) to annotate the data and generate the JSON files.

If you want to create your own dataset and already have parallel data, I would recommend our script convert_seq2seq_to_operation.py. It converts parallel data in source / target format to our operation format (though you may need to modify the script to fit your language).
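As a rough illustration of what turning a source / target pair into edit operations means, here is a minimal sketch using Python's standard `difflib`. It is not convert_seq2seq_to_operation.py and does not produce FCGEC's operation format: the function names and the `op` / `start` / `end` / `text` keys are hypothetical, and it works on characters only.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def pair_to_operations(source: str, target: str) -> List[Dict[str, object]]:
    """Derive character-level edit operations that turn source into target."""
    ops: List[Dict[str, object]] = []
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged span, no operation needed
        # tag is 'replace', 'delete' or 'insert'; text is empty for 'delete'
        ops.append({"op": tag, "start": i1, "end": i2, "text": target[j1:j2]})
    return ops


def apply_operations(source: str, ops: List[Dict[str, object]]) -> str:
    """Replay the operations on source to rebuild the target sentence."""
    pieces: List[str] = []
    cursor = 0
    for op in ops:
        pieces.append(source[cursor:op["start"]])  # copy the untouched span
        pieces.append(op["text"])                  # emit the correction
        cursor = op["end"]                         # skip the replaced span
    pieces.append(source[cursor:])
    return "".join(pieces)


source = "I has a apple"
target = "I have an apple"
operations = pair_to_operations(source, target)
for op in operations:
    print(op)
print(apply_operations(source, operations) == target)
```

A sentence that is already correct yields an empty operation list, which is one way an operation-based format can also carry the "no error" signal that plain source / target pairs lack.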

Zari222 commented 1 year ago

Thank you.

xlxwalex commented 1 year ago

You're welcome :)