Structure of Dataset - Githubissues

xlxwalex / FCGEC

The Corpus & Code for EMNLP 2022 paper "FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction" | FCGEC Chinese grammatical error correction corpus and STG model

https://aclanthology.org/2022.findings-emnlp.137

Apache License 2.0


Structure of Dataset #17

Closed: saramoeini20 closed this issue 1 year ago

saramoeini20 commented 1 year ago

Hi, I'm a beginner at GEC and I have a question about the structure of the dataset, because I want to create one myself for my work. Your data is in JSON, but elsewhere I sometimes see the M2 format or a parallel-file format. How do these formats differ, and when should each one be used? I would be thankful for any help.

xlxwalex commented 1 year ago

Hi, the data format in FCGEC is different from the M2 format. We use an operation-oriented paradigm to annotate the dataset. More details and examples of our data can be found in Appendix B of our paper (https://aclanthology.org/2022.findings-emnlp.137.pdf) and in the data folder.

Besides, the M2 format is only used to compute the performance of the model (precision, recall and F0.5), with a scorer that we borrow from MuCGEC (ChERRANT). You can find more details in the scorer folder and in Section 4.1 of our paper.
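As a rough illustration of the metrics mentioned above, here is a minimal sketch of how precision, recall and F0.5 follow from two sets of edits. It is not the ChERRANT scorer: the function name `edit_prf`, the tuple layout `(start, end, correction)`, and the choice to return 1.0 for precision or recall when there is nothing to count are all assumptions made for this example.

```python
def edit_prf(hyp_edits, gold_edits, beta=0.5):
    """Precision, recall and F-beta over two collections of edits.

    Each edit is a hashable tuple such as (start, end, correction);
    this layout is hypothetical and chosen only for the illustration.
    """
    hyp, gold = set(hyp_edits), set(gold_edits)
    tp = len(hyp & gold)   # edits the system proposed that are also in the reference
    fp = len(hyp - gold)   # proposed edits that the reference does not contain
    fn = len(gold - hyp)   # reference edits that the system missed

    # Assumed convention: an empty denominator counts as a perfect score.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0

    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta


if __name__ == "__main__":
    hyp = [(1, 2, "a"), (3, 4, "b")]
    gold = [(1, 2, "a"), (5, 6, "c"), (7, 8, "d")]
    print(edit_prf(hyp, gold))
```

With beta = 0.5, precision is weighted more heavily than recall, which is why a system that proposes few but correct edits scores relatively well.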

If you want to create data in the FCGEC format, you can use our convert_seq2seq_to_operation.py script. The algorithm is described in the README of the scripts folder and in Algorithm 1 of our paper. The script makes it convenient to convert ordinary seq2seq data to our operation format.

If you have any more questions, feel free to add comments here!

saramoeini20 commented 1 year ago

Is the convert_seq2seq_to_operation.py script just for Chinese? If I want to use it for another language, should I modify it, or can it not be used at all?

saramoeini20 commented 1 year ago

And for computing the performance of the model, is M2 the only format that can be used in GEC tasks?

xlxwalex commented 1 year ago

Is the convert_seq2seq_to_operation.py script just for Chinese? If I want to use it for another language, should I modify it, or can it not be used at all?

Yes, as it stands the convert_seq2seq_to_operation.py script can only be used to convert Chinese data, but you can modify it for other languages (e.g., for English, you can treat each word as a character when matching).
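To make the "treat each word as a character" idea concrete, here is a minimal sketch that switches the matching unit between characters and whitespace-separated words. It does not reproduce convert_seq2seq_to_operation.py: the helper names `to_units` and `diff_ops` are hypothetical, and the alignment comes from Python's standard `difflib` rather than from Algorithm 1 of the paper.

```python
import difflib


def to_units(sentence, word_level):
    """Split a sentence into matching units.

    Characters suit Chinese; whitespace-separated words are an assumed
    stand-in for languages such as English.
    """
    return sentence.split() if word_level else list(sentence)


def diff_ops(source, target, word_level=False):
    """Return the non-matching spans between source and target.

    Each item is (tag, source_units, target_units), where tag is one of
    difflib's 'replace', 'delete' or 'insert'.
    """
    src = to_units(source, word_level)
    tgt = to_units(target, word_level)
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    return [
        (tag, src[i1:i2], tgt[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Character-level matching, as one would use for Chinese.
    print(diff_ops("我喜欢吃苹果果", "我喜欢吃苹果"))
    # Word-level matching, as one might use for English.
    print(diff_ops("She go to school", "She goes to school", word_level=True))
```

The only language-specific part in this sketch is `to_units`; everything after it works on a list of units and does not care whether a unit is a character or a word.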

xlxwalex commented 1 year ago

And for computing the performance of the model, is M2 the only format that can be used in GEC tasks?

Yes, for the precision, recall and F0.5 metrics in GEC tasks, the predictions and ground truths are first processed into parallel form and then converted to M2 format, from which ChERRANT computes the metrics.
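For readers who have not seen M2 before, the step from a parallel pair to M2-style lines can be sketched roughly as below. This is only an illustration under stated assumptions, not ChERRANT's converter: the function `parallel_to_m2` is hypothetical, the alignment uses the standard `difflib` module, and the edit type field simply reuses difflib's tag names as placeholders instead of real error-type labels.

```python
import difflib


def parallel_to_m2(source_tokens, target_tokens, annotator=0):
    """Render one (source, target) pair as M2-style text.

    An 'S' line holds the tokenized source; each 'A' line holds one edit
    as: start end|||type|||correction|||REQUIRED|||-NONE-|||annotator.
    The type field here is a placeholder, not a real error category.
    """
    lines = ["S " + " ".join(source_tokens)]
    matcher = difflib.SequenceMatcher(a=source_tokens, b=target_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        correction = " ".join(target_tokens[j1:j2])  # empty for a deletion
        lines.append(f"A {i1} {i2}|||{tag}|||{correction}|||REQUIRED|||-NONE-|||{annotator}")
    if len(lines) == 1:
        # Conventional marker for a sentence that needs no correction.
        lines.append(f"A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||{annotator}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(parallel_to_m2("She go to school".split(), "She goes to school".split()))
```

Running the same conversion on both the predictions and the ground truths yields two edit sets over the same source sentences, which is what the scorer then compares.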

saramoeini20 commented 1 year ago

Thank you so much for your complete and timely response.

xlxwalex commented 1 year ago

You're welcome :)