xlxwalex / FCGEC

The Corpus & Code for EMNLP 2022 paper "FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction" | FCGEC Chinese grammatical error correction corpus and STG model
https://aclanthology.org/2022.findings-emnlp.137
Apache License 2.0
104 stars, 12 forks

Data loading problem when running run_stg_joint.sh #8

Closed. Candice52 closed this issue 1 year ago

Candice52 commented 1 year ago

As the title says, loading the training data that ships with the project fails with the following error:

以“都市桃花源”为主 题,用两个巨大而奇特的魔比斯环为载体,世博湖南馆将吸引众多参观者的眼球而流连忘返。
Processing train Dataset[Tagger/Gen Part]: 13%|████████▍ | 2651/19758 [00:01<00:10, 1683.82it/s]
Traceback (most recent call last):
  File "joint_stg.py", line 76, in <module>
    train(args, checkp)
  File "joint_stg.py", line 32, in train
    Trainset = JointDataset(args, train_dir, 'train')
  File "/home/jydong/FCGEC-main/model/STG-correction/DataProcessor/JointDataset.py", line 32, in __init__
    self.gen_token, self.genwd_idx, self.tgt_mlm = self._process_tagger(self.sentences, self.operates)
  File "/home/jydong/FCGEC-main/model/STG-correction/DataProcessor/JointDataset.py", line 196, in _process_tagger
    gen_token, gen_label = tagger2generator(tokens, label_comb['tagger'], label_comb['mask_label'])
  File "/home/jydong/FCGEC-main/model/STG-correction/utils/mask.py", line 44, in convert_tagger2generator
    post_sequence.append(tokens[index])
IndexError: list index out of range

How can I fix this?

Candice52 commented 1 year ago

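It is probably because the vocab of the pre-trained language model I am using does not contain certain characters... so that token cannot be edited? For example, this sentence fails because it contains a space... removing the space fixes it.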

xlxwalex commented 1 year ago

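Hello, this problem is indeed caused by the space. The data must not contain any Chinese (full-width) or English (half-width) spaces at conversion time; otherwise the convertor produces alignment errors. I will check the data again shortly to make sure every sample can be converted by the script. Thank you for the feedback!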
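For anyone hitting this with their own data, a minimal sketch of a pre-check for the whitespace problem described above might look like the following. The helper names `strip_spaces` and `find_sentences_with_spaces` are hypothetical and not part of the FCGEC code base, and the sketch assumes the sentences are already loaded as plain Python strings.

```python
import re

# Matches any run of whitespace. For str patterns in Python 3, \s already
# covers Unicode whitespace; the full-width (ideographic) space U+3000 is
# listed explicitly only to make the intent obvious.
_SPACE_RE = re.compile(r"[\s\u3000]+")


def strip_spaces(sentence: str) -> str:
    """Return the sentence with all half-width and full-width spaces removed."""
    return _SPACE_RE.sub("", sentence)


def find_sentences_with_spaces(sentences):
    """Return the indices of sentences that contain any whitespace."""
    return [i for i, s in enumerate(sentences) if _SPACE_RE.search(s)]


if __name__ == "__main__":
    # Hypothetical samples; the second one mimics the space in the report.
    samples = ["为主题", "为主 题", "甲\u3000乙"]
    for i in find_sentences_with_spaces(samples):
        print(i, repr(samples[i]), "->", repr(strip_spaces(samples[i])))
```

Listing the affected samples before cleaning them is probably the safer order: if the annotations refer to character positions, removing a space shifts those positions, so the flagged samples may need to be checked by hand.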

xlxwalex commented 1 year ago

Hello, we have fixed all the data that caused the program to fail. You can use the updated data; the details of the update are listed in the README.