Problem encountered when loading data with run_stg_joint.sh

Candice52 commented 1 year ago

As the title says, I run into the following problem when loading the training data that ships with the project:

以“都市桃花源”为主题，用两个巨大而奇特的魔比斯环为载体，世博湖南馆将吸引众多参观者的眼球而流连忘返。
Processing train Dataset[Tagger/Gen Part]: 13%|████████▍ | 2651/19758 [00:01<00:10, 1683.82it/s]
Traceback (most recent call last):
  File "joint_stg.py", line 76, in <module>
    train(args, checkp)
  File "joint_stg.py", line 32, in train
    Trainset = JointDataset(args, train_dir, 'train')
  File "/home/jydong/FCGEC-main/model/STG-correction/DataProcessor/JointDataset.py", line 32, in __init__
    self.gen_token, self.genwd_idx, self.tgt_mlm = self._process_tagger(self.sentences, self.operates)
  File "/home/jydong/FCGEC-main/model/STG-correction/DataProcessor/JointDataset.py", line 196, in _process_tagger
    gen_token, gen_label = tagger2generator(tokens, label_comb['tagger'], label_comb['mask_label'])
  File "/home/jydong/FCGEC-main/model/STG-correction/utils/mask.py", line 44, in convert_tagger2generator
    post_sequence.append(tokens[index])
IndexError: list index out of range

How can I fix this?

Candice52 commented 1 year ago

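It is probably because the vocab of the pre-trained language model I am using does not contain certain characters, so that token cannot be edited? For example, this sentence fails because it contains a space. Removing the space fixes it.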
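A toy illustration of how a single space could produce this kind of IndexError. It assumes, without confirmation from the project code, that the labels are aligned to the raw characters while the tokenizer silently drops whitespace. `toy_tokenize` is a hypothetical stand-in, not the tokenizer FCGEC actually uses:

```python
from typing import List


def toy_tokenize(sentence: str) -> List[str]:
    """Hypothetical character-level tokenizer that drops whitespace."""
    return [ch for ch in sentence if not ch.isspace()]


sentence = "湖南馆 将吸引参观者"   # contains one half-width space
labels = [0] * len(sentence)      # assumption: one label per raw character
tokens = toy_tokenize(sentence)

# The label sequence is now one element longer than the token sequence,
# so indexing tokens by label position runs past the end of the list.
print(len(labels), len(tokens))

error_message = None
try:
    aligned = [tokens[i] for i in range(len(labels))]
except IndexError as exc:
    error_message = str(exc)
    print("IndexError:", error_message)
```

Whether the real mismatch comes from the vocab or from whitespace handling in the tokenizer is not something this sketch can settle; it only shows why a length mismatch surfaces as an out-of-range index.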

xlxwalex commented 1 year ago

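Hello, this problem is indeed caused by the space. The data must not contain any Chinese (full-width) or English (half-width) spaces at conversion time, otherwise the convertor runs into an alignment error. I will re-check the data shortly to make sure every sample can be converted by the script. Thank you for the feedback!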
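Until the data is updated, one possible workaround is to locate and strip spaces before running the conversion. The sketch below is only an illustration: `remove_spaces` and `find_sentences_with_spaces` are hypothetical helpers, not part of FCGEC, and the pattern covers only U+0020 and U+3000, which is one reading of "Chinese/English spaces":

```python
import re
from typing import List

# Half-width space (U+0020) and full-width ideographic space (U+3000).
SPACE_PATTERN = re.compile("[\u0020\u3000]")


def remove_spaces(sentence: str) -> str:
    """Return the sentence with half- and full-width spaces removed."""
    return SPACE_PATTERN.sub("", sentence)


def find_sentences_with_spaces(sentences: List[str]) -> List[int]:
    """Return the indices of sentences that contain at least one space."""
    return [i for i, s in enumerate(sentences) if SPACE_PATTERN.search(s)]


samples = [
    "世博湖南馆 将吸引众多参观者。",   # half-width space
    "没有空格的句子。",               # no space
    "全角\u3000空格。",               # full-width space
]
bad_indices = find_sentences_with_spaces(samples)
cleaned = [remove_spaces(s) for s in samples]
print(bad_indices)
print(cleaned)
```

If the operation labels in a sample refer to character positions in the raw sentence, stripping spaces shifts those positions, so it may be safer to use the index list to inspect or drop the affected samples rather than clean them blindly.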

xlxwalex commented 1 year ago

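Hello, we have fixed all the data that caused the program to raise errors. You can use the updated data; the details of the update are described in the README.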

xlxwalex / FCGEC

Problem encountered when loading data with run_stg_joint.sh #8