Closed: Candice52 closed this issue 1 year ago
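It is probably because the vocab of the pretrained language model I am using does not contain certain characters, so that token cannot be edited? For example, this sentence fails because it contains a space. Removing the space fixes it.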
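Hello, this problem is indeed caused by the space. During conversion the data must not contain any Chinese (full-width) or English (half-width) spaces; otherwise the convertor runs into an alignment error. I will check the data again later to make sure every sample can be converted by the script. Thank you for the feedback!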
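For anyone who wants to check a local copy of the data before running the converter, here is a minimal sketch. It only assumes that the sentences are available as a list of Python strings; the helper names `find_sentences_with_spaces` and `strip_spaces` are hypothetical and are not part of the FCGEC codebase.

```python
import re

# For str patterns in Python 3, \s already matches Unicode whitespace;
# the full-width (ideographic) space U+3000 is listed explicitly for clarity.
_SPACE_RE = re.compile(r"[\s\u3000]")


def find_sentences_with_spaces(sentences):
    """Return (index, sentence) pairs for sentences that contain any space."""
    return [(i, s) for i, s in enumerate(sentences) if _SPACE_RE.search(s)]


def strip_spaces(sentence):
    """Remove every half-width and full-width space from a sentence."""
    return _SPACE_RE.sub("", sentence)


# Illustrative samples: the first has a half-width space,
# the third has a full-width space.
samples = ["以“都市桃花源”为主 题", "世博湖南馆", "流连\u3000忘返"]

bad = find_sentences_with_spaces(samples)
cleaned = [strip_spaces(s) for s in samples]

for index, sentence in bad:
    print(f"sample {index} contains a space: {sentence!r}")
```

Note that stripping spaces changes character offsets, so if the operation labels of a sample were annotated against the original text, it is probably safer to use the updated data than to patch sentences locally.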
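Hello, we have fixed all of the data that caused the program to raise errors. You can use the updated data; the details of the update are included in the README!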
As the title says, when loading the training data that ships with the project, I run into the following problem:
以“都市桃花源”为主 题,用两个巨大而奇特的魔比斯环为载体,世博湖南馆将吸引众多参观者的眼球而流连忘返。
Processing train Dataset[Tagger/Gen Part]: 13%|████████▍ | 2651/19758 [00:01<00:10, 1683.82it/s]
Traceback (most recent call last):
File "joint_stg.py", line 76, in
train(args, checkp)
File "joint_stg.py", line 32, in train
Trainset = JointDataset(args, train_dir, 'train')
File "/home/jydong/FCGEC-main/model/STG-correction/DataProcessor/JointDataset.py", line 32, in __init__
self.gen_token, self.genwd_idx, self.tgt_mlm = self._process_tagger(self.sentences, self.operates)
File "/home/jydong/FCGEC-main/model/STG-correction/DataProcessor/JointDataset.py", line 196, in _process_tagger
gen_token, gen_label = tagger2generator(tokens, label_comb['tagger'], label_comb['mask_label'])
File "/home/jydong/FCGEC-main/model/STG-correction/utils/mask.py", line 44, in convert_tagger2generator
post_sequence.append(tokens[index])
IndexError: list index out of range
How can I solve this?