xlxwalex / FCGEC

The Corpus & Code for EMNLP 2022 paper "FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction" | FCGEC Chinese grammatical error correction corpus and the STG model
https://aclanthology.org/2022.findings-emnlp.137
Apache License 2.0

IndexError: index 2992 is out of bounds for axis 0 with size 2992 #14

Closed: kingfan1998 closed this issue 1 year ago

kingfan1998 commented 1 year ago

Hi, I swapped in my own dataset (also 3,000 sentences) and ran joint_evaluate.py with the previously trained model parameters, and it fails with IndexError: index 2992 is out of bounds for axis 0 with size 2992. With your test.csv it runs fine. How can I fix this?

xlxwalex commented 1 year ago

Could you share more detailed error information, such as where the error is raised? With that I may be able to help further.

kingfan1998 commented 1 year ago

```
Traceback (most recent call last):
  File "/home/jinfan/git_code/FCGEC-main/model/STG-correction/joint_evaluate.py", line 172, in <module>
    evaluate(args)
  File "/home/jinfan/git_code/FCGEC-main/model/STG-correction/joint_evaluate.py", line 111, in evaluate
    tag_gtstokens, _, _ = reconstruct_tagger(padding(switch_gts, args.padding_size, args.padding_val),
  File "/home/jinfan/git_code/FCGEC-main/model/STG-correction/utils/data_utils.py", line 147, in reconstruct_tagger_V2
    tag_cur = tagger[lidx]
IndexError: index 2992 is out of bounds for axis 0 with size 2992
```

xlxwalex commented 1 year ago

This error looks a bit odd. My suggestion: in this part of joint_evaluate.py:

```python
    for step, batch_data in enumerate(tqdm(TestLoader, desc='Processing Tagger')):
        # Process Data
        tokens = batch_data
        padded_token = padding(tokens, args.padding_size, args.padding_val)
        tagger_tokens.extend(padded_token.tolist())
        attn_mask = attention_mask(padded_token, args.padding_val).to(device)
        token_padded = torch.from_numpy(padded_token).to(device)
        # Model Value
        with torch.no_grad():
            tagger_logits, comb_logits = model.tagger(token_padded, attn_mask)
        tagger_preds = np.argmax(tagger_logits.detach().cpu().numpy(), axis=2).astype('int32')
        comb_preds = np.argmax(comb_logits.detach().cpu().numpy(), axis=2).astype('int32')
        pred_tagger.extend(tagger_preds)
        pred_comb.extend(comb_preds)
        met_masks.extend(attn_mask.detach().cpu().numpy())

    print('Construct Generator Data')
    tag_construct = (pred_tagger, pred_comb)
    tag_tokens, mlm_tgt_masks, tg_mapper = reconstruct_tagger(np.array(tagger_tokens), tag_construct)
```

add a line right after pred_tagger.extend(tagger_preds): print('{} > {}, {}'.format(len(tagger_tokens), len(pred_tagger), len(tagger_tokens) == len(pred_tagger))) (note the == comparison), then run it and check whether the last value printed is False.
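For reference, a self-contained sketch of that diagnostic (dummy arrays stand in for the real DataLoader batches; variable names follow the snippet above):

```python
import numpy as np

# Self-contained sketch of the suggested diagnostic; dummy arrays stand in for
# the padded tokens and tagger predictions collected per batch.
tagger_tokens, pred_tagger = [], []
for _ in range(2):                                    # two dummy batches
    padded_token = np.zeros((16, 8), dtype=np.int32)  # batch of 16 padded sequences
    tagger_preds = np.zeros((16, 8), dtype=np.int32)  # matching tagger predictions
    tagger_tokens.extend(padded_token.tolist())
    pred_tagger.extend(tagger_preds)
    # Diagnostic line from the comment above: both lists should grow in lockstep
    print('{} > {}, {}'.format(len(tagger_tokens), len(pred_tagger),
                               len(tagger_tokens) == len(pred_tagger)))
```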

kingfan1998 commented 1 year ago

true

xlxwalex commented 1 year ago

Then could you add a print after line 144 of STG-correction/utils/data_utils.py showing the batch size and the lengths of tagger and insmod? This seems quite strange; if they are equal, the index should not go out of bounds.
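To make the symptom concrete, here is a small standalone illustration (not the project's code) of how being off by a single entry between the padded token array and the indices iterated over reproduces exactly this kind of error:

```python
import numpy as np

# Standalone illustration, not FCGEC code: iterating one index past the end of
# a (2992, seq_len) array raises the same "index 2992 is out of bounds for
# axis 0 with size 2992" reported above.
tagger = np.zeros((2992, 128), dtype=np.int32)  # 2992 padded rows
try:
    for lidx in range(2993):                    # one index too many
        tag_cur = tagger[lidx]
except IndexError as err:
    print(err)  # index 2992 is out of bounds for axis 0 with size 2992
```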

xlxwalex commented 1 year ago

```
3000 > 3000, True
Construct Generator Data
Construct Generator Data
Processing Generator: 100%|██████████| 47/47 [00:03<00:00, 13.64it/s]
Traceback (most recent call last):
  File "/home/jinfan/git_code/FCGEC-main/model/STG-correction/joint_evaluate.py", line 203, in <module>
    evaluate(args)
  File "/home/jinfan/git_code/FCGEC-main/model/STG-correction/joint_evaluate.py", line 174, in evaluate
    outputs = fillin_tokens(tag_tokens, mlm_tgt_masks, pred_mlm)
  File "/home/jinfan/git_code/FCGEC-main/model/STG-correction/utils/data_utils.py", line 253, in fillin_tokens
    posts.append(mlm_tgts[tgt_counter])
IndexError: list index out of range
Start to constrcut final output
```

For this one, print after line 244 of data_utils the len of mlm_tgts and the sum of mlm_masks and check whether they match. If they differ, you will need to debug tagger_generator.
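A hedged sketch of that consistency check with dummy data (the names mlm_tgts and mlm_masks come from the comment and traceback above; the expectation that each masked position consumes one target token is the assumption being tested):

```python
import numpy as np

# Dummy stand-ins for the real tensors; in fillin_tokens every masked position
# is expected to consume one entry of mlm_tgts, so the two counts should match.
mlm_masks = np.array([[0, 1, 0, 1],
                      [0, 0, 1, 0]])          # 3 masked positions in total
mlm_tgts = ['的', '了', '是']                  # 3 target tokens

print('masked positions: {}, targets: {}'.format(int(mlm_masks.sum()), len(mlm_tgts)))
# If masked positions > len(mlm_tgts), mlm_tgts[tgt_counter] eventually raises
# the "list index out of range" seen in the traceback.
```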

kingfan1998 commented 1 year ago

> Then could you add a print after line 144 of STG-correction/utils/data_utils.py showing the batch size and the lengths of tagger and insmod? This seems quite strange; if they are equal, the index should not go out of bounds.

I reduced the data to 200 entries and now it runs.

xlxwalex commented 1 year ago

Okay. Since I don't have your data, I can't really work out what is going on with this error, so you will probably need to debug it yourself. Feel free to reply if you run into other issues.