z814081807 / DeepNER

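Champion solution for the Tianchi traditional Chinese medicine instruction entity recognition challenge; Chinese named entity recognition; NER; BERT-CRF & BERT-SPAN & BERT-MRC; PyTorch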

Questions about the vocabulary and the load_model_and_parallel function #1

Closed yanqiangmiffy closed 3 years ago

yanqiangmiffy commented 3 years ago
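  1. (Note: two [unused] entries in vocab.txt must be manually changed to [INV] and [BLANK].) Is this replacement mandatory? Will it raise an error if I skip it?
  2. Training currently runs fine, but during crf_evaluation an error is raised at load_model_and_parallel. I tried several times and the error is not always the same; I have seen three variants. The first: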
    /anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1676, in linear
    output = input.matmul(weight.t())
    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
    01/09/2021 11:03:23 - INFO - wandb.internal.internal -   Internal process exited

    The second:

    anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1678, in linear
    output += bias
    RuntimeError: CUDA error: device-side assert triggered
    01/09/2021 10:21:48 - INFO - wandb.internal.internal -   Internal process exited

The third:

    context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
    RuntimeError: CUDA error: device-side assert triggered

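I searched Google for this bug. The most common answer is that the label indices are wrong. My dataset is not the Tianchi dataset, so I would like to ask: if the annotations contain no S tag (BMES scheme), will that cause an error? The other answer is GPU OOM, so I would also like to ask whether this can happen on a single GPU: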

01/09/2021 11:03:13 - INFO - src.utils.trainer -   Saving model & optimizer & scheduler checkpoint to ./out/roberta_wwm_wd_crf/checkpoint-1005
01/09/2021 11:03:16 - INFO - src.utils.functions_utils -   Load model from ./out/roberta_wwm_wd_crf/checkpoint-603/model.pt
01/09/2021 11:03:17 - INFO - src.utils.functions_utils -   Load model from ./out/roberta_wwm_wd_crf/checkpoint-804/model.pt
01/09/2021 11:03:17 - INFO - src.utils.functions_utils -   Load model from ./out/roberta_wwm_wd_crf/checkpoint-1005/model.pt
01/09/2021 11:03:18 - INFO - src.utils.functions_utils -   Save swa model in: ./out/roberta_wwm_wd_crf/checkpoint-100000
01/09/2021 11:03:21 - INFO - src.utils.trainer -   Train done
../../bert/torch_roberta_wwm/vocab.txt
01/09/2021 11:03:21 - INFO - src.preprocess.processor -   Convert 738 examples to features
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
01/09/2021 11:03:22 - INFO - src.preprocess.processor -   Build 738 features
['0']
cuda:0
01/09/2021 11:03:22 - INFO - src.utils.functions_utils -   Load ckpt from ./out/roberta_wwm_wd_crf/checkpoint-201/model.pt
01/09/2021 11:03:23 - INFO - src.utils.functions_utils -   Use single gpu in: ['0']
Traceback (most recent call last):
  File "main.py", line 215, in <module>
    training(args)
  File "main.py", line 136, in training
    train_base(opt, train_examples, dev_examples)
  File "main.py", line 78, in train_base
    tmp_metric_str, tmp_f1 = crf_evaluation(model, dev_info, device, ent2id)
  File "/home/quincyqiang/Projects/Water-Conservancy-KG/DeepNER/src/utils/evaluator.py", line 150, in crf_evaluation
    for tmp_pred in get_base_out(model, dev_loader, device):
  File "/home/quincyqiang/Projects/Water-Conservancy-KG/DeepNER/src/utils/evaluator.py", line 22, in get_base_out

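The log above shows that checkpoint-603, checkpoint-804, and so on were loaded first, and then checkpoint-201 was evaluated right after. Could the earlier loads leave too little memory for the later one?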

yanqiangmiffy commented 3 years ago

https://github.com/huggingface/transformers/issues/1805#issuecomment-554758144

It's probably because your token embeddings size (vocab size) doesn't match with pre-trained model. Do model.resize_token_embeddings(len(tokenizer)) before training. Please check #1848 and #1849

Could the cause be a vocabulary mismatch?
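
One way to test that hypothesis before touching the GPU is to compare every token id produced by the tokenizer with the number of rows in the embedding table. The following is a minimal sketch in plain Python; `find_out_of_range` and the sample numbers are hypothetical and not part of DeepNER.

```python
from typing import Iterable, List, Tuple


def find_out_of_range(
    batches: Iterable[List[int]], table_size: int
) -> List[Tuple[int, int, int]]:
    """Return (row, position, id) for every id outside [0, table_size)."""
    bad = []
    for row, ids in enumerate(batches):
        for pos, idx in enumerate(ids):
            if idx < 0 or idx >= table_size:
                bad.append((row, pos, idx))
    return bad


# Hypothetical data: a 21128-entry vocabulary and two encoded sentences.
vocab_size = 21128
token_ids = [[101, 2769, 102], [101, 21130, 102]]
problems = find_out_of_range(token_ids, vocab_size)
print(problems)  # [(1, 1, 21130)]
```

The same check can be run on label ids against the number of tags, which would also cover the earlier question about label indices.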

z814081807 commented 3 years ago

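  1. It runs fine without the replacement. The data preprocessing step replaces spaces with [BLANK]; if the vocabulary is not updated, [BLANK] is treated as [UNK] and performance drops slightly.
  2. It is very likely caused by insufficient GPU memory. Try running on a single GPU with a small batch_size to make sure there is enough memory and rule this possibility out.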
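
The vocabulary edit from point 1 could be scripted instead of done by hand. The thread does not say which two [unused] slots should be replaced, so this sketch assumes the first two `[unusedN]` entries; `replace_unused` is a hypothetical helper, not a DeepNER function.

```python
import re
from typing import List, Sequence


def replace_unused(vocab: Sequence[str], new_tokens: Sequence[str]) -> List[str]:
    """Swap the first len(new_tokens) '[unusedN]' entries for new_tokens.

    Line positions are preserved, so every other token keeps its id.
    """
    pending = list(new_tokens)
    out = []
    for tok in vocab:
        if pending and re.fullmatch(r"\[unused\d+\]", tok):
            out.append(pending.pop(0))
        else:
            out.append(tok)
    if pending:
        raise ValueError(f"not enough [unused] slots for {pending}")
    return out


# Toy vocabulary standing in for the lines of vocab.txt.
vocab = ["[PAD]", "[unused1]", "[unused2]", "[unused3]", "[UNK]"]
patched = replace_unused(vocab, ["[INV]", "[BLANK]"])
print(patched)
# ['[PAD]', '[INV]', '[BLANK]', '[unused3]', '[UNK]']
```

Because the vocabulary length does not change, the embedding table of the pretrained model still matches and no resizing is needed.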
yanqiangmiffy commented 3 years ago

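Thanks for the reply. It was indeed caused by GPU OOM; the test machine has 12 GB of GPU memory.

1. max_seq_length does not take effect for some sentences when cut_text splits the text. The longest sentence in my test data is around 150 characters, so when I initially set max_seq_length=120 it caused the following error: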

/opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [2,0,0], thread: [83,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [2,0,0], thread: [84,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [2,0,0], thread: [85,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [2,0,0], thread: [86,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [2,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [2,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [2,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [2,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [2,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [2,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [2,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [2,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [2,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
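
A guard against over-long pieces could look like the sketch below. It assumes character-level tokenization (one token per character) and two special tokens per sequence ([CLS] and [SEP]); `enforce_max_len` is a hypothetical helper, not the repository's cut_text.

```python
from typing import List


def enforce_max_len(
    pieces: List[str], max_seq_len: int, n_special: int = 2
) -> List[str]:
    """Hard-split any piece longer than max_seq_len - n_special characters."""
    limit = max_seq_len - n_special
    if limit <= 0:
        raise ValueError("max_seq_len must exceed the number of special tokens")
    out = []
    for piece in pieces:
        # Pieces already within the limit come out unchanged.
        for start in range(0, len(piece), limit):
            out.append(piece[start:start + limit])
    return out


pieces = ["a" * 150, "b" * 40]
chunks = enforce_max_len(pieces, max_seq_len=120)
print([len(c) for c in chunks])  # [118, 32, 40]
```

A hard split like this can cut through an entity, so the label sequence would have to be split at the same offsets.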

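2. The RuntimeError: CUDA error: device-side assert triggered mentioned above was caused by GPU OOM. Training itself is fine, but the error appears during dev-set evaluation because several models have to be loaded: the first may load successfully, but loading the weights of several models triggers GPU OOM. So training and dev-set evaluation can be run separately, for example with the training call commented out: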

#train(opt, model, train_dataset)

https://github.com/z814081807/DeepNER/blob/8c4abc21676af50ede29dce90bfac4892b36a1c5/main.py#L44

Results of running the dev-set evaluation on its own:

01/09/2021 14:26:07 - INFO - src.preprocess.processor -   Build 4809 features
../../bert/torch_roberta_wwm/config.json
../../bert/torch_roberta_wwm
../../bert/torch_roberta_wwm/vocab.txt
01/09/2021 14:26:10 - INFO - src.preprocess.processor -   Convert 738 examples to features
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
01/09/2021 14:26:11 - INFO - src.preprocess.processor -   Build 738 features
['0']
cuda:0
01/09/2021 14:26:11 - INFO - src.utils.functions_utils -   Load ckpt from ./out/roberta_wwm_wd_crf/checkpoint-602/model.pt
01/09/2021 14:26:14 - INFO - src.utils.functions_utils -   Use single gpu in: ['0']
01/09/2021 14:26:45 - INFO - __main__ -   In step 602:
 [MIRCO] precision: 0.8084, recall: 0.8157, f1: 0.8101
['0']
cuda:0
01/09/2021 14:26:45 - INFO - src.utils.functions_utils -   Load ckpt from ./out/roberta_wwm_wd_crf/checkpoint-1204/model.pt
01/09/2021 14:26:45 - INFO - src.utils.functions_utils -   Use single gpu in: ['0']
01/09/2021 14:27:14 - INFO - __main__ -   In step 1204:
 [MIRCO] precision: 0.8048, recall: 0.8324, f1: 0.8170
01/09/2021 14:27:14 - INFO - __main__ -   Max f1 is: 0.8170151702997139, in step 1204
01/09/2021 14:27:14 - INFO - __main__ -   ./out/roberta_wwm_wd_crf/checkpoint-602已删除
01/09/2021 14:27:14 - INFO - root -   ----------本次容器运行时长:0:01:18-----------

3. In addition, evaluator.py has role_metric = np.zeros([13, 3]), where 13 is the number of entity types. This could be set through a parameter instead, by passing num_labels or len(ENTITY_TYPES).
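
A rough illustration of sizing the metric table from the type list rather than a literal 13 follows. It uses plain lists in place of NumPy and assumes each row holds [tp, fp, fn] counts; the actual row layout in evaluator.py is not shown in this thread, and the entity types and `micro_prf` are hypothetical.

```python
from typing import List, Tuple


def micro_prf(counts: List[List[int]]) -> Tuple[float, float, float]:
    """Micro precision, recall and F1 from per-type [tp, fp, fn] rows."""
    tp = sum(row[0] for row in counts)
    fp = sum(row[1] for row in counts)
    fn = sum(row[2] for row in counts)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical entity types; the table grows or shrinks with this list.
ENTITY_TYPES = ["DRUG", "SYMPTOM", "DOSAGE"]
type2row = {t: i for i, t in enumerate(ENTITY_TYPES)}
role_metric = [[0, 0, 0] for _ in ENTITY_TYPES]

# Made-up counts for two of the types.
role_metric[type2row["DRUG"]] = [8, 2, 2]
role_metric[type2row["SYMPTOM"]] = [4, 0, 4]
p, r, f1 = micro_prf(role_metric)
print(len(role_metric), round(f1, 2))  # 3 0.75
```

With NumPy the equivalent allocation would be np.zeros([len(ENTITY_TYPES), 3]).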