https://github.com/huggingface/transformers/issues/1805#issuecomment-554758144
It's probably because your token embeddings size (vocab size) doesn't match with pre-trained model. Do model.resize_token_embeddings(len(tokenizer)) before training. Please check #1848 and #1849
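For reference, a minimal sketch of the suggested fix, assuming a standard transformers BertTokenizer/BertModel setup (the [INV]/[BLANK] tokens are the ones discussed below; if they are written directly into the [unused] slots of vocab.txt instead, the vocab size does not change and no resize is needed):

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("../../bert/torch_roberta_wwm")
model = BertModel.from_pretrained("../../bert/torch_roberta_wwm")

# Adding tokens on top of the original vocab enlarges it, so the embedding
# matrix must be resized to match, otherwise lookups can go out of range.
tokenizer.add_special_tokens({"additional_special_tokens": ["[INV]", "[BLANK]"]})
model.resize_token_embeddings(len(tokenizer))
```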
Could this be caused by a vocabulary mismatch?
- (Note: the two [unused] tokens in vocab.txt need to be manually replaced with [INV] and [BLANK].) Is this replacement actually required? Does skipping it cause an error?
- Training itself runs without problems, but during crf_evaluation the call to load_model_and_parallel fails. I have retried several times and the error message varies; there are three variants. First:
```
/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1676, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
01/09/2021 11:03:23 - INFO - wandb.internal.internal - Internal process exited
```
Second:
```
anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1678, in linear
    output += bias
RuntimeError: CUDA error: device-side assert triggered
01/09/2021 10:21:48 - INFO - wandb.internal.internal - Internal process exited
```
Third:
```
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
RuntimeError: CUDA error: device-side assert triggered
```
I Googled this error. The most common answer is that a label index is out of range; since my dataset is not the Tianchi one, would it cause this error if the annotations contain no S tag (BMES scheme)? The other common answer is GPU OOM; can that happen when running on a single GPU? The log is below:
```
01/09/2021 11:03:13 - INFO - src.utils.trainer - Saving model & optimizer & scheduler checkpoint to ./out/roberta_wwm_wd_crf/checkpoint-1005
01/09/2021 11:03:16 - INFO - src.utils.functions_utils - Load model from ./out/roberta_wwm_wd_crf/checkpoint-603/model.pt
01/09/2021 11:03:17 - INFO - src.utils.functions_utils - Load model from ./out/roberta_wwm_wd_crf/checkpoint-804/model.pt
01/09/2021 11:03:17 - INFO - src.utils.functions_utils - Load model from ./out/roberta_wwm_wd_crf/checkpoint-1005/model.pt
01/09/2021 11:03:18 - INFO - src.utils.functions_utils - Save swa model in: ./out/roberta_wwm_wd_crf/checkpoint-100000
01/09/2021 11:03:21 - INFO - src.utils.trainer - Train done
../../bert/torch_roberta_wwm/vocab.txt
01/09/2021 11:03:21 - INFO - src.preprocess.processor - Convert 738 examples to features
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
01/09/2021 11:03:22 - INFO - src.preprocess.processor - Build 738 features
['0']
cuda:0
01/09/2021 11:03:22 - INFO - src.utils.functions_utils - Load ckpt from ./out/roberta_wwm_wd_crf/checkpoint-201/model.pt
01/09/2021 11:03:23 - INFO - src.utils.functions_utils - Use single gpu in: ['0']
Traceback (most recent call last):
  File "main.py", line 215, in <module>
    training(args)
  File "main.py", line 136, in training
    train_base(opt, train_examples, dev_examples)
  File "main.py", line 78, in train_base
    tmp_metric_str, tmp_f1 = crf_evaluation(model, dev_info, device, ent2id)
  File "/home/quincyqiang/Projects/Water-Conservancy-KG/DeepNER/src/utils/evaluator.py", line 150, in crf_evaluation
    for tmp_pred in get_base_out(model, dev_loader, device):
  File "/home/quincyqiang/Projects/Water-Conservancy-KG/DeepNER/src/utils/evaluator.py", line 22, in get_base_out
```
From the log above, checkpoint-603, checkpoint-804, etc. are loaded first, and then checkpoint-201 is evaluated below. Could those earlier loads leave the GPU without enough memory for the later evaluation?
Thanks for the reply. The problem was indeed GPU OOM; the test machine has a 12 GB GPU.
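(For reference, a quick way to confirm how full the card is right before evaluation starts; this is just a hedged sketch, not part of the repo's code.)

```python
import torch

# Print currently allocated / reserved GPU memory in GiB just before the
# evaluation loop starts, to confirm the 12 GB card is already nearly full.
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB, "
      f"reserved: {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
```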
1. max_seq_length / cut_text: sentence splitting does not kick in for some sentences. The longest sentence in my test data is around 150 characters, so when I initially set max_seq_length=120 I got the error below (a small debugging sketch follows the log):
```
/opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [2,0,0], thread: [83,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
... (the same assertion repeats for threads [84,0,0] through [95,0,0]) ...
```
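Since `srcIndex < srcSelectDimSize` generally means an embedding lookup received an out-of-range index, one way to narrow this down is to check the encoded inputs directly. A hedged sketch, not the repo's code; `texts` is a placeholder for the sentences produced by cut_text:

```python
from transformers import BertConfig, BertTokenizer

# Paths and the max length mirror the setup described in this thread.
tokenizer = BertTokenizer.from_pretrained("../../bert/torch_roberta_wwm")
config = BertConfig.from_pretrained("../../bert/torch_roberta_wwm")
max_seq_length = 120

for text in texts:  # `texts`: sentences coming out of cut_text (placeholder name)
    ids = tokenizer(text)["input_ids"]  # encode without truncation to see true lengths
    if len(ids) > max_seq_length:
        print(f"cut_text left an over-length sentence ({len(ids)} tokens): {text[:50]}...")
    # Token ids outside the vocab would also make the embedding lookup assert.
    assert max(ids) < config.vocab_size, f"token id {max(ids)} is outside the vocab"
```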
2. The RuntimeError: CUDA error: device-side assert triggered mentioned above was caused by GPU OOM. Training runs fine, but the dev-set evaluation triggers this error because it needs to load several models: the first may load successfully, but loading the weights of several models causes GPU OOM. So training and evaluation can be run separately by commenting out the training call:
```python
#train(opt, model, train_dataset)
```
https://github.com/z814081807/DeepNER/blob/8c4abc21676af50ede29dce90bfac4892b36a1c5/main.py#L44
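Alternatively, if evaluation has to iterate over several checkpoints in one process, explicitly freeing each model before loading the next keeps a 12 GB card from holding multiple copies of the weights. A hedged sketch: the function names follow the traceback above, but their exact signatures and the surrounding variables (build_model, dev_info, ent2id) are assumptions, not the repo's actual code:

```python
import gc
import torch

checkpoints = [
    "./out/roberta_wwm_wd_crf/checkpoint-602/model.pt",
    "./out/roberta_wwm_wd_crf/checkpoint-1204/model.pt",
]

for ckpt in checkpoints:
    # load_model_and_parallel / crf_evaluation as named in the traceback above;
    # build_model, dev_info and ent2id are placeholders for the real objects.
    model, device = load_model_and_parallel(build_model(opt), "0", ckpt)
    metric_str, f1 = crf_evaluation(model, dev_info, device, ent2id)

    # Release the weights before loading the next checkpoint.
    del model
    gc.collect()
    torch.cuda.empty_cache()
```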
Results when running the dev-set evaluation on its own:
```
01/09/2021 14:26:07 - INFO - src.preprocess.processor - Build 4809 features
../../bert/torch_roberta_wwm/config.json
../../bert/torch_roberta_wwm
../../bert/torch_roberta_wwm/vocab.txt
01/09/2021 14:26:10 - INFO - src.preprocess.processor - Convert 738 examples to features
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
01/09/2021 14:26:11 - INFO - src.preprocess.processor - Build 738 features
['0']
cuda:0
01/09/2021 14:26:11 - INFO - src.utils.functions_utils - Load ckpt from ./out/roberta_wwm_wd_crf/checkpoint-602/model.pt
01/09/2021 14:26:14 - INFO - src.utils.functions_utils - Use single gpu in: ['0']
01/09/2021 14:26:45 - INFO - __main__ - In step 602:
[MIRCO] precision: 0.8084, recall: 0.8157, f1: 0.8101
['0']
cuda:0
01/09/2021 14:26:45 - INFO - src.utils.functions_utils - Load ckpt from ./out/roberta_wwm_wd_crf/checkpoint-1204/model.pt
01/09/2021 14:26:45 - INFO - src.utils.functions_utils - Use single gpu in: ['0']
01/09/2021 14:27:14 - INFO - __main__ - In step 1204:
[MIRCO] precision: 0.8048, recall: 0.8324, f1: 0.8170
01/09/2021 14:27:14 - INFO - __main__ - Max f1 is: 0.8170151702997139, in step 1204
01/09/2021 14:27:14 - INFO - __main__ - ./out/roberta_wwm_wd_crf/checkpoint-602 deleted
01/09/2021 14:27:14 - INFO - root - ---------- Container run time: 0:01:18 ----------
```
3. Also, in evaluator.py, role_metric = np.zeros([13, 3]) hard-codes 13 as the number of entity types. This could be made configurable, e.g. by passing num_labels or len(ENTITY_TYPES) as an argument (a minimal sketch follows).
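A minimal sketch of that change, assuming ENTITY_TYPES is the list of entity labels already defined in the project (the three columns keep the same meaning as in the original np.zeros([13, 3])):

```python
import numpy as np

def init_role_metric(entity_types):
    # One row of three counters per entity type, instead of a hard-coded 13.
    return np.zeros([len(entity_types), 3])

role_metric = init_role_metric(ENTITY_TYPES)
```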