xhw205 / GPLinker_torch

CMeIE/CBLUE/CHIP/entity-relation extraction/SPO extraction
207 stars, 14 forks

RuntimeError: CUDA error: device-side assert triggered #14

Closed. dr-GitHub-account closed this issue 1 year ago

dr-GitHub-account commented 1 year ago

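First of all, thanks for sharing the code!

I got the CMeIE dataset running, but after switching to a dataset of my own I get the following error: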

Traceback (most recent call last):
  File "train.py", line 213, in <module>
    loss3 = sparse_multilabel_categorical_crossentropy(y_true=batch_tail_labels, y_pred=logits3, mask_zero=True)
  File "GPLinker_torch/nets/gpNet.py", line 40, in sparse_multilabel_categorical_crossentropy
    loss = torch.mean(torch.sum(pos_loss + neg_loss))
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [0,0,0], thread: [1,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
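The suggestion in the error message can be applied from inside the script as well as from the shell. A minimal sketch, assuming the variable is set before any CUDA work starts (putting it at the very top of train.py, before `import torch`, is the cautious choice):

```python
import os

# Make CUDA kernel launches synchronous so the Python traceback points at
# the call that actually failed rather than at a later, unrelated line.
# This needs to be set before CUDA is initialized in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# ... the rest of train.py (import torch, build the model, etc.) follows here
```

The shell equivalent is `CUDA_LAUNCH_BLOCKING=1 python train.py`. Running the failing batch on the CPU is another option that usually produces a more readable index error.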

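What could be causing this? Has anyone run into something similar? Judging from the traceback, the problem may be in the loss computation. I added some prints: training runs for a few batches, and then one batch suddenly raises this error. For the failing batch, everything up to the line just before the loss computation, logits1, logits2, logits3 = net(batch_token_ids, batch_mask_ids, batch_token_type_ids), runs normally. Some posts online say this kind of error comes from an out-of-bounds index, but the shapes all look normal: logits1 is (batch_size, 2, sequence_length, sequence_length), logits2 is (batch_size, len(schema), sequence_length, sequence_length), and logits3 is (batch_size, len(schema), sequence_length, sequence_length).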
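Correct logits shapes do not rule out an out-of-bounds index, because the failing assertion is in the gather kernel, which indexes the logits with values taken from the labels. The sketch below checks the label positions themselves. It assumes the loss flattens each (start, end) pair into `start * seq_len + end` and gathers from scores reshaped to `seq_len * seq_len`, as GlobalPointer-style sparse losses commonly do; this has not been verified against gpNet.py. `find_out_of_range_pairs` is a hypothetical helper that works on nested lists, for example the result of `batch_tail_labels.tolist()`.

```python
from typing import List, Sequence, Tuple


def find_out_of_range_pairs(
    labels: Sequence[Sequence[Sequence[Sequence[int]]]],
    seq_len: int,
) -> List[Tuple[int, int, int, int]]:
    """Return (sample, head, start, end) for every label pair whose start or
    end position lies outside [0, seq_len).

    `labels` is assumed to be laid out as [batch][head][k] -> (start, end).
    """
    bad = []
    for b, sample in enumerate(labels):
        for h, pairs in enumerate(sample):
            for start, end in pairs:
                in_range = 0 <= start < seq_len and 0 <= end < seq_len
                if not in_range:
                    bad.append((b, h, start, end))
    return bad


# Toy batch: 1 sample, 2 heads, sequence length 8.
# (9, 2) flattens to 9 * 8 + 2 = 74, which is past the 64 available slots.
toy_labels = [[[(1, 3), (0, 0)], [(9, 2), (0, 0)]]]
print(find_out_of_range_pairs(toy_labels, seq_len=8))  # prints [(0, 1, 9, 2)]
```

One common way to end up with such pairs is a text that gets truncated to a maximum length while its entity positions still refer to the untruncated text. That would also explain why only some batches fail.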

xhw205 commented 1 year ago

RuntimeError: CUDA error: device-side assert triggered is usually a data problem: the predictions and the labels do not line up. For example, the model predicts 3 classes, but the labels contain 4.
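A quick way to test for this kind of mismatch is to compare the relation types used in the data against the schema before training. The sketch below assumes CMeIE-style records, where each sample has an `spo_list` whose items carry a `predicate` field, and a schema given as a predicate-to-id mapping. `find_unknown_predicates` and the toy relation names are hypothetical and not part of the repository.

```python
from typing import Dict, Iterable, List


def find_unknown_predicates(records: Iterable[dict], schema: Dict[str, int]) -> List[str]:
    """Return predicates that occur in the data but have no id in the schema."""
    unknown = set()
    for record in records:
        for spo in record.get("spo_list", []):
            if spo["predicate"] not in schema:
                unknown.add(spo["predicate"])
    return sorted(unknown)


# Toy data: the schema defines 3 relations, but the labels use a 4th one.
schema = {"treats": 0, "causes": 1, "symptom_of": 2}
records = [
    {"text": "sample one", "spo_list": [{"predicate": "treats"}]},
    {"text": "sample two", "spo_list": [{"predicate": "prevents"}]},
]
print(find_unknown_predicates(records, schema))  # prints ['prevents']
```

An empty result means every labeled relation has a slot in the model output. Anything else points at samples whose labels the model has no class for.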

dr-GitHub-account commented 1 year ago

Thanks! I'll go back and check my data.