yeyupiaoling / PPASR

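End-to-end Chinese speech recognition implemented with PaddlePaddle, from beginner level to production use: very simple introductory examples and highly practical enterprise projects. Supports the currently most popular models: DeepSpeech2, Conformer, and Squeezeformer.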
Apache License 2.0
805 stars 128 forks

Error after starting training #185

Open account15222 opened 1 month ago

account15222 commented 1 month ago

The log output is below. The CUDA and cuDNN versions have been matched to each other: paddlepaddle-gpu is 2.4.2 with CUDA 11.6.

[2024-08-14 12:57:43 INFO ] trainer:train:545 - 训练数据:75118
[2024-08-14 12:57:44 INFO ] trainer:train_epoch:409 - Train epoch: [1/200], batch: [0/9389], loss: 39.20464, learning_rate: 0.00000008, reader_cost: 0.5070, batch_cost: 0.5174, ips: 7.8093 speech/sec, eta: 22 days, 6:20:58
[2024-08-14 12:57:55 INFO ] trainer:train_epoch:409 - Train epoch: [1/200], batch: [100/9389], loss: 53.33932, learning_rate: 0.00000108, reader_cost: 0.0001, batch_cost: 0.1034, ips: 77.2359 speech/sec, eta: 2 days, 6:01:26
[2024-08-14 12:58:06 INFO ] trainer:train_epoch:409 - Train epoch: [1/200], batch: [200/9389], loss: 50.53806, learning_rate: 0.00000208, reader_cost: 0.0002, batch_cost: 0.1179, ips: 67.7494 speech/sec, eta: 2 days, 13:35:05
[2024-08-14 12:58:18 INFO ] trainer:train_epoch:409 - Train epoch: [1/200], batch: [300/9389], loss: 47.89396, learning_rate: 0.00000308, reader_cost: 0.0002, batch_cost: 0.1149, ips: 69.5460 speech/sec, eta: 2 days, 11:59:23
[2024-08-14 12:58:29 INFO ] trainer:train_epoch:409 - Train epoch: [1/200], batch: [400/9389], loss: 47.97786, learning_rate: 0.00000408, reader_cost: 0.0001, batch_cost: 0.1057, ips: 75.5520 speech/sec, eta: 2 days, 7:13:02
[2024-08-14 12:58:40 INFO ] trainer:train_epoch:409 - Train epoch: [1/200], batch: [500/9389], loss: 46.66723, learning_rate: 0.00000508, reader_cost: 0.0002, batch_cost: 0.1130, ips: 70.6706 speech/sec, eta: 2 days, 11:01:38
[2024-08-14 12:58:52 INFO ] trainer:train_epoch:409 - Train epoch: [1/200], batch: [600/9389], loss: 47.74996, learning_rate: 0.00000608, reader_cost: 0.0002, batch_cost: 0.1165, ips: 68.5743 speech/sec, eta: 2 days, 12:49:39
[2024-08-14 12:59:02 INFO ] trainer:train_epoch:409 - Train epoch: [1/200], batch: [700/9389], loss: 46.63397, learning_rate: 0.00000708, reader_cost: 0.0001, batch_cost: 0.1058, ips: 75.5091 speech/sec, eta: 2 days, 7:14:15
Traceback (most recent call last):
  File "train.py", line 22, in
    trainer.train(save_model_path=args.save_model_path,
  File "/data/Speech/PPASR/ppasr/trainer.py", line 568, in train
    self.train_epoch(epoch_id=epoch_id, save_model_path=save_model_path, writer=writer, nranks=nranks)
  File "/data/Speech/PPASR/ppasr/trainer.py", line 381, in train_epoch
    loss.backward()
  File "/root/anaconda3/envs/asr/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/root/anaconda3/envs/asr/lib/python3.8/site-packages/paddle/fluid/wrapped_decorator.py", line 26, in impl__
    return wrapped_func(*args, **kwargs)
  File "/root/anaconda3/envs/asr/lib/python3.8/site-packages/paddle/fluid/framework.py", line 534, in impl__
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/asr/lib/python3.8/site-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 297, in backward
    core.eager.run_backward([self], grad_tensor, retain_graph)
OSError: (External) CUDNN error(14), CUDNN_STATUS_VERSION_MISMATCH.
  [Hint: Please search for the error code(14) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /paddle/paddle/phi/kernels/gpudnn/softmax_gpudnn.h:888)

yeyupiaoling commented 1 month ago

@account15222 I'm not sure whether this is caused by insufficient GPU memory. You could try reducing the batch size and see if that helps.