yeyupiaoling / PPASR

End-to-end Chinese speech recognition implemented with PaddlePaddle, from getting started to real-world use: simple introductory examples and practical enterprise projects. Supports the currently most popular DeepSpeech2, Conformer, and Squeezeformer models.
Apache License 2.0

CUDNN error(8), CUDNN_STATUS_EXECUTION_FAILED suddenly appears partway through training #153

Closed navy7913 closed 1 year ago

navy7913 commented 1 year ago

Hello, partway through training the deepspeech2 model, the following error suddenly appears. OS: Win10; GPU: 3060 Ti; CUDA: 11.8; cuDNN: 8.4

[2023-03-25 02:34:03 INFO ] trainer:train:512 - Test epoch: 28, time/epoch: 0:42:45.056698, loss: 12.85140, cer: 1.00000, best cer: 1.00000
[2023-03-25 02:34:03 INFO ] trainer:train:515 - ======================================================================
[2023-03-25 02:34:03 INFO ] trainer:save_checkpoint:271 - Saved model: models/deepspeech2_streaming_fbank\best_model
[2023-03-25 02:34:04 INFO ] trainer:save_checkpoint:271 - Saved model: models/deepspeech2_streaming_fbank\epoch_28
[2023-03-25 02:34:05 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [0/9238], loss: 12.00028, learning_rate: 0.00015544, reader_cost: 0.3142, batch_cost: 0.1202, ips: 36.8387 speech/sec, eta: 7 days, 23:41:56
[2023-03-25 02:34:30 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [100/9238], loss: 12.32619, learning_rate: 0.00015541, reader_cost: 0.1042, batch_cost: 0.1467, ips: 64.2458 speech/sec, eta: 4 days, 13:54:49
[2023-03-25 02:34:55 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [200/9238], loss: 11.88217, learning_rate: 0.00015538, reader_cost: 0.1034, batch_cost: 0.1486, ips: 63.2113 speech/sec, eta: 4 days, 15:42:19
[2023-03-25 02:35:20 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [300/9238], loss: 12.71910, learning_rate: 0.00015535, reader_cost: 0.1026, batch_cost: 0.1486, ips: 64.1425 speech/sec, eta: 4 days, 14:04:36
[2023-03-25 02:35:46 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [400/9238], loss: 12.36381, learning_rate: 0.00015532, reader_cost: 0.1027, batch_cost: 0.1498, ips: 62.3330 speech/sec, eta: 4 days, 17:15:54
[2023-03-25 02:36:11 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [500/9238], loss: 13.30369, learning_rate: 0.00015529, reader_cost: 0.1024, batch_cost: 0.1503, ips: 63.1759 speech/sec, eta: 4 days, 15:44:49
[2023-03-25 02:36:36 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [600/9238], loss: 12.51611, learning_rate: 0.00015526, reader_cost: 0.1025, batch_cost: 0.1499, ips: 63.8509 speech/sec, eta: 4 days, 14:33:31
[2023-03-25 02:37:01 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [700/9238], loss: 12.32817, learning_rate: 0.00015523, reader_cost: 0.1029, batch_cost: 0.1494, ips: 63.4647 speech/sec, eta: 4 days, 15:13:27
[2023-03-25 02:37:27 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [800/9238], loss: 12.73769, learning_rate: 0.00015520, reader_cost: 0.1033, batch_cost: 0.1493, ips: 62.8908 speech/sec, eta: 4 days, 16:13:56
[2023-03-25 02:37:52 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [900/9238], loss: 12.42688, learning_rate: 0.00015517, reader_cost: 0.1036, batch_cost: 0.1490, ips: 63.1540 speech/sec, eta: 4 days, 15:45:27
[2023-03-25 02:38:18 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [1000/9238], loss: 12.20085, learning_rate: 0.00015514, reader_cost: 0.1036, batch_cost: 0.1495, ips: 62.2785 speech/sec, eta: 4 days, 17:19:17
[2023-03-25 02:38:43 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [1100/9238], loss: 12.36750, learning_rate: 0.00015511, reader_cost: 0.1040, batch_cost: 0.1494, ips: 62.1926 speech/sec, eta: 4 days, 17:28:14
[2023-03-25 02:39:09 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [1200/9238], loss: 12.40964, learning_rate: 0.00015508, reader_cost: 0.1043, batch_cost: 0.1491, ips: 63.3021 speech/sec, eta: 4 days, 15:28:30
[2023-03-25 02:39:34 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [1300/9238], loss: 12.28406, learning_rate: 0.00015505, reader_cost: 0.1048, batch_cost: 0.1484, ips: 63.7373 speech/sec, eta: 4 days, 14:42:24
[2023-03-25 02:40:00 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [1400/9238], loss: 11.77928, learning_rate: 0.00015502, reader_cost: 0.1052, batch_cost: 0.1485, ips: 61.6848 speech/sec, eta: 4 days, 18:23:00
[2023-03-25 02:40:26 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [1500/9238], loss: 12.36194, learning_rate: 0.00015499, reader_cost: 0.1055, batch_cost: 0.1485, ips: 61.7726 speech/sec, eta: 4 days, 18:12:48
[2023-03-25 02:40:52 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [1600/9238], loss: 12.82302, learning_rate: 0.00015496, reader_cost: 0.1059, batch_cost: 0.1489, ips: 60.2808 speech/sec, eta: 4 days, 21:01:57
[2023-03-25 02:41:18 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [1700/9238], loss: 11.54686, learning_rate: 0.00015493, reader_cost: 0.1061, batch_cost: 0.1489, ips: 61.6313 speech/sec, eta: 4 days, 18:27:39
[2023-03-25 02:41:44 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [1800/9238], loss: 12.15347, learning_rate: 0.00015490, reader_cost: 0.1065, batch_cost: 0.1488, ips: 61.6937 speech/sec, eta: 4 days, 18:20:16
[2023-03-25 02:42:10 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [1900/9238], loss: 12.45951, learning_rate: 0.00015487, reader_cost: 0.1068, batch_cost: 0.1486, ips: 61.7138 speech/sec, eta: 4 days, 18:17:36
[2023-03-25 02:42:37 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [2000/9238], loss: 13.57224, learning_rate: 0.00015485, reader_cost: 0.1071, batch_cost: 0.1489, ips: 59.9899 speech/sec, eta: 4 days, 21:34:14
[2023-03-25 02:43:03 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [2100/9238], loss: 11.86536, learning_rate: 0.00015482, reader_cost: 0.1074, batch_cost: 0.1491, ips: 60.1560 speech/sec, eta: 4 days, 21:14:18
[2023-03-25 02:43:29 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [2200/9238], loss: 12.96084, learning_rate: 0.00015479, reader_cost: 0.1076, batch_cost: 0.1491, ips: 61.3158 speech/sec, eta: 4 days, 19:00:49
[2023-03-25 02:43:56 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [2300/9238], loss: 11.57289, learning_rate: 0.00015476, reader_cost: 0.1079, batch_cost: 0.1492, ips: 60.3927 speech/sec, eta: 4 days, 20:45:51
[2023-03-25 02:44:22 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [2400/9238], loss: 12.39834, learning_rate: 0.00015473, reader_cost: 0.1081, batch_cost: 0.1491, ips: 61.5584 speech/sec, eta: 4 days, 18:32:46
[2023-03-25 02:44:48 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [2500/9238], loss: 11.60647, learning_rate: 0.00015470, reader_cost: 0.1082, batch_cost: 0.1490, ips: 62.3267 speech/sec, eta: 4 days, 17:07:36
[2023-03-25 02:45:14 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [2600/9238], loss: 11.43141, learning_rate: 0.00015467, reader_cost: 0.1083, batch_cost: 0.1490, ips: 61.2061 speech/sec, eta: 4 days, 19:11:27
[2023-03-25 02:45:41 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [2700/9238], loss: 12.72747, learning_rate: 0.00015464, reader_cost: 0.1085, batch_cost: 0.1492, ips: 59.8622 speech/sec, eta: 4 days, 21:46:09
[2023-03-25 02:46:07 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [2800/9238], loss: 11.99560, learning_rate: 0.00015461, reader_cost: 0.1087, batch_cost: 0.1491, ips: 61.3276 speech/sec, eta: 4 days, 18:56:53
[2023-03-25 02:46:33 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [2900/9238], loss: 12.66451, learning_rate: 0.00015458, reader_cost: 0.1089, batch_cost: 0.1492, ips: 60.4142 speech/sec, eta: 4 days, 20:40:43
[2023-03-25 02:46:59 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [3000/9238], loss: 12.24936, learning_rate: 0.00015455, reader_cost: 0.1091, batch_cost: 0.1491, ips: 60.9399 speech/sec, eta: 4 days, 19:39:53
[2023-03-25 02:47:26 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [3100/9238], loss: 12.99915, learning_rate: 0.00015452, reader_cost: 0.1092, batch_cost: 0.1493, ips: 59.5782 speech/sec, eta: 4 days, 22:18:03
[2023-03-25 02:47:52 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [3200/9238], loss: 12.83142, learning_rate: 0.00015449, reader_cost: 0.1093, batch_cost: 0.1493, ips: 61.6323 speech/sec, eta: 4 days, 18:21:03
[2023-03-25 02:48:18 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [3300/9238], loss: 12.65654, learning_rate: 0.00015446, reader_cost: 0.1094, batch_cost: 0.1491, ips: 62.2218 speech/sec, eta: 4 days, 17:15:37
[2023-03-25 02:48:45 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [3400/9238], loss: 12.45209, learning_rate: 0.00015443, reader_cost: 0.1095, batch_cost: 0.1494, ips: 58.7600 speech/sec, eta: 4 days, 23:55:31
[2023-03-25 02:49:11 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [3500/9238], loss: 12.64452, learning_rate: 0.00015440, reader_cost: 0.1096, batch_cost: 0.1493, ips: 62.1244 speech/sec, eta: 4 days, 17:25:25
[2023-03-25 02:49:37 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [3600/9238], loss: 12.23901, learning_rate: 0.00015437, reader_cost: 0.1097, batch_cost: 0.1492, ips: 61.8231 speech/sec, eta: 4 days, 17:58:09
[2023-03-25 02:50:02 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [3700/9238], loss: 12.94510, learning_rate: 0.00015434, reader_cost: 0.1098, batch_cost: 0.1490, ips: 62.6774 speech/sec, eta: 4 days, 16:24:31
[2023-03-25 02:50:27 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [3800/9238], loss: 12.81586, learning_rate: 0.00015431, reader_cost: 0.1099, batch_cost: 0.1486, ips: 64.2239 speech/sec, eta: 4 days, 13:41:42
[2023-03-25 02:50:54 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [3900/9238], loss: 11.92724, learning_rate: 0.00015428, reader_cost: 0.1101, batch_cost: 0.1486, ips: 60.3503 speech/sec, eta: 4 days, 20:43:43
[2023-03-25 02:51:20 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [4000/9238], loss: 13.01337, learning_rate: 0.00015425, reader_cost: 0.1102, batch_cost: 0.1486, ips: 61.0308 speech/sec, eta: 4 days, 19:25:11
[2023-03-25 02:51:46 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [4100/9238], loss: 13.19012, learning_rate: 0.00015423, reader_cost: 0.1103, batch_cost: 0.1484, ips: 62.6773 speech/sec, eta: 4 days, 16:22:50
[2023-03-25 02:52:12 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [4200/9238], loss: 12.53538, learning_rate: 0.00015420, reader_cost: 0.1103, batch_cost: 0.1485, ips: 59.9955 speech/sec, eta: 4 days, 21:23:48
[2023-03-25 02:52:38 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [4300/9238], loss: 13.05978, learning_rate: 0.00015417, reader_cost: 0.1104, batch_cost: 0.1485, ips: 61.5932 speech/sec, eta: 4 days, 18:20:39
[2023-03-25 02:53:04 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [4400/9238], loss: 12.26651, learning_rate: 0.00015414, reader_cost: 0.1104, batch_cost: 0.1485, ips: 61.7117 speech/sec, eta: 4 days, 18:07:02
[2023-03-25 02:53:29 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [4500/9238], loss: 11.74872, learning_rate: 0.00015411, reader_cost: 0.1105, batch_cost: 0.1483, ips: 63.1397 speech/sec, eta: 4 days, 15:31:46
[2023-03-25 02:53:55 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [4600/9238], loss: 11.71583, learning_rate: 0.00015408, reader_cost: 0.1105, batch_cost: 0.1482, ips: 63.3987 speech/sec, eta: 4 days, 15:04:00
[2023-03-25 02:54:21 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [4700/9238], loss: 12.22635, learning_rate: 0.00015405, reader_cost: 0.1106, batch_cost: 0.1482, ips: 60.7744 speech/sec, eta: 4 days, 19:51:19
[2023-03-25 02:54:48 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [4800/9238], loss: 12.43849, learning_rate: 0.00015402, reader_cost: 0.1107, batch_cost: 0.1482, ips: 60.2300 speech/sec, eta: 4 days, 20:53:43
[2023-03-25 02:55:14 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [4900/9238], loss: 14.07115, learning_rate: 0.00015399, reader_cost: 0.1108, batch_cost: 0.1483, ips: 60.1452 speech/sec, eta: 4 days, 21:03:10
[2023-03-25 02:55:41 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [5000/9238], loss: 12.65611, learning_rate: 0.00015396, reader_cost: 0.1109, batch_cost: 0.1483, ips: 60.3302 speech/sec, eta: 4 days, 20:41:10
[2023-03-25 02:56:08 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [5100/9238], loss: 12.31212, learning_rate: 0.00015393, reader_cost: 0.1110, batch_cost: 0.1485, ips: 57.6949 speech/sec, eta: 5 days, 2:00:30
[2023-03-25 02:56:35 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [5200/9238], loss: 12.31747, learning_rate: 0.00015390, reader_cost: 0.1111, batch_cost: 0.1486, ips: 60.4494 speech/sec, eta: 4 days, 20:26:30
[2023-03-25 02:57:02 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [5300/9238], loss: 11.13073, learning_rate: 0.00015387, reader_cost: 0.1112, batch_cost: 0.1487, ips: 58.9833 speech/sec, eta: 4 days, 23:19:42
[2023-03-25 02:57:29 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [5400/9238], loss: 12.73579, learning_rate: 0.00015385, reader_cost: 0.1113, batch_cost: 0.1488, ips: 58.8641 speech/sec, eta: 4 days, 23:33:44
[2023-03-25 02:57:56 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [5500/9238], loss: 12.55455, learning_rate: 0.00015382, reader_cost: 0.1113, batch_cost: 0.1488, ips: 60.3368 speech/sec, eta: 4 days, 20:38:12
[2023-03-25 02:58:22 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [5600/9238], loss: 13.17403, learning_rate: 0.00015379, reader_cost: 0.1114, batch_cost: 0.1487, ips: 61.6573 speech/sec, eta: 4 days, 18:07:54
[2023-03-25 02:58:48 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [5700/9238], loss: 13.19934, learning_rate: 0.00015376, reader_cost: 0.1115, batch_cost: 0.1487, ips: 60.3802 speech/sec, eta: 4 days, 20:32:18
[2023-03-25 02:59:14 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [5800/9238], loss: 11.77936, learning_rate: 0.00015373, reader_cost: 0.1115, batch_cost: 0.1487, ips: 61.2391 speech/sec, eta: 4 days, 18:53:47
[2023-03-25 02:59:42 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [5900/9238], loss: 12.51447, learning_rate: 0.00015370, reader_cost: 0.1116, batch_cost: 0.1488, ips: 58.9146 speech/sec, eta: 4 days, 23:25:20
[2023-03-25 03:00:08 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [6000/9238], loss: 11.79191, learning_rate: 0.00015367, reader_cost: 0.1117, batch_cost: 0.1488, ips: 61.0895 speech/sec, eta: 4 days, 19:09:47
[2023-03-25 03:00:34 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [6100/9238], loss: 12.18262, learning_rate: 0.00015364, reader_cost: 0.1117, batch_cost: 0.1488, ips: 59.8801 speech/sec, eta: 4 days, 21:28:54
[2023-03-25 03:01:02 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [6200/9238], loss: 13.46445, learning_rate: 0.00015361, reader_cost: 0.1118, batch_cost: 0.1489, ips: 58.9350 speech/sec, eta: 4 days, 23:21:30
[2023-03-25 03:01:28 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [6300/9238], loss: 12.91749, learning_rate: 0.00015358, reader_cost: 0.1119, batch_cost: 0.1489, ips: 61.6605 speech/sec, eta: 4 days, 18:04:31
[2023-03-25 03:01:54 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [6400/9238], loss: 12.46080, learning_rate: 0.00015355, reader_cost: 0.1119, batch_cost: 0.1489, ips: 59.9169 speech/sec, eta: 4 days, 21:23:15
[2023-03-25 03:02:21 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [6500/9238], loss: 13.84410, learning_rate: 0.00015353, reader_cost: 0.1120, batch_cost: 0.1490, ips: 59.2519 speech/sec, eta: 4 days, 22:41:50
[2023-03-25 03:02:47 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [6600/9238], loss: 13.03474, learning_rate: 0.00015350, reader_cost: 0.1121, batch_cost: 0.1489, ips: 61.2842 speech/sec, eta: 4 days, 18:45:14
[2023-03-25 03:03:14 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [6700/9238], loss: 12.38390, learning_rate: 0.00015347, reader_cost: 0.1121, batch_cost: 0.1489, ips: 60.6936 speech/sec, eta: 4 days, 19:51:47
[2023-03-25 03:03:40 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [6800/9238], loss: 12.01420, learning_rate: 0.00015344, reader_cost: 0.1123, batch_cost: 0.1488, ips: 60.2328 speech/sec, eta: 4 days, 20:44:32
[2023-03-25 03:04:07 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [6900/9238], loss: 13.35949, learning_rate: 0.00015341, reader_cost: 0.1123, batch_cost: 0.1489, ips: 59.1250 speech/sec, eta: 4 days, 22:55:19
[2023-03-25 03:04:34 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [7000/9238], loss: 12.10045, learning_rate: 0.00015338, reader_cost: 0.1124, batch_cost: 0.1489, ips: 59.3164 speech/sec, eta: 4 days, 22:31:51
[2023-03-25 03:05:01 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [7100/9238], loss: 13.52868, learning_rate: 0.00015335, reader_cost: 0.1124, batch_cost: 0.1490, ips: 59.2911 speech/sec, eta: 4 days, 22:34:26
[2023-03-25 03:05:28 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [7200/9238], loss: 13.71923, learning_rate: 0.00015332, reader_cost: 0.1125, batch_cost: 0.1491, ips: 59.7265 speech/sec, eta: 4 days, 21:42:07
[2023-03-25 03:05:55 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [7300/9238], loss: 12.14063, learning_rate: 0.00015329, reader_cost: 0.1125, batch_cost: 0.1491, ips: 59.7916 speech/sec, eta: 4 days, 21:33:59
[2023-03-25 03:06:22 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [7400/9238], loss: 12.54584, learning_rate: 0.00015327, reader_cost: 0.1126, batch_cost: 0.1491, ips: 59.7929 speech/sec, eta: 4 days, 21:33:23
[2023-03-25 03:06:48 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [7500/9238], loss: 11.46366, learning_rate: 0.00015324, reader_cost: 0.1126, batch_cost: 0.1491, ips: 60.6510 speech/sec, eta: 4 days, 19:53:09
[2023-03-25 03:07:14 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [7600/9238], loss: 11.20753, learning_rate: 0.00015321, reader_cost: 0.1126, batch_cost: 0.1491, ips: 61.6685 speech/sec, eta: 4 days, 17:58:00
[2023-03-25 03:07:40 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [7700/9238], loss: 12.25919, learning_rate: 0.00015318, reader_cost: 0.1127, batch_cost: 0.1490, ips: 61.1263 speech/sec, eta: 4 days, 18:58:13
[2023-03-25 03:08:07 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [7800/9238], loss: 12.08134, learning_rate: 0.00015315, reader_cost: 0.1127, batch_cost: 0.1491, ips: 59.6879 speech/sec, eta: 4 days, 21:44:00
[2023-03-25 03:08:34 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [7900/9238], loss: 11.76338, learning_rate: 0.00015312, reader_cost: 0.1128, batch_cost: 0.1491, ips: 59.6670 speech/sec, eta: 4 days, 21:46:02
[2023-03-25 03:09:00 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [8000/9238], loss: 14.13501, learning_rate: 0.00015309, reader_cost: 0.1128, batch_cost: 0.1490, ips: 61.3180 speech/sec, eta: 4 days, 18:35:21
[2023-03-25 03:09:26 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [8100/9238], loss: 12.46890, learning_rate: 0.00015306, reader_cost: 0.1129, batch_cost: 0.1490, ips: 60.8173 speech/sec, eta: 4 days, 19:31:31
[2023-03-25 03:09:53 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [8200/9238], loss: 11.73706, learning_rate: 0.00015304, reader_cost: 0.1130, batch_cost: 0.1489, ips: 59.5249 speech/sec, eta: 4 days, 22:01:33
[2023-03-25 03:10:20 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [8300/9238], loss: 13.55510, learning_rate: 0.00015301, reader_cost: 0.1131, batch_cost: 0.1489, ips: 60.2591 speech/sec, eta: 4 days, 20:34:50
[2023-03-25 03:10:47 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [8400/9238], loss: 11.42852, learning_rate: 0.00015298, reader_cost: 0.1132, batch_cost: 0.1489, ips: 58.7034 speech/sec, eta: 4 days, 23:39:45
[2023-03-25 03:11:13 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [8500/9238], loss: 12.09673, learning_rate: 0.00015295, reader_cost: 0.1133, batch_cost: 0.1489, ips: 61.3245 speech/sec, eta: 4 days, 18:32:27
[2023-03-25 03:11:39 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [8600/9238], loss: 11.85386, learning_rate: 0.00015292, reader_cost: 0.1133, batch_cost: 0.1487, ips: 62.0676 speech/sec, eta: 4 days, 17:09:44
[2023-03-25 03:12:05 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [8700/9238], loss: 12.45733, learning_rate: 0.00015289, reader_cost: 0.1134, batch_cost: 0.1487, ips: 60.4972 speech/sec, eta: 4 days, 20:05:33
[2023-03-25 03:12:32 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [8800/9238], loss: 12.58044, learning_rate: 0.00015286, reader_cost: 0.1135, batch_cost: 0.1487, ips: 59.1468 speech/sec, eta: 4 days, 22:44:07
[2023-03-25 03:12:59 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [8900/9238], loss: 11.18323, learning_rate: 0.00015284, reader_cost: 0.1135, batch_cost: 0.1487, ips: 59.6856 speech/sec, eta: 4 days, 21:39:22
[2023-03-25 03:13:26 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [9000/9238], loss: 12.46362, learning_rate: 0.00015281, reader_cost: 0.1136, batch_cost: 0.1487, ips: 59.6046 speech/sec, eta: 4 days, 21:48:30
[2023-03-25 03:13:53 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [9100/9238], loss: 12.35082, learning_rate: 0.00015278, reader_cost: 0.1137, batch_cost: 0.1487, ips: 59.5122 speech/sec, eta: 4 days, 21:59:03
[2023-03-25 03:14:18 INFO ] trainer:train_epoch:359 - Train epoch: [29/200], batch: [9200/9238], loss: 12.79754, learning_rate: 0.00015275, reader_cost: 0.1137, batch_cost: 0.1486, ips: 62.6087 speech/sec, eta: 4 days, 16:08:30
[2023-03-25 03:14:28 INFO ] trainer:train:510 - ======================================================================
100%|██████████| 449/449 [02:24<00:00, 3.11it/s]
[2023-03-25 03:16:52 INFO ] trainer:train:512 - Test epoch: 29, time/epoch: 0:42:47.759470, loss: 12.83750, cer: 1.00000, best cer: 1.00000
[2023-03-25 03:16:52 INFO ] trainer:train:515 - ======================================================================
[2023-03-25 03:16:53 INFO ] trainer:save_checkpoint:271 - Saved model: models/deepspeech2_streaming_fbank\best_model
[2023-03-25 03:16:55 INFO ] trainer:save_checkpoint:271 - Saved model: models/deepspeech2_streaming_fbank\epoch_29
[2023-03-25 03:16:55 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [0/9238], loss: 15.81416, learning_rate: 0.00015274, reader_cost: 0.2888, batch_cost: 0.2483, ips: 29.7888 speech/sec, eta: 9 days, 19:41:20
[2023-03-25 03:17:22 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [100/9238], loss: 12.42018, learning_rate: 0.00015271, reader_cost: 0.1171, batch_cost: 0.1530, ips: 59.8349 speech/sec, eta: 4 days, 21:19:48
[2023-03-25 03:17:48 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [200/9238], loss: 12.01488, learning_rate: 0.00015268, reader_cost: 0.1189, batch_cost: 0.1493, ips: 60.0901 speech/sec, eta: 4 days, 20:49:27
[2023-03-25 03:18:15 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [300/9238], loss: 12.54136, learning_rate: 0.00015265, reader_cost: 0.1192, batch_cost: 0.1480, ips: 60.3562 speech/sec, eta: 4 days, 20:18:06
[2023-03-25 03:18:41 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [400/9238], loss: 12.00418, learning_rate: 0.00015263, reader_cost: 0.1190, batch_cost: 0.1473, ips: 60.6106 speech/sec, eta: 4 days, 19:48:23
[2023-03-25 03:19:08 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [500/9238], loss: 12.60446, learning_rate: 0.00015260, reader_cost: 0.1188, batch_cost: 0.1483, ips: 59.2052 speech/sec, eta: 4 days, 22:32:52
[2023-03-25 03:19:34 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [600/9238], loss: 12.68887, learning_rate: 0.00015257, reader_cost: 0.1191, batch_cost: 0.1471, ips: 61.2517 speech/sec, eta: 4 days, 18:34:47
[2023-03-25 03:20:02 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [700/9238], loss: 11.23748, learning_rate: 0.00015254, reader_cost: 0.1197, batch_cost: 0.1476, ips: 58.3017 speech/sec, eta: 5 days, 0:22:11
[2023-03-25 03:20:29 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [800/9238], loss: 13.87281, learning_rate: 0.00015251, reader_cost: 0.1195, batch_cost: 0.1481, ips: 59.4306 speech/sec, eta: 4 days, 22:04:33
[2023-03-25 03:20:56 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [900/9238], loss: 12.38159, learning_rate: 0.00015248, reader_cost: 0.1196, batch_cost: 0.1483, ips: 59.2323 speech/sec, eta: 4 days, 22:27:49
[2023-03-25 03:21:23 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [1000/9238], loss: 12.18301, learning_rate: 0.00015245, reader_cost: 0.1199, batch_cost: 0.1484, ips: 58.6176 speech/sec, eta: 4 days, 23:41:54
[2023-03-25 03:21:50 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [1100/9238], loss: 12.64884, learning_rate: 0.00015243, reader_cost: 0.1198, batch_cost: 0.1485, ips: 59.8095 speech/sec, eta: 4 days, 21:18:19
[2023-03-25 03:22:18 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [1200/9238], loss: 12.37833, learning_rate: 0.00015240, reader_cost: 0.1199, batch_cost: 0.1494, ips: 57.0349 speech/sec, eta: 5 days, 3:00:15
[2023-03-25 03:22:45 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [1300/9238], loss: 12.80171, learning_rate: 0.00015237, reader_cost: 0.1202, batch_cost: 0.1491, ips: 59.4441 speech/sec, eta: 4 days, 22:00:42
[2023-03-25 03:23:13 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [1400/9238], loss: 15.01266, learning_rate: 0.00015234, reader_cost: 0.1205, batch_cost: 0.1496, ips: 57.0626 speech/sec, eta: 5 days, 2:55:45
[2023-03-25 03:23:41 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [1500/9238], loss: 12.42445, learning_rate: 0.00015231, reader_cost: 0.1210, batch_cost: 0.1497, ips: 57.3594 speech/sec, eta: 5 days, 2:17:07
[2023-03-25 03:24:08 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [1600/9238], loss: 11.98512, learning_rate: 0.00015229, reader_cost: 0.1211, batch_cost: 0.1497, ips: 58.5377 speech/sec, eta: 4 days, 23:48:58
[2023-03-25 03:24:36 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [1700/9238], loss: 12.19901, learning_rate: 0.00015226, reader_cost: 0.1214, batch_cost: 0.1496, ips: 58.4294 speech/sec, eta: 5 days, 0:01:51
[2023-03-25 03:25:03 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [1800/9238], loss: 12.16341, learning_rate: 0.00015223, reader_cost: 0.1218, batch_cost: 0.1495, ips: 57.8747 speech/sec, eta: 5 days, 1:10:24
[2023-03-25 03:25:31 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [1900/9238], loss: 13.00252, learning_rate: 0.00015220, reader_cost: 0.1221, batch_cost: 0.1497, ips: 56.8614 speech/sec, eta: 5 days, 3:19:30
[2023-03-25 03:25:59 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [2000/9238], loss: 11.33726, learning_rate: 0.00015217, reader_cost: 0.1224, batch_cost: 0.1495, ips: 58.3287 speech/sec, eta: 5 days, 0:12:54
[2023-03-25 03:26:26 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [2100/9238], loss: 12.16204, learning_rate: 0.00015214, reader_cost: 0.1227, batch_cost: 0.1494, ips: 58.3709 speech/sec, eta: 5 days, 0:07:13
[2023-03-25 03:26:54 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [2200/9238], loss: 11.68896, learning_rate: 0.00015212, reader_cost: 0.1230, batch_cost: 0.1493, ips: 58.0798 speech/sec, eta: 5 days, 0:42:54
[2023-03-25 03:27:22 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [2300/9238], loss: 11.79155, learning_rate: 0.00015209, reader_cost: 0.1233, batch_cost: 0.1494, ips: 56.5730 speech/sec, eta: 5 days, 3:55:20
[2023-03-25 03:27:50 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [2400/9238], loss: 12.33611, learning_rate: 0.00015206, reader_cost: 0.1235, batch_cost: 0.1494, ips: 57.5791 speech/sec, eta: 5 days, 1:44:57
[2023-03-25 03:28:18 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [2500/9238], loss: 12.39493, learning_rate: 0.00015203, reader_cost: 0.1236, batch_cost: 0.1495, ips: 57.5808 speech/sec, eta: 5 days, 1:44:16
[2023-03-25 03:28:45 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [2600/9238], loss: 12.18027, learning_rate: 0.00015200, reader_cost: 0.1238, batch_cost: 0.1493, ips: 58.3098 speech/sec, eta: 5 days, 0:12:29
[2023-03-25 03:29:13 INFO ] trainer:train_epoch:359 - Train epoch: [30/200], batch: [2700/9238], loss: 12.61896, learning_rate: 0.00015198, reader_cost: 0.1239, batch_cost: 0.1494, ips: 57.4356 speech/sec, eta: 5 days, 2:01:49

Exception in thread Thread-11:
Traceback (most recent call last):
  File "C:\Users\fang\anaconda3\envs\newbb\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File "C:\Users\fang\anaconda3\envs\newbb\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\fang\anaconda3\envs\newbb\lib\site-packages\paddle\fluid\dataloader\dataloader_iter.py", line 217, in _thread_loop
Traceback (most recent call last):
  File "C:\Users\fang\Desktop\PPASR-develop\train.py", line 23, in <module>
    batch = self._dataset_fetcher.fetch(indices,
  File "C:\Users\fang\anaconda3\envs\newbb\lib\site-packages\paddle\fluid\dataloader\fetcher.py", line 125, in fetch
    data.append(self.dataset[idx])
  File "C:\Users\fang\Desktop\PPASR-develop\ppasr\data_utils\reader.py", line 74, in __getitem__
    trainer.train(save_model_path=args.save_model_path,
  File "C:\Users\fang\Desktop\PPASR-develop\ppasr\trainer.py", line 508, in train
    feature = self._audio_featurizer.featurize(audio_segment)
  File "C:\Users\fang\Desktop\PPASR-develop\ppasr\data_utils\featurizer\audio_featurizer.py", line 64, in featurize
    return self._compute_fbank(samples=samples,
  File "C:\Users\fang\Desktop\PPASR-develop\ppasr\data_utils\featurizer\audio_featurizer.py", line 131, in _compute_fbank
    self.train_epoch(epoch_id=epoch_id, save_model_path=save_model_path, writer=writer, nranks=nranks)
  File "C:\Users\fang\Desktop\PPASR-develop\ppasr\trainer.py", line 315, in train_epoch
    mat = fbank(waveform,
  File "C:\Users\fang\anaconda3\envs\newbb\lib\site-packages\paddleaudio-1.0.2-py3.8.egg\paddleaudio\compliance\kaldi.py", line 462, in fbank
    loss_dict = self.model(inputs, input_lens, labels, label_lens)
  File "C:\Users\fang\anaconda3\envs\newbb\lib\site-packages\paddle\fluid\dygraph\layers.py", line 948, in __call__
    strided_input, signal_log_energy = _get_window(
  File "C:\Users\fang\anaconda3\envs\newbb\lib\site-packages\paddleaudio-1.0.2-py3.8.egg\paddleaudio\compliance\kaldi.py", line 166, in _get_window
    return self.forward(*inputs, **kwargs)
  File "C:\Users\fang\Desktop\PPASR-develop\ppasr\model_utils\deepspeech2\model.py", line 58, in forward
    signal_log_energy = _get_log_energy(strided_input, epsilon,
  File "C:\Users\fang\anaconda3\envs\newbb\lib\site-packages\paddleaudio-1.0.2-py3.8.egg\paddleaudio\compliance\kaldi.py", line 103, in _get_log_energy
    paddle.to_tensor(math.log(energy_floor), dtype=strided_input.dtype))
  File "C:\Users\fang\anaconda3\envs\newbb\lib\site-packages\paddle\tensor\creation.py", line 546, in to_tensor
    return _to_tensor_non_static(data, dtype, place, stop_gradient)
  File "C:\Users\fang\anaconda3\envs\newbb\lib\site-packages\paddle\tensor\creation.py", line 405, in _to_tensor_non_static
    return core.eager.Tensor(
OSError: (External) CUDA error(719), unspecified launch failure.
  [Hint: Please search for the error code(719) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/groupCUDARTTYPES.html#groupCUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:259)

    eouts, eouts_len, final_state_h_box, final_state_c_box = self.encoder(speech, speech_lengths, None, None)

  File "C:\Users\fang\anaconda3\envs\newbb\lib\site-packages\paddle\fluid\dygraph\layers.py", line 950, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "C:\Users\fang\anaconda3\envs\newbb\lib\site-packages\paddle\fluid\dygraph\layers.py", line 935, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "C:\Users\fang\Desktop\PPASR-develop\ppasr\model_utils\deepspeech2\encoder.py", line 91, in forward
    x, final_state = self.rnn[i](x, init_state_list[i], x_lens)  # [B, T, D]
  File "C:\Users\fang\anaconda3\envs\newbb\lib\site-packages\paddle\fluid\dygraph\layers.py", line 950, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "C:\Users\fang\anaconda3\envs\newbb\lib\site-packages\paddle\fluid\dygraph\layers.py", line 935, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "C:\Users\fang\anaconda3\envs\newbb\lib\site-packages\paddle\nn\layer\rnn.py", line 1082, in forward
    return self._cudnn_impl(inputs, initial_states, sequence_length)
  File "C:\Users\fang\anaconda3\envs\newbb\lib\site-packages\paddle\nn\layer\rnn.py", line 1016, in _cudnn_impl
    _, _, out, state = _legacy_C_ops.rnn(
OSError: (External) CUDNN error(8), CUDNN_STATUS_EXECUTION_FAILED.
  [Hint: Please search for the error code(8) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at ..\paddle\phi\kernels\gpu\rnn_kernel.cu.cc:396) [operator < rnn > error]

yeyupiaoling commented 1 year ago

That is most likely insufficient GPU memory. Try reducing the batch size.

navy7913 commented 1 year ago

Hello, I tried to lower the batch_size, but I could not find where it is supposed to be adjusted. The workaround I found was to add a line add_arg('batch_size', int, 4, 'training batch size') to train.py, and after that the error indeed stopped occurring. However, I have now trained for 190 epochs and the loss always stays in the low teens and never decreases. Could you advise me on how to get the loss to go down?
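For anyone reproducing that workaround: the add_arg call only registers the flag. A hedged sketch of the full pattern follows; only the add_arg('batch_size', ...) line comes from the comment above, while the import path and the surrounding plumbing are assumptions about how PPASR-style scripts are structured:

```python
# Hypothetical sketch of exposing batch_size as a CLI flag in train.py.
import argparse
import functools

from ppasr.utils.utils import add_arguments  # helper these scripts typically use

parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
add_arg('batch_size', int, 4, 'training batch size')  # smaller batch -> less GPU memory
args = parser.parse_args()

# The parsed value still has to reach the DataLoader, e.g. by overriding the
# batch size the trainer reads from its YAML config before loaders are built.
```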

yeyupiaoling commented 1 year ago

batch_size is set in the configuration file. If the loss no longer decreases, training has most likely just converged.
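As a pointer for later readers, here is a sketch of lowering that config value programmatically. The file name configs/config_zh.yml and the dataset_conf.batch_size key are assumptions about PPASR's config layout, so verify them against the configs/ directory in your checkout:

```python
# Hypothetical sketch: lower batch_size in the YAML config instead of
# patching train.py. File name and key path are assumptions; check configs/.
import yaml

path = 'configs/config_zh.yml'
with open(path, 'r', encoding='utf-8') as f:
    configs = yaml.safe_load(f)

configs['dataset_conf']['batch_size'] = 4  # smaller batches ease GPU memory pressure

with open(path, 'w', encoding='utf-8') as f:
    yaml.safe_dump(configs, f, allow_unicode=True)
```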