Closed · wangcc57 closed this issue 2 years ago
Is the audio too long? If it's too long, use long-speech recognition. Audio that's too short will also throw this error.
test_long.wav is 3 minutes long and works fine, but the 50-second file I tested fails. I'll try long-speech recognition.
test_long.wav works for you with short-speech recognition too? It shouldn't. Feeding audio that long in directly is bound to run out of GPU memory.
It works fine. Running test_long.wav directly on the P40 throws no error.
What happens if you switch to long-speech recognition?
Long-speech recognition works. This makes no sense. Is my wav file cursed? :(
That can't be. I still think it's because the audio is too long; an over-long input produces a similar error.
Not a big problem anyway, since the audio has to be cut into single sentences for processing (roughly like the sketch below). Have you trained on the Wenet dataset? What resources does it need, and how long does it take?
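For reference, a minimal sketch of cutting a long recording into fixed-length pieces before decoding, using soundfile; the 30-second window is an arbitrary placeholder, and the real long-speech path likely segments more intelligently (e.g. by silence/VAD):

```python
# Minimal sketch: split a long wav into fixed-length chunks for decoding.
# The 30 s window is arbitrary; a real long-speech pipeline would segment
# by silence/VAD instead, so this only illustrates the idea.
import soundfile as sf

def split_wav(path, chunk_seconds=30.0):
    audio, sr = sf.read(path)
    step = int(chunk_seconds * sr)
    return [audio[i:i + step] for i in range(0, len(audio), step)], sr

chunks, sr = split_wav("test_long.wav")
print(f"{len(chunks)} chunks of at most 30 s each")
```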
I tried training on the Wenet dataset and found the loss keeps rising throughout the first epoch. Am I doing something wrong?
I haven't trained on it. During the first epoch the dataset is sorted from shortest to longest, so the batches get progressively longer and harder and the loss can climb within the epoch; that behavior is normal. A sketch of the ordering is below.
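For reference, a minimal sketch of that shortest-to-longest ordering, assuming a JSON-lines manifest with a per-utterance duration field (the file name and schema are illustrative, not necessarily PPASR's exact format):

```python
# Minimal sketch: order utterances from shortest to longest for epoch 1.
# Assumes a JSON-lines manifest with a "duration" field per utterance;
# the schema here is illustrative.
import json

with open("manifest.train", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]

samples.sort(key=lambda s: s["duration"])  # short, easy clips come first
```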
I also found that two GPUs are slower than one, and sometimes training even hangs.
That shouldn't happen. Are you going by the remaining-time estimate (eta)? Which log are you looking at?
Nothing in the console, and the models directory has no saved checkpoints either. It's been hours; a single-GPU run would have saved several times by now.
The multi-GPU logs are under the log directory, one file per GPU.
Check them to see whether training is proceeding normally.
Here is the dual-GPU run, worker0's log:
[2022-07-15 09:37:00.937632] Training data: 14660592
[2022-07-15 09:37:38.068987] Train epoch: [1/65], batch: [0/229071], loss: 85.23431, learning rate: 0.00005000, eta: 6398 days, 7:53:18
[2022-07-15 09:41:46.882346] Train epoch: [1/65], batch: [100/229071], loss: 14.01040, learning rate: 0.00005000, eta: 428 days, 9:31:19
[2022-07-15 09:45:55.890805] Train epoch: [1/65], batch: [200/229071], loss: 17.09803, learning rate: 0.00005000, eta: 427 days, 20:36:51
[2022-07-15 09:50:07.060608] Train epoch: [1/65], batch: [300/229071], loss: 16.94832, learning rate: 0.00005000, eta: 431 days, 13:10:36

worker1's log contains only this line:
[2022-07-15 09:31:09.362761] Training data: 14660592

Below is the single-GPU log:
[2022-07-15 09:55:03.052075] Training data: 14660592
[2022-07-15 09:55:10.405565] Train epoch: [1/65], batch: [0/458143], loss: 85.80699, learning rate: 0.00005000, eta: 2534 days, 2:46:47
[2022-07-15 09:55:25.979604] Train epoch: [1/65], batch: [100/458143], loss: 15.74516, learning rate: 0.00005000, eta: 53 days, 14:39:28
[2022-07-15 09:55:41.944422] Train epoch: [1/65], batch: [200/458143], loss: 13.97528, learning rate: 0.00005000, eta: 54 days, 22:16:02
[2022-07-15 09:55:58.001585] Train epoch: [1/65], batch: [300/458143], loss: 18.45344, learning rate: 0.00005000, eta: 55 days, 5:36:10
worker1 printed no training log, and it's much slower than a single GPU. I'll go dig through your code.
That's strange. Are your two cards identical?
Could data preprocessing be failing to keep up?
They're definitely identical; I deliberately installed two of the same card.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:04:00.0 Off | Off |
| N/A 43C P0 51W / 250W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P40 Off | 00000000:83:00.0 Off | Off |
| N/A 41C P0 50W / 250W | 0MiB / 24576MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The GPU cores are at 100%, so the data pipeline should be fine.
Are both cards at 100%?
Try testing the two cards separately with single-GPU runs. Your card 1 has no log output; it may be what's dragging the whole run down.
I'll try.
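For reference, a minimal sketch of running the same trainer pinned to one card at a time, assuming train.py honors CUDA_VISIBLE_DEVICES as CUDA applications normally do:

```python
# Minimal sketch: launch the trainer once per GPU, each run pinned to a
# single card via CUDA_VISIBLE_DEVICES (respected by CUDA apps generally).
import os
import subprocess

for gpu_id in ("0", "1"):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_id)
    print(f"--- single-GPU test on card {gpu_id} ---")
    subprocess.run(["python3", "train.py"], env=env, check=True)
```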
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:04:00.0 Off | Off |
| N/A 39C P0 50W / 250W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P40 Off | 00000000:83:00.0 Off | Off |
| N/A 53C P0 200W / 250W | 4601MiB / 24576MiB | 95% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 1 N/A N/A 12426 C python3 4599MiB |
+-----------------------------------------------------------------------------+
[2022-07-15 10:44:59.195437] Training data: 14660592
[2022-07-15 10:45:04.379965] Train epoch: [1/65], batch: [0/458143], loss: 88.21230, learning rate: 0.00005000, eta: 1786 days, 13:44:50
[2022-07-15 10:45:17.785773] Train epoch: [1/65], batch: [100/458143], loss: 15.96248, learning rate: 0.00005000, eta: 46 days, 3:21:53
[2022-07-15 10:45:30.826052] Train epoch: [1/65], batch: [200/458143], loss: 13.98875, learning rate: 0.00005000, eta: 44 days, 20:11:12
[2022-07-15 10:45:43.916232] Train epoch: [1/65], batch: [300/458143], loss: 18.24256, learning rate: 0.00005000, eta: 45 days, 0:27:07
[2022-07-15 10:45:57.125810] Train epoch: [1/65], batch: [400/458143], loss: 17.17925, learning rate: 0.00005000, eta: 45 days, 10:49:58
[2022-07-15 10:46:10.207853] Train epoch: [1/65], batch: [500/458143], loss: 12.79642, learning rate: 0.00005000, eta: 45 days, 0:00:46
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:04:00.0 Off | Off |
| N/A 51C P0 154W / 250W | 4601MiB / 24576MiB | 97% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P40 Off | 00000000:83:00.0 Off | Off |
| N/A 47C P0 51W / 250W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 12622 C python3 4599MiB |
+-----------------------------------------------------------------------------+
[2022-07-15 10:48:46.283469] Training data: 14660592
[2022-07-15 10:48:52.645588] Train epoch: [1/65], batch: [0/458143], loss: 87.22278, learning rate: 0.00005000, eta: 2192 days, 3:52:18
[2022-07-15 10:49:05.040063] Train epoch: [1/65], batch: [100/458143], loss: 15.78344, learning rate: 0.00005000, eta: 42 days, 15:40:24
[2022-07-15 10:49:17.719734] Train epoch: [1/65], batch: [200/458143], loss: 14.05792, learning rate: 0.00005000, eta: 43 days, 15:09:58
[2022-07-15 10:49:30.346088] Train epoch: [1/65], batch: [300/458143], loss: 18.16205, learning rate: 0.00005000, eta: 43 days, 10:08:28
Running each card separately works fine; running both together is slow and still hangs.
That's bizarre. In my experience two cards usually cut the time roughly in half. Try setting the data-loading workers to 16 (see the sketch below). Your CPU has at least 16 cores, right? https://github.com/yeyupiaoling/PPASR/blob/e4ed0f821cfdaaf55741650b4db601cce53378c0/train.py#L10
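For reference, a minimal sketch of how the worker count feeds a paddle.io.DataLoader; ToyDataset is a hypothetical stand-in for the real audio-feature dataset:

```python
# Minimal sketch: more loader workers keep the GPUs fed during training.
# ToyDataset is a stand-in for the real audio-feature dataset.
import numpy as np
from paddle.io import DataLoader, Dataset

class ToyDataset(Dataset):
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        return np.random.rand(80, 100).astype("float32")  # fake features

loader = DataLoader(ToyDataset(), batch_size=32, shuffle=True,
                    num_workers=16)  # raise this if the GPUs starve
```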
I already set that to 32; the CPU has 32 cores.
Do your other multi-GPU jobs run normally? What NCCL version do you have? One quick sanity check is below.
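For reference, paddle.utils.run_check() verifies the Paddle + CUDA install, and on a multi-GPU machine it also exercises cross-GPU communication, which would surface an NCCL problem:

```python
# Quick sanity check of the Paddle + CUDA/NCCL setup; on a multi-GPU box
# run_check() also tests cross-GPU communication.
import paddle

paddle.utils.run_check()
print("visible GPUs:", paddle.device.cuda.device_count())  # expect 2 here
```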
nccl-local-repo-rhel7-2.12.12-cuda11.7-1.0-1.x86_64. I suspect a CUDA version problem: nvidia-smi reports CUDA 11.6, but the 11.7 install also runs. I'll switch to 11.6 and try.
OK.
Still the same: the GPUs are maxed out, but it's very slow.
That's really odd. Try it with deepspeech directly and see.
Do your other multi-GPU tasks run normally?
The server was down for maintenance over the weekend; I haven't tested yet.
It just occurred to me: are your single-GPU and dual-GPU runs using the same model? You should also compare the time a complete epoch takes.
The big model works fine with the wav files bundled in the project, but it errors on my own wav. Do I need to update to the latest code?