**Open** · Sayaya-27 opened this issue 1 year ago
Is it currently not possible to train with DDP? I tried it myself and got an error during the val stage. Is this caused by a problem in the code?
You can train with DDP. What error are you getting? That said, DDP training is actually slower than single-GPU training and the accuracy is also worse, possibly because the original DDP code has problems; I haven't looked into it closely. I'd recommend single-GPU DP training for now. I'll optimize multi-GPU training when I have time.
```
Epoch  gpu_mem    box      cls    dfl  labels  img_size
0/149    14.6G  1.387  0.06326  0.063      19      1536: 100%|██████████| 4221/4221 [1:57:00<00:00, 1.66s/it]
Class  Images  Labels  P  R  HBBmAP@.5  HBBmAP@.5:.95:  45%|█████     | 806/1809 [30:07<29:59, 1.79s/it]
```
```
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809156 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809156 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3116356 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 3116357) of binary: /home/server/anaconda3/envs/obb/bin/python
Traceback (most recent call last):
  File "/home/server/anaconda3/envs/obb/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/server/anaconda3/envs/obb/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/server/anaconda3/envs/obb/lib/python3.8/site-packages/torch/distributed/run.py", line 723, in <module>
    main()
  File "/home/server/anaconda3/envs/obb/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/server/anaconda3/envs/obb/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/server/anaconda3/envs/obb/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/server/anaconda3/envs/obb/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/server/anaconda3/envs/obb/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
Root Cause (first observed failure):
[0]:
  time       : 2023-09-12_11:15:18
  host       : server-S8030GM2NE
  rank       : 1 (local_rank: 1)
  exitcode   : -6 (pid: 3116357)
  error_file:
  traceback  : Signal 6 (SIGABRT) received by PID 3116357
```

This is the error I get when training with DDP. I could not get the method described in start.md to run successfully, so I switched to the approach yolov5 documents; the command was `python -m torch.distributed.run --nproc_per_node 2 train.py --cfg yolov8m-p6.yaml --imgsz 1536 --batch-size 6`.
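For context on the log above: rank 1 timed out inside a BROADCAST right as validation reached 45%, which is the signature of a trainer that validates on the main process only while the other ranks move on and block at the next collective. A minimal sketch of that pattern with hypothetical names (this is not the repo's actual code):

```python
import os

# torchrun / torch.distributed.run sets RANK; -1 means no DDP launch.
RANK = int(os.environ.get("RANK", -1))

def run_epoch(train_fn, val_fn, rank: int = RANK) -> bool:
    """Train on every rank, validate only on the main process.

    Returns True if this rank ran validation. While rank 0 validates,
    the other ranks reach the next collective (e.g. an epoch-start
    broadcast) and wait there; if validation outlasts the NCCL timeout,
    the watchdog aborts exactly as in the log above.
    """
    train_fn()                 # all ranks train
    if rank in (-1, 0):        # YOLOv5-style main-process-only validation
        val_fn()
        return True
    return False
```

With a 1536-px validation pass taking ~1 hour here, the 30-minute default timeout is easily exceeded by this pattern.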
I've reworked the single-GPU and multi-GPU training code and also fixed a small bug in the loss. Please pull the latest code and check the training commands in getstart.md.
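If the watchdog timeout still appears after pulling the updated code, one hedged workaround is to pass a larger `timeout` to `torch.distributed.init_process_group`, so a long main-process validation pass does not trip the default 30-minute NCCL watchdog. A minimal sketch, assuming a torchrun-style launch (the helper names are made up):

```python
import os
from datetime import timedelta

def make_nccl_timeout(minutes: int = 120) -> timedelta:
    """Timeout to hand to init_process_group; the NCCL default is 30 min."""
    return timedelta(minutes=minutes)

def init_ddp(minutes: int = 120) -> timedelta:
    """Initialize the NCCL process group with a raised collective timeout."""
    timeout = make_nccl_timeout(minutes)
    # Only initialize when actually launched under torch.distributed.run,
    # which exports RANK (and friends) into the environment.
    if "RANK" in os.environ:
        import torch.distributed as dist  # deferred: needs CUDA/NCCL at runtime
        dist.init_process_group(backend="nccl", timeout=timeout)
    return timeout
```

This only papers over the imbalance; the cleaner fix is what the updated code aims for, i.e. not leaving the other ranks waiting in a collective while one rank does all the validation work.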