**Open** · Sayaya-27 opened this issue 1 year ago
Is it currently not possible to train with DDP? I tried it myself and got an error during the val stage. Is this caused by a problem in the code?
You can train with DDP. What error are you getting? That said, DDP training is actually slower than single-GPU training and the accuracy is also worse, possibly because the original DDP code has problems; I haven't looked into it closely. I'd recommend single-GPU DP training for now. I'll optimize multi-GPU training when I have time.
```
Epoch  gpu_mem    box      cls    dfl  labels  img_size
0/149    14.6G  1.387  0.06326  0.063      19      1536: 100%|██████████| 4221/4221 [1:57:00<00:00, 1.66s/it]
Class  Images  Labels  P  R  HBBmAP@.5  HBBmAP@.5:.95:  45%|█████     | 806/1809 [30:07<29:59, 1.79s/it]
```
```
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809156 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809156 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3116356 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 3116357) of binary: /home/server/anaconda3/envs/obb/bin/python
Traceback (most recent call last):
  File "/home/server/anaconda3/envs/obb/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/server/anaconda3/envs/obb/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/server/anaconda3/envs/obb/lib/python3.8/site-packages/torch/distributed/run.py", line 723, in <module>
    main()
  File "/home/server/anaconda3/envs/obb/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/server/anaconda3/envs/obb/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/server/anaconda3/envs/obb/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/server/anaconda3/envs/obb/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/server/anaconda3/envs/obb/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
Root Cause (first observed failure):
[0]:
  time       : 2023-09-12_11:15:18
  host       : server-S8030GM2NE
  rank       : 1 (local_rank: 1)
  exitcode   : -6 (pid: 3116357)
  error_file:
  traceback  : Signal 6 (SIGABRT) received by PID 3116357
```

This is the error I get when training with DDP. I could not get the method described in start.md to run successfully, so I switched to the approach yolov5 documents; the command was `python -m torch.distributed.run --nproc_per_node 2 train.py --cfg yolov8m-p6.yaml --imgsz 1536 --batch-size 6`.
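For context on the log above: rank 1 timed out inside a BROADCAST right as validation reached 45%, which is the signature of a trainer that validates on the main process only while the other ranks move on and block at the next collective. A minimal sketch of that pattern with hypothetical names (this is not the repo's actual code):

```python
import os

# torchrun / torch.distributed.run sets RANK; -1 means no DDP launch.
RANK = int(os.environ.get("RANK", -1))

def run_epoch(train_fn, val_fn, rank: int = RANK) -> bool:
    """Train on every rank, validate only on the main process.

    Returns True if this rank ran validation. While rank 0 validates,
    the other ranks reach the next collective (e.g. an epoch-start
    broadcast) and wait there; if validation outlasts the NCCL timeout,
    the watchdog aborts exactly as in the log above.
    """
    train_fn()                 # all ranks train
    if rank in (-1, 0):        # YOLOv5-style main-process-only validation
        val_fn()
        return True
    return False
```

With a 1536-px validation pass taking ~1 hour here, the 30-minute default timeout is easily exceeded by this pattern.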
I've reworked the single-GPU and multi-GPU training code and also fixed a small bug in the loss. Please pull the latest code and check the training commands in getstart.md.
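If the watchdog timeout still appears after pulling the updated code, one hedged workaround is to pass a larger `timeout` to `torch.distributed.init_process_group`, so a long main-process validation pass does not trip the default 30-minute NCCL watchdog. A minimal sketch, assuming a torchrun-style launch (the helper names are made up):

```python
import os
from datetime import timedelta

def make_nccl_timeout(minutes: int = 120) -> timedelta:
    """Timeout to hand to init_process_group; the NCCL default is 30 min."""
    return timedelta(minutes=minutes)

def init_ddp(minutes: int = 120) -> timedelta:
    """Initialize the NCCL process group with a raised collective timeout."""
    timeout = make_nccl_timeout(minutes)
    # Only initialize when actually launched under torch.distributed.run,
    # which exports RANK (and friends) into the environment.
    if "RANK" in os.environ:
        import torch.distributed as dist  # deferred: needs CUDA/NCCL at runtime
        dist.init_process_group(backend="nccl", timeout=timeout)
    return timeout
```

This only papers over the imbalance; the cleaner fix is what the updated code aims for, i.e. not leaving the other ranks waiting in a collective while one rank does all the validation work.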