xinntao / Real-ESRGAN

Real-ESRGAN aims at developing Practical Algorithms for General Image/Video Restoration.
BSD 3-Clause "New" or "Revised" License

How to train the model with double gpu? #780

Open kl402401 opened 5 months ago

kl402401 commented 5 months ago

I am trying to train the model on two GPUs, but the launch fails. Why?

! CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port=21 realesrgan/train.py -opt options/finetune_realesrgan_x4plus_pairdata.yml --auto_resume

train.py: error: unrecognized arguments: --local-rank=1
[2024-04-12 09:48:38,075] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 69084) of binary: /data/envs/geo_real_esrgan/bin/python
Traceback (most recent call last):
  File "", line 198, in _run_module_as_main
  File "", line 88, in _run_code
  File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in main()
  File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in call
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/envs/geo_real_esrgan/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

realesrgan/train.py FAILED

Failures:
[1]:
  time : 2024-04-12_09:48:38
  host : geo517
  rank : 1 (local_rank: 1)
  exitcode : 2 (pid: 69085)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time : 2024-04-12_09:48:38
  host : geo517
  rank : 0 (local_rank: 0)
  exitcode : 2 (pid: 69084)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

kl402401 commented 5 months ago

By the way, I have two GPU cards.

kl402401 commented 5 months ago

Solved. Change

CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch --nproc_per_node=2 --master_port=4321 realesrgan/train.py -opt options/finetune_realesrgan_x4plus_pairdata.yml --launcher pytorch --auto_resume

to

CUDA_VISIBLE_DEVICES=0,1 \
torchrun --nproc_per_node=2 --master_port=4321 realesrgan/train.py -opt options/finetune_realesrgan_x4plus_pairdata.yml --launcher pytorch --auto_resume

In other words, torchrun replaces python -m torch.distributed.launch.
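For anyone who has to keep using python -m torch.distributed.launch: the error message suggests that the script's argument parser does not recognize the hyphenated --local-rank flag that the launcher appends, whereas torchrun passes no such flag and exports the LOCAL_RANK environment variable instead. Below is a minimal sketch of a parser that tolerates both launchers. It is not the actual Real-ESRGAN parser: build_parser and resolve_local_rank are hypothetical names, and the option set is only assumed from the flags used in the commands in this thread.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for a training script's option parser."""
    parser = argparse.ArgumentParser()
    # Options assumed from the commands in this thread.
    parser.add_argument("-opt", type=str, required=True)
    parser.add_argument("--launcher", default="none")
    parser.add_argument("--auto_resume", action="store_true")
    # Register both spellings so either launcher style is accepted.
    parser.add_argument(
        "--local_rank", "--local-rank", dest="local_rank", type=int, default=None
    )
    return parser


def resolve_local_rank(argv, env=None) -> int:
    """Prefer an explicit flag, then the LOCAL_RANK variable, then 0."""
    env = os.environ if env is None else env
    args = build_parser().parse_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    return int(env.get("LOCAL_RANK", 0))


if __name__ == "__main__":
    # Old-style launch: the rank arrives as a command-line flag.
    print(resolve_local_rank(["-opt", "cfg.yml", "--local-rank=1"], env={}))
    # torchrun-style launch: the rank arrives through the environment.
    print(resolve_local_rank(["-opt", "cfg.yml"], env={"LOCAL_RANK": "1"}))
```

Switching to torchrun, as above, avoids touching the script at all, so this kind of change would only be worth making if the old launcher cannot be dropped.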