Closed Asianfleet closed 6 months ago
Hello, based on the information you provided, it seems to be an issue with multi-GPU operation.
Regarding the warning "/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated":
I use four 3090 GPUs to run this code, so I did not encounter similar errors. After searching for this problem on Google, I think you can resolve this warning by following https://github.com/facebookresearch/detr/issues/578
As for the error you provided, I haven't encountered a similar situation while running the code. Could you provide detailed information about the file or specific line where the error occurred?
I didn't notice the multi-GPU configuration when I ran it, but I only have one GPU. Does your code work on a single GPU?
If you have only one GPU, you can set the parameter '--nproc_per_node=1'. However, it's important to note that when you run this code using only one GPU, the generated results may be slightly inferior to what was reported in our paper. This is because each batch will contain only one perspective image for optimization.
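Spelled out, the single-GPU invocation would simply be the same launch command with one worker process. A sketch only; the script and config paths are the ones quoted elsewhere in this issue, not verified here:

```python
import shlex

# Hypothetical single-GPU launch command; the script name and config path
# are taken from the commands pasted elsewhere in this thread.
cmd = [
    "python", "-m", "torch.distributed.launch",
    "--nproc_per_node=1",  # one worker process -> one GPU
    "train_x_dreamer.py",
    "--config", "configs/cupcake_geometry.json",
    "--out-dir", "results/result_XDreamer/cupcake_geometry",
]
print(shlex.join(cmd))
```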
I tried to set the parameter '--nproc_per_node=1', and received an error:
/data/AIMH/envs/XDreamer/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank
argument to be set, please
change it to read from os.environ['LOCAL_RANK']
instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
Failures:
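The deprecation notice above also points at the change it recommends: read the rank from the environment rather than from a `--local_rank` argument. A minimal sketch of that change (hypothetical, not the repo's actual code):

```python
import os

# Hypothetical sketch of the change the FutureWarning recommends: under
# torchrun, the local rank arrives via the LOCAL_RANK environment variable
# instead of the --local_rank command-line argument.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # default 0 for single-process runs
print(local_rank)
```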
Because my machine has multiple GPUs, I am not able to test what problems my method will encounter on a single GPU. Setting FLAGS.multi_gpu in the code to 'False' may solve your problem. If not, it may be necessary to refactor the code to obtain a version that runs on a single-GPU machine.
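A sketch of what that guard might look like. Everything here apart from the multi_gpu flag name is a hypothetical stand-in; the real repo code may differ:

```python
import os

class Flags:
    """Stand-in for the repo's FLAGS object; only multi_gpu is assumed here."""
    multi_gpu = False  # set to False for a single-GPU run

def pick_device(flags):
    """Return (device_string, needs_distributed_init)."""
    if flags.multi_gpu:
        # Multi-GPU path: one process per GPU, rank supplied by the launcher.
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
        return f"cuda:{local_rank}", True  # caller would init_process_group here
    # Single-GPU path: no process group, just use the first device.
    return "cuda:0", False

print(pick_device(Flags()))
```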
Thanks! I will try it later.
I changed FLAGS.multi_gpu to False in renderer and train_x_dreamer, but I still get the same error when I run the following command:
python -m torch.distributed.launch --nproc_per_node=4 \
    train_x_dreamer.py \
    --config configs/cupcake_geometry.json \
    --out-dir 'results/result_XDreamer/cupcake_geometry'
an error occurred:
(XDreamer) admin1@ubuntu:/data/AIMH/pkgs/3D/Gen/X-Dreamer$ python -m torch.distributed.launch --nproc_per_node=4 \
    train_x_dreamer.py \
    --config configs/cupcake_geometry.json \
    --out-dir 'results/result_XDreamer/cupcake_geometry'
warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -4) local_rank: 0 (pid: 1304815) of binary: /data/anaconda3/envs/XDreamer/bin/python
Traceback (most recent call last):
  File "/data/anaconda3/envs/XDreamer/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/anaconda3/envs/XDreamer/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_x_dreamer.py FAILED
Failures:
[1]:
  time      : 2024-01-17_22:55:52
  host      : ubuntu
  rank      : 1 (local_rank: 1)
  exitcode  : -4 (pid: 1304816)
  error_file: <N/A>
  traceback : Signal 4 (SIGILL) received by PID 1304816
[2]:
  time      : 2024-01-17_22:55:52
  host      : ubuntu
  rank      : 2 (local_rank: 2)
  exitcode  : -4 (pid: 1304817)
  error_file: <N/A>
  traceback : Signal 4 (SIGILL) received by PID 1304817
[3]:
  time      : 2024-01-17_22:55:52
  host      : ubuntu
  rank      : 3 (local_rank: 3)
  exitcode  : -4 (pid: 1304818)
  error_file: <N/A>
  traceback : Signal 4 (SIGILL) received by PID 1304818
Root Cause (first observed failure):
[0]:
  time      : 2024-01-17_22:55:52
  host      : ubuntu
  rank      : 0 (local_rank: 0)
  exitcode  : -4 (pid: 1304815)
  error_file: <N/A>
  traceback : Signal 4 (SIGILL) received by PID 1304815
System: Ubuntu 20.04; CUDA: 11.4; GPU: A100 40G