xmu-xiaoma666 / X-Dreamer

A PyTorch implementation of “X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation”
Apache License 2.0

ChildFailedError #2

Closed. Asianfleet closed this issue 6 months ago.

Asianfleet commented 8 months ago

When I run the following command:

python -m torch.distributed.launch --nproc_per_node=4 \
    train_x_dreamer.py \
    --config configs/cupcake_geometry.json \
    --out-dir 'results/result_XDreamer/cupcake_geometry'

an error occurred:

(XDreamer) admin1@ubuntu:/data/AIMH/pkgs/3D/Gen/X-Dreamer$ python -m torch.distributed.launch --nproc_per_node=4 \
    train_x_dreamer.py \
    --config configs/cupcake_geometry.json \
    --out-dir 'results/result_XDreamer/cupcake_geometry'

/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -4) local_rank: 0 (pid: 1304815) of binary: /data/anaconda3/envs/XDreamer/bin/python
Traceback (most recent call last):
  File "/data/anaconda3/envs/XDreamer/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/anaconda3/envs/XDreamer/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_x_dreamer.py FAILED

Failures:
[1]:
  time      : 2024-01-17_22:55:52
  host      : ubuntu
  rank      : 1 (local_rank: 1)
  exitcode  : -4 (pid: 1304816)
  error_file: <N/A>
  traceback : Signal 4 (SIGILL) received by PID 1304816
[2]:
  time      : 2024-01-17_22:55:52
  host      : ubuntu
  rank      : 2 (local_rank: 2)
  exitcode  : -4 (pid: 1304817)
  error_file: <N/A>
  traceback : Signal 4 (SIGILL) received by PID 1304817
[3]:
  time      : 2024-01-17_22:55:52
  host      : ubuntu
  rank      : 3 (local_rank: 3)
  exitcode  : -4 (pid: 1304818)
  error_file: <N/A>
  traceback : Signal 4 (SIGILL) received by PID 1304818

Root Cause (first observed failure):
[0]:
  time      : 2024-01-17_22:55:52
  host      : ubuntu
  rank      : 0 (local_rank: 0)
  exitcode  : -4 (pid: 1304815)
  error_file: <N/A>
  traceback : Signal 4 (SIGILL) received by PID 1304815

System: Ubuntu 20.04, CUDA: 11.4, GPU: A100 40G

xmu-xiaoma666 commented 8 months ago

Hello, based on the information you provided, this seems to be an issue with multi-GPU operation.

Regarding the warning "/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated":

I run this code on 4 RTX 3090 GPUs, so I did not encounter a similar warning. After searching for this problem on Google, I think you can resolve it by following https://github.com/facebookresearch/detr/issues/578
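
If you prefer to switch to the recommended launcher, something along these lines should be roughly equivalent (untested on my side; torchrun sets --use_env by default, so train_x_dreamer.py would need to read the rank from os.environ['LOCAL_RANK'] if it currently relies on a --local_rank argument):

torchrun --nproc_per_node=4 \
    train_x_dreamer.py \
    --config configs/cupcake_geometry.json \
    --out-dir 'results/result_XDreamer/cupcake_geometry'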

As for the error you provided, I haven't encountered a similar situation while running the code. Can you provide more detailed information, such as the exact place or specific line where the error occurred?
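
If you want to narrow it down, one option (untested with this repository, and the script may still expect the launcher's environment variables) is to run a single process directly with Python's fault handler enabled; faulthandler prints the Python stack when a fatal signal such as SIGILL arrives, which should show the exact line:

CUDA_VISIBLE_DEVICES=0 python -X faulthandler train_x_dreamer.py \
    --config configs/cupcake_geometry.json \
    --out-dir 'results/result_XDreamer/cupcake_geometry'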

Asianfleet commented 8 months ago

> Hello, based on the information you provided, this seems to be an issue with multi-GPU operation.
>
> Regarding the warning "/data/anaconda3/envs/XDreamer/lib/python3.9/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated":
>
> I run this code on 4 RTX 3090 GPUs, so I did not encounter a similar warning. After searching for this problem on Google, I think you can resolve it by following facebookresearch/detr#578
>
> As for the error you provided, I haven't encountered a similar situation while running the code. Can you provide more detailed information, such as the exact place or specific line where the error occurred?

I didn't notice the multi-GPU configuration when I ran it, but I only have one GPU. Does your code work on a single GPU?

xmu-xiaoma666 commented 8 months ago

[image]

If you have only one GPU, you can set the parameter '--nproc_per_node=1'. However, it's important to note that when you run this code using only one GPU, the generated results may be slightly inferior to what was reported in our paper. This is because each batch will contain only one perspective image for optimization.
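
For example, the same command as before with the process count reduced to one should look like this (I have not tested it on a single-GPU machine myself):

python -m torch.distributed.launch --nproc_per_node=1 \
    train_x_dreamer.py \
    --config configs/cupcake_geometry.json \
    --out-dir 'results/result_XDreamer/cupcake_geometry'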

Asianfleet commented 8 months ago

> [image]
>
> If you have only one GPU, you can set the parameter '--nproc_per_node=1'. However, it's important to note that when you run this code using only one GPU, the generated results may be slightly inferior to what was reported in our paper. This is because each batch will contain only one perspective image for optimization.

I tried setting the parameter '--nproc_per_node=1' and received an error:

/data/AIMH/envs/XDreamer/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

  warnings.warn(
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -4) local_rank: 0 (pid: 711252) of binary: /data/AIMH/envs/XDreamer/bin/python
Traceback (most recent call last):
  File "/data/AIMH/envs/XDreamer/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/AIMH/envs/XDreamer/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/AIMH/envs/XDreamer/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/data/AIMH/envs/XDreamer/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/data/AIMH/envs/XDreamer/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/data/AIMH/envs/XDreamer/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/data/AIMH/envs/XDreamer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/AIMH/envs/XDreamer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_x_dreamer.py FAILED

Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-21_18:16:10
  host      : ubuntu
  rank      : 0 (local_rank: 0)
  exitcode  : -4 (pid: 711252)
  error_file: <N/A>
  traceback : Signal 4 (SIGILL) received by PID 711252
======================================================

xmu-xiaoma666 commented 8 months ago

Because my machine has multiple GPUs, I am not able to test what problems my method will encounter on a single GPU. Perhaps setting FLAGS.multi_gpu in the code to False will solve your problem. If that does not resolve it, it may be necessary to refactor the code to obtain a version that runs on a single-GPU machine.
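
As a rough illustration only (this is not the actual initialization code in train_x_dreamer.py, and the argument handling here is a stand-in), the kind of guard I mean would look roughly like this, where FLAGS.multi_gpu decides whether a distributed process group is created at all:

# Hypothetical single-GPU guard; the real setup in train_x_dreamer.py may differ.
import argparse
import os
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--multi_gpu", action="store_true")  # stand-in for the repo's FLAGS
FLAGS, _ = parser.parse_known_args()

if FLAGS.multi_gpu:
    # Multi-GPU path: one process per GPU, launched via torch.distributed.launch / torchrun.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    torch.distributed.init_process_group(backend="nccl")
else:
    # Single-GPU path: skip process-group initialization entirely.
    local_rank = 0
    torch.cuda.set_device(local_rank)

If the same SIGILL still appears with the single-GPU path, the crash is probably not caused by the distributed setup itself.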

Asianfleet commented 8 months ago

> Because my machine has multiple GPUs, I am not able to test what problems my method will encounter on a single GPU. Perhaps setting FLAGS.multi_gpu in the code to False will solve your problem. If that does not resolve it, it may be necessary to refactor the code to obtain a version that runs on a single-GPU machine.

Thanks! I will try it later.

Asianfleet commented 8 months ago

I changed FLAGS.multi_gpu to False in renderer and train_x_dreamer, but I still get the same error.