xdit-project / xDiT

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters

Cannot allocate memory #190

Closed lonngxiang closed 1 month ago

lonngxiang commented 1 month ago

Is the GPU memory not enough?

File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/transformers/modeling_utils.py", line 549, in load_state_dict [rank6]: with safe_open(checkpoint_file, framework="pt") as f: [rank6]: RuntimeError: unable to mmap 9989150328 bytes from file </ai/PixArt-XL-2-1024-MS/text_encoder/model-00001-of-00002.safetensors>: Cannot allocate memory (12)

Eigensystem commented 1 month ago

It seems that you don't have enough memory to open the model. Can you run the model with diffusers?
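For reference, a minimal single-GPU check with plain diffusers could look roughly like the sketch below (the model path and prompt are taken from this thread; exact arguments may differ for your setup):

```python
# A minimal single-GPU sanity check with plain diffusers (no xDiT parallelism).
# The local model path and prompt are taken from this thread; adjust as needed.
import torch
from diffusers import PixArtAlphaPipeline

pipe = PixArtAlphaPipeline.from_pretrained(
    "/ai/PixArt-XL-2-1024-MS",
    torch_dtype=torch.float16,  # fp16 roughly halves the memory needed for the text encoder
)
pipe.to("cuda")

image = pipe(prompt="A small dog", num_inference_steps=20).images[0]
image.save("small_dog.png")
```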

lonngxiang commented 1 month ago

> It seems that you don't have enough memory to open the model. Can you run the model with diffusers?

Yes, on a 4090.

lonngxiang commented 1 month ago

Now this error happens:

torchrun --nproc_per_node=2 examples/pixartalpha_example.py --model /ai/PixArt-XL-2-1024-MS --pipefusion_parallel_degree 2 --ulysses_degree 2 --num_inference_steps 20 --warmup_steps 0 --prompt "A small dog" --use_cfg_parallel

lonngxiang commented 1 month ago

```
W0813 04:41:10.994219 139702199338816 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
/home/anaconda3/envs/llm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/anaconda3/envs/llm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
/home/anaconda3/envs/llm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/anaconda3/envs/llm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
WARNING 08-13 04:41:13 [args.py:143] Distributed environment is not initialized. Initializing...
DEBUG 08-13 04:41:13 [parallel_state.py:141] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
WARNING 08-13 04:41:14 [args.py:143] Distributed environment is not initialized. Initializing...
DEBUG 08-13 04:41:14 [parallel_state.py:141] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
INFO 08-13 04:41:14 [config.py:90] Ring degree not set, using default value 1
INFO 08-13 04:41:14 [config.py:126] Pipeline patch number not set, using default value 2
rank0: Traceback (most recent call last):
rank0:   File "/ai/xDiT/examples/pixartalpha_example.py", line 69, in <module>
rank0:   File "/ai/xDiT/examples/pixartalpha_example.py", line 19, in main
rank0:     engine_config, input_config = engine_args.create_config()
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/xfuser/config/args.py", line 160, in create_config
rank0:     parallel_config = ParallelConfig(
rank0:   File "<string>", line 7, in __init__
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/xfuser/config/config.py", line 163, in __post_init__
rank0:     assert parallel_world_size == world_size, (
rank0: AssertionError: parallel_world_size 8 must be equal to world_size 2
INFO 08-13 04:41:14 [config.py:90] Ring degree not set, using default value 1
INFO 08-13 04:41:14 [config.py:126] Pipeline patch number not set, using default value 2
rank1: Traceback (most recent call last):
rank1:   File "/ai/xDiT/examples/pixartalpha_example.py", line 69, in <module>
rank1:   File "/ai/xDiT/examples/pixartalpha_example.py", line 19, in main
rank1:     engine_config, input_config = engine_args.create_config()
rank1:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/xfuser/config/args.py", line 160, in create_config
rank1:     parallel_config = ParallelConfig(
rank1:   File "<string>", line 7, in __init__
rank1:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/xfuser/config/config.py", line 163, in __post_init__
rank1:     assert parallel_world_size == world_size, (
rank1: AssertionError: parallel_world_size 8 must be equal to world_size 2
W0813 04:41:14.712880 139702199338816 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 12876 closing signal SIGTERM
E0813 04:41:14.744440 139702199338816 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 12875) of binary: /home/anaconda3/envs/llm/bin/python3.10
Traceback (most recent call last):
  File "/home/anaconda3/envs/llm/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

examples/pixartalpha_example.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-13_04:41:14
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 12875)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
feifeibear commented 1 month ago

You must make sure cfg x pipefusion x ulysses x ring = gpu_num.

The following commands are valid:

torchrun --nproc_per_node=2 examples/pixartalpha_example.py --model /ai/PixArt-XL-2-1024-MS --pipefusion_parallel_degree 1 --ulysses_degree 1 --num_inference_steps 20 --warmup_steps 0 --prompt "A small dog" --use_cfg_parallel

torchrun --nproc_per_node=2 examples/pixartalpha_example.py --model /ai/PixArt-XL-2-1024-MS --pipefusion_parallel_degree 2 --ulysses_degree 1 --num_inference_steps 20 --warmup_steps 0 --prompt "A small dog"

torchrun --nproc_per_node=2 examples/pixartalpha_example.py --model /ai/PixArt-XL-2-1024-MS --pipefusion_parallel_degree 1 --ulysses_degree 2 --num_inference_steps 20 --warmup_steps 0 --prompt "A small dog"
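To make the constraint concrete: the failing command asked for cfg = 2 (--use_cfg_parallel), pipefusion = 2, ulysses = 2 and ring = 1, i.e. 2 × 2 × 2 × 1 = 8 processes, while torchrun --nproc_per_node=2 only launched 2. A rough sketch of the check that raises the AssertionError above (the real assertion lives in xfuser/config/config.py; the helper name here is purely illustrative):

```python
# Illustrative helper mirroring the world-size check in xfuser's ParallelConfig;
# the actual assertion is in xfuser/config/config.py (__post_init__).
def check_parallel_degrees(cfg: int, pipefusion: int, ulysses: int, ring: int, world_size: int) -> None:
    parallel_world_size = cfg * pipefusion * ulysses * ring
    assert parallel_world_size == world_size, (
        f"parallel_world_size {parallel_world_size} must be equal to world_size {world_size}"
    )

# Failing command: cfg=2 (--use_cfg_parallel), pipefusion=2, ulysses=2, ring defaults to 1,
# but torchrun --nproc_per_node=2 starts only 2 processes -> 2*2*2*1 = 8 != 2.
# check_parallel_degrees(cfg=2, pipefusion=2, ulysses=2, ring=1, world_size=2)  # AssertionError

# First valid command above: cfg=2, everything else 1, on 2 GPUs.
check_parallel_degrees(cfg=2, pipefusion=1, ulysses=1, ring=1, world_size=2)  # passes
```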

lonngxiang commented 1 month ago

> You must make sure cfg x pipefusion x ulysses x ring = gpu_num.
>
> The following commands are valid:
>
> torchrun --nproc_per_node=2 examples/pixartalpha_example.py --model /ai/PixArt-XL-2-1024-MS --pipefusion_parallel_degree 1 --ulysses_degree 1 --num_inference_steps 20 --warmup_steps 0 --prompt "A small dog" --use_cfg_parallel
>
> torchrun --nproc_per_node=2 examples/pixartalpha_example.py --model /ai/PixArt-XL-2-1024-MS --pipefusion_parallel_degree 2 --ulysses_degree 1 --num_inference_steps 20 --warmup_steps 0 --prompt "A small dog"
>
> torchrun --nproc_per_node=2 examples/pixartalpha_example.py --model /ai/PixArt-XL-2-1024-MS --pipefusion_parallel_degree 1 --ulysses_degree 2 --num_inference_steps 20 --warmup_steps 0 --prompt "A small dog"

Thanks, it works! But what are cfg and ring?

Eigensystem commented 1 month ago

They correspond to --use_cfg_parallel and --ring_degree. The ring degree defaults to 1. If --use_cfg_parallel is set, cfg is 2; otherwise it is 1. ring_degree is the ring-attention degree. You can refer to https://github.com/xdit-project/xDiT/blob/main/docs/methods/cfg_parallel.md and https://github.com/xdit-project/xDiT/blob/main/docs/methods/usp.md.
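Putting the two flags together with the constraint above: for a given GPU count, the admissible (cfg, pipefusion, ulysses, ring) settings are simply the factorizations of that count, with cfg restricted to 1 or 2. A small illustrative sketch (not part of xDiT itself):

```python
from itertools import product

def valid_degree_combos(num_gpus: int):
    """Yield (cfg, pipefusion, ulysses, ring) degrees whose product equals num_gpus.

    cfg is 2 when --use_cfg_parallel is set, otherwise 1; the other degrees map to
    --pipefusion_parallel_degree, --ulysses_degree and --ring_degree.
    """
    for cfg, pipefusion, ulysses, ring in product((1, 2), *([range(1, num_gpus + 1)] * 3)):
        if cfg * pipefusion * ulysses * ring == num_gpus:
            yield cfg, pipefusion, ulysses, ring

# On 2 GPUs this prints (1, 1, 1, 2), (1, 1, 2, 1), (1, 2, 1, 1), (2, 1, 1, 1):
# the three example commands above plus a ring_degree=2 variant.
print(list(valid_degree_combos(2)))
```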