xdit-project / xDiT

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters

How to run PixArt-alpha/PixArt-XL-2-1024-MS with multiple GPUs? #136

Closed · lambda7xx closed this issue 1 month ago

lambda7xx commented 1 month ago

My script is below:

torchrun --nproc_per_node=2 examples/pixartalpha_example.py \
--model PixArt-alpha/PixArt-XL-2-1024-MS \
--height 2048 \
--width 2048 \
--pipefusion_parallel_degree 1 \
--num_inference_steps 20 \
--warmup_steps 0 \
--prompt "A small dog" \
--use_split_batch > pixart_example.log 2>&1
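
Editor's aside, an assumption rather than documented xDiT behavior: --use_split_batch appears to split the classifier-free-guidance batch across 2 ranks, so that factor times --pipefusion_parallel_degree should match --nproc_per_node, i.e. 1 x 2 = 2 for this command. A minimal sanity-check sketch in Python:

# Editor's sketch, assuming xDiT expects the product of the parallel degrees to
# equal the torchrun world size, with --use_split_batch contributing a factor
# of 2 (the cond/uncond halves of the CFG batch on separate ranks).
import os

world_size = int(os.environ.get("WORLD_SIZE", "1"))  # set by torchrun
pipefusion_degree = 1                                # --pipefusion_parallel_degree
split_batch_factor = 2                               # 2 if --use_split_batch else 1

assert pipefusion_degree * split_batch_factor == world_size, (
    f"parallel degrees multiply to {pipefusion_degree * split_batch_factor}, "
    f"but torchrun launched {world_size} processes"
)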

My error log is:

Loading pipeline components...:  80%|████████  | 4/5 [00:08<00:02,  2.65s/it]
Loading pipeline components...:  80%|████████  | 4/5 [00:08<00:02,  2.08s/it]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/xxxx/PipeFusion/examples/pixartalpha_example.py", line 53, in <module>
[rank1]:     main()
[rank1]:   File "/home/xxxx/PipeFusion/examples/pixartalpha_example.py", line 15, in main
[rank1]:     pipe = PipeFuserPixArtAlphaPipeline.from_pretrained(
[rank1]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/pipefusion-0.2-py3.10.egg/pipefuser/pipelines/pipeline_pixart_alpha.py", line 54, in from_pretrained
[rank1]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
[rank1]:     return fn(*args, **kwargs)
[rank1]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/diffusers-0.29.0-py3.10.egg/diffusers/pipelines/pipeline_utils.py", line 881, in from_pretrained
[rank1]:     loaded_sub_model = load_sub_model(
[rank1]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/diffusers-0.29.0-py3.10.egg/diffusers/pipelines/pipeline_loading_utils.py", line 703, in load_sub_model
[rank1]:     loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
[rank1]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1825, in from_pretrained
[rank1]:     return cls._from_pretrained(
[rank1]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2044, in _from_pretrained
[rank1]:     raise ValueError(
[rank1]: ValueError: Non-consecutive added token '<pad>' found. Should have index 32100 but has index 0 in saved vocabulary.
E0720 13:06:14.886000 139679606892352 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 2708514) of binary: /home/xxxx/anaconda3/envs/stable/bin/python3
Traceback (most recent call last):
  File "/home/xxxx/anaconda3/envs/stable/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.1', 'console_scripts', 'torchrun')())
  File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
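
Editor's note, a hedged reading of the traceback rather than a confirmed diagnosis: the ValueError is raised while diffusers loads the pipeline's T5 tokenizer sub-component, before any distributed work starts, so the parallel flags are unlikely to be the cause. The message says the saved added_tokens.json records '<pad>' at index 0 while the slow-tokenizer loader in this transformers build expects added tokens to start at 32100, which usually indicates the tokenizer files were written by a newer transformers release than the one installed. A minimal single-process reproduction sketch:

# Editor's reproduction sketch: load just the tokenizer, outside torchrun.
# If this raises the same "Non-consecutive added token '<pad>'" ValueError,
# the failure is in the tokenizer files / transformers version, not in the
# multi-GPU setup. T5Tokenizer is assumed to be the class the pipeline's
# model_index.json names for the tokenizer component.
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",
    subfolder="tokenizer",
)
print(tok.pad_token, tok.pad_token_id)
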
lambda7xx commented 1 month ago

My script for 4 GPUs is below:

torchrun --nproc_per_node=4 examples/pixartalpha_example.py \
--model PixArt-alpha/PixArt-XL-2-1024-MS \
--height 2048 \
--width 2048 \
--pipefusion_parallel_degree 2 \
--num_inference_steps 20 \
--warmup_steps 0 \
--prompt "A small dog" \
--use_split_batch > pixart_example.log 2>&1

My error log is:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/xxxx/PipeFusion/examples/pixartalpha_example.py", line 53, in <module>
[rank0]:     main()
[rank0]:   File "/home/xxxx/PipeFusion/examples/pixartalpha_example.py", line 15, in main
[rank0]:     pipe = PipeFuserPixArtAlphaPipeline.from_pretrained(
[rank0]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/pipefusion-0.2-py3.10.egg/pipefuser/pipelines/pipeline_pixart_alpha.py", line 54, in from_pretrained
[rank0]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/diffusers-0.29.0-py3.10.egg/diffusers/pipelines/pipeline_utils.py", line 881, in from_pretrained
[rank0]:     loaded_sub_model = load_sub_model(
[rank0]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/diffusers-0.29.0-py3.10.egg/diffusers/pipelines/pipeline_loading_utils.py", line 703, in load_sub_model
[rank0]:     loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
[rank0]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1825, in from_pretrained
[rank0]:     return cls._from_pretrained(
[rank0]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2044, in _from_pretrained
[rank0]:     raise ValueError(
[rank0]: ValueError: Non-consecutive added token '<pad>' found. Should have index 32100 but has index 0 in saved vocabulary.

Loading pipeline components...:   0%|          | 0/5 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/xxxx/PipeFusion/examples/pixartalpha_example.py", line 53, in <module>
[rank1]:     main()
[rank1]:   File "/home/xxxx/PipeFusion/examples/pixartalpha_example.py", line 15, in main
[rank1]:     pipe = PipeFuserPixArtAlphaPipeline.from_pretrained(
[rank1]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/pipefusion-0.2-py3.10.egg/pipefuser/pipelines/pipeline_pixart_alpha.py", line 54, in from_pretrained
[rank1]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
[rank1]:     return fn(*args, **kwargs)
[rank1]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/diffusers-0.29.0-py3.10.egg/diffusers/pipelines/pipeline_utils.py", line 881, in from_pretrained
[rank1]:     loaded_sub_model = load_sub_model(
[rank1]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/diffusers-0.29.0-py3.10.egg/diffusers/pipelines/pipeline_loading_utils.py", line 703, in load_sub_model
[rank1]:     loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
[rank1]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1825, in from_pretrained
[rank1]:     return cls._from_pretrained(
[rank1]:   File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2044, in _from_pretrained
[rank1]:     raise ValueError(
[rank1]: ValueError: Non-consecutive added token '<pad>' found. Should have index 32100 but has index 0 in saved vocabulary.
W0720 13:09:00.409000 139697658783552 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2712629 closing signal SIGTERM
W0720 13:09:00.410000 139697658783552 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2712630 closing signal SIGTERM
E0720 13:09:00.739000 139697658783552 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 2712627) of binary: /home/xxxx/anaconda3/envs/stable/bin/python3
Traceback (most recent call last):
  File "/home/xxxx/anaconda3/envs/stable/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.1', 'console_scripts', 'torchrun')())
  File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xxxx/anaconda3/envs/stable/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
examples/pixartalpha_example.py FAILED
------------------------------------------------------------
Failures:
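
Editor's note on possible remedies, hedged sketches rather than maintainer-confirmed fixes: the remedy most often reported for this message is upgrading transformers, since the handling of added_tokens.json was reworked in later releases. Alternatively, if the model repo ships a tokenizer.json for the fast implementation, you can build the fast tokenizer yourself and pass it into the pipeline; diffusers' from_pretrained accepts pre-built components and skips loading them from disk. The sketch below uses plain diffusers so it is self-contained; whether the same tokenizer= override threads through PipeFuserPixArtAlphaPipeline.from_pretrained is an assumption (the traceback does show it forwarding into diffusers' pipeline_utils).

# Editor's diagnostic-plus-workaround sketch (assumptions noted inline).
import torch
from huggingface_hub import list_repo_files
from transformers import T5TokenizerFast
from diffusers import PixArtAlphaPipeline

MODEL = "PixArt-alpha/PixArt-XL-2-1024-MS"

# 1. See which tokenizer files the repo actually ships. If added_tokens.json is
#    present but tokenizer.json is not, the slow-tokenizer check in older
#    transformers is what fails, and upgrading transformers is the likelier fix.
print([f for f in list_repo_files(MODEL) if f.startswith("tokenizer/")])

# 2. If tokenizer.json exists, a pre-built fast tokenizer can be handed to the
#    pipeline so diffusers never touches the failing slow-tokenizer path.
tokenizer = T5TokenizerFast.from_pretrained(MODEL, subfolder="tokenizer")
pipe = PixArtAlphaPipeline.from_pretrained(
    MODEL,
    tokenizer=tokenizer,      # pre-built component; diffusers skips loading it
    torch_dtype=torch.float16,
)
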
lonngxiang commented 3 weeks ago

Same error here.

lonngxiang commented 3 weeks ago

[screenshot attached]