xdit-project / xDiT

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters
Apache License 2.0

SD3-pipefusion 1 GPU error #105

Closed feifeibear closed 1 month ago

feifeibear commented 2 months ago

deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
INFO 06-26 09:59:48 [distri_dit_sd3_pipefusion.py:49] Using pipeline parallelism, world_size: 1 and n_device_per_batch: 1
DistriSD3Pipeline from pretrain stage 1 0.0 GB
Loading pipeline components...: 22%|████████████████▉ | 2/9 [00:00<00:01, 6.46it/s]
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 4.24it/s]
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████| 9/9 [00:01<00:00, 7.08it/s]
DistriSD3Pipeline from pretrain stage 2 17.6923648 GB
0%| | 0/3 00:00<?, ?it/s:
Traceback (most recent call last):
rank0: File "/mnt/fjr/distrifuser-DiT/scripts/sd3_example.py", line 224, in
rank0: File "/mnt/fjr/distrifuser-DiT/scripts/sd3_example.py", line 147, in main
rank0: pipeline = DistriSD3Pipeline.from_pretrained(
rank0: File "/mnt/fjr/distrifuser-DiT/pipefuser/pipelines/sd3.py", line 114, in from_pretrained
rank0: ret = DistriSD3Pipeline(pipeline, distri_config)
rank0: File "/mnt/fjr/distrifuser-DiT/pipefuser/pipelines/sd3.py", line 43, in init
rank0: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/mnt/fjr/distrifuser-DiT/pipefuser/pipelines/sd3.py", line 161, in prepare
rank0: File "/mnt/fjr/distrifuser-DiT/pipefuser/pipelines/pip/distri_sd3.py", line 299, in call
rank0: File "/mnt/fjr/distrifuser-DiT/pipefuser/utils.py", line 298, in irecv_from_prev
rank0: File "/mnt/fjr/distrifuser-DiT/pipefuser/utils.py", line 256, in first_recv_from_prev
rank0: self.recv_shape = self.recv_shape_comm()
rank0: File "/mnt/fjr/distrifuser-DiT/pipefuser/utils.py", line 231, in recv_shape_comm
rank0: dist.recv(dim, src=self.prev_rank, group=self.extra_group if is_extra else None)
rank0: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
rank0: return func(*args, **kwargs)
rank0: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1929, in recv
rank0: pg.recv([tensor], src, tag).wait()
rank0: RuntimeError: NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
E0626 09:59:56.969000 140689274761664 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 53339) of binary: /usr/bin/python

foreverpiano commented 1 month ago

@feifeibear Does SD3 still have problems? What is preventing it from running?

Steaunk commented 1 month ago

This happens because SD3 does not yet special-case device=1. As a workaround, you can run the diffusers SD3 pipeline instead.

feifeibear commented 1 month ago

This script works: https://github.com/PipeFusion/PipeFusion/blob/main/lagecy/scripts/sd3_example.py

foreverpiano commented 1 month ago

How does SD3 differ from the other models? Is it the joint attention part that is failing? @Steaunk

Steaunk commented 1 month ago

No, there is actually no difference in that part. The other models also special-case device=1; SD3 just has not had that check added yet.
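
For illustration, a minimal sketch of what a device=1 special case could look like. It is not the PipeFusion implementation: `StageInfo` and `plan_stage_comm` are hypothetical names, and the ring layout (each rank receives from `rank - 1` and sends to `rank + 1`, wrapping around) is an assumption. With `world_size == 1` the previous rank would be the rank itself, which is presumably why `dist.recv` fails in the traceback above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class StageInfo:
    """Hypothetical stand-in for the per-rank pipeline configuration."""
    world_size: int
    rank: int = 0


def plan_stage_comm(info: StageInfo) -> Tuple[Optional[int], Optional[int]]:
    """Return (prev_rank, next_rank), or (None, None) when there are no peers."""
    if info.world_size == 1:
        # Special case: with a single device the "previous" rank would be
        # this rank itself, so skip point-to-point communication entirely.
        return None, None
    # Assumed ring layout: neighbours wrap around at both ends.
    prev_rank = (info.rank - 1) % info.world_size
    next_rank = (info.rank + 1) % info.world_size
    return prev_rank, next_rank
```

A caller would then issue `dist.recv` only when `prev_rank` is not `None`, and otherwise run the whole model locally on the one device.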

foreverpiano commented 1 month ago

Oh, OK. So it is a small bug that can be fixed, right? The master branch currently just reports an error.

feifeibear commented 1 month ago

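Could you post the issue directly? If it is the gpu=1 problem, simply do not use pipeline parallelism (PP) on a single GPU. We have refactored the API: the new API supports pixart-alpha/sigma, and SD3 is WIP. Please post any bugs you hit in the legacy API; we will fix the ones that affect usage.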