xdit-project / xDiT

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters
Apache License 2.0

[Bug] Potential risk of getting stuck in PipeFusion #310

Open HOOLoLo opened 1 week ago

HOOLoLo commented 1 week ago

I have submitted an issue in PyTorch (https://github.com/pytorch/pytorch/issues/138074) that describes the problem, hoping they will add a new interface for setting a custom stream for communication.

This problem hasn't shown up so far only because NCCL's send kernel completes without waiting for the matching recv kernel when the data is smaller than 64 MB.
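
To make the failure pattern concrete, here is a minimal sketch (a hypothetical reproduction, not xDiT code; whether it actually hangs depends on the PyTorch/NCCL version and on how p2p ops are mapped to communicators and streams):

```python
# Minimal sketch of the hazard. Launch with:
#   torchrun --nproc_per_node=2 repro.py
# Each rank enqueues a send to the next rank, then a recv from the previous
# rank, on the same process group. If the send kernel cannot complete
# eagerly (large message), every rank's recv is queued behind its own
# unfinished send, and the ranks can deadlock.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", rank)))

    nxt, prv = (rank + 1) % world, (rank - 1) % world

    # 64M fp32 elements = 256 MB, well above the eager-send threshold
    # mentioned above.
    send_buf = torch.ones(64 * 1024 * 1024, device="cuda")
    recv_buf = torch.empty_like(send_buf)

    req = dist.isend(send_buf, dst=nxt)  # send kernel enqueued first
    dist.recv(recv_buf, src=prv)         # recv queued behind the send
    req.wait()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```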

Do you guys know of any other solutions?

HOOLoLo commented 1 week ago

@feifeibear Can you help me double-check this logic? I am not very familiar with this project.

feifeibear commented 1 week ago

Your code snippet in the issue is very helpful. But could you also give us a run script to reproduce the error in xDiT? Also, what kind of GPU cluster are you using?

HOOLoLo commented 8 hours ago

@feifeibear Sorry, I have been busy recently. It's hard to reproduce the error on GPU: the only way I can make the patch latent bigger is to increase the output image size, and an image large enough to trigger the error OOMs first. I came up with an idea: pair up the ranks for send and recv and create a dedicated group for each pair, so a rank's recv no longer waits behind its own send. A sketch of the idea follows, and here is a demo picture: {cdc3f97d-4f0a-435a-a5dc-35e966b31b65}
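
A minimal sketch of that idea, assuming a ring pipeline over `world_size` ranks (the helper names `build_pair_groups` and `ring_exchange` are hypothetical, not xDiT's API):

```python
# Sketch of the pairwise-group idea; helper names are hypothetical.
import torch.distributed as dist

def build_pair_groups(world_size):
    # One dedicated process group (and therefore one NCCL communicator with
    # its own internal stream) per directed ring edge (i -> i+1).
    # new_group() is collective: every rank must call it in the same order,
    # including ranks that do not belong to the pair.
    groups = {}
    for i in range(world_size):
        j = (i + 1) % world_size
        groups[(i, j)] = dist.new_group(ranks=sorted({i, j}))
    return groups

def ring_exchange(send_buf, recv_buf, groups):
    rank, world = dist.get_rank(), dist.get_world_size()
    nxt, prv = (rank + 1) % world, (rank - 1) % world
    # The send runs on the (rank -> nxt) communicator and the recv on the
    # (prv -> rank) one, so the recv kernel is no longer queued behind this
    # rank's own send kernel.
    send_req = dist.isend(send_buf, dst=nxt, group=groups[(rank, nxt)])
    recv_req = dist.irecv(recv_buf, src=prv, group=groups[(prv, rank)])
    send_req.wait()
    recv_req.wait()
```

The cost is one extra communicator per ring edge, but the send and recv then run on different NCCL communicators (each with its own internal stream), so the same-stream ordering dependency disappears without waiting for a custom-stream API in PyTorch.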