xdit-project / xDiT

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism

Tensor Parallelism with Stale Activations to Reduce Communication Overhead #116

Closed SaudxInu closed 4 months ago

SaudxInu commented 4 months ago

Hi,

PipeFusion - 'Considering DiT’s affinity to Large Language Models (LLMs), both tensor parallelism and sequence parallelism, which are commonly utilized for efficient inference in LLMs, can be adapted for DiT. Nevertheless, diffusion models have longer sequence lengths and smaller model sizes, yet the communication overhead remains substantial during inference.'

DistriFusion - 'Tensor parallelism, in particular, has been widely adopted for accelerating LLMs, which are characterized by their substantial model sizes, whereas their activation sizes are relatively small. In such scenarios, the communication overhead introduced by tensor parallelism is relatively minor compared to the substantial latency benefits brought by increased memory bandwidth. However, the situation differs for diffusion models, which are generally smaller than LLMs but are often bottlenecked by the large activation size due to the spatial dimensions, especially when generating high-resolution content. The communication overhead from tensor parallelism becomes a significant factor, overshadowing the actual computation time.'

Have you considered investigating tensor parallelism with stale activations? This might address the challenge of large activation sizes and reduce the communication overhead so that it can be hidden behind computation. It could be valuable to implement and include this as a baseline in your research.

PS: Am I missing something trivial?

@feifeibear

feifeibear commented 4 months ago

I believe combining stale activation with TP is not a good idea.

Firstly, TP communicates both Activation and Parameter, and using the Input Temporal Redundancy feature does not reduce the communication volume of Parameters. Secondly, PipeFusion splits parameters, and in terms of reducing parameter memory, it functions similarly to TP. If TP also uses stale activation, it must use full spatial shape KVs for every layer, therefore its memory efficiency is not as high as PipeFusion's.

In summary, TP combined with stale activation is not a promising research direction.

SaudxInu commented 4 months ago

Tensor parallelism only communicates activations, not parameters.

Hmmmm. I need to think more about it.

Is it possible to implement it, just as a sanity check?

feifeibear commented 4 months ago

I apologize for my previous incoherent remarks, and I have reconsidered your question.

TP (Tensor Parallelism) uses column-wise Linear + row-wise Linear + AllReduce for both the attn and mlp modules in DiT blocks.
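For concreteness, here is a minimal PyTorch sketch of this first scheme for the mlp path (attn is analogous, sharded over heads). The function name `tp_mlp` and the weight-shard layout are illustrative only, not xDiT's actual code, and it assumes a torch.distributed process group is already initialized (e.g. via torchrun):

```python
import torch
import torch.distributed as dist

def tp_mlp(x, w1_shard, w2_shard, group=None):
    """Scheme (1): column-wise Linear + row-wise Linear + AllReduce.

    x:        [seq, batch, hidden]            -- replicated on every rank
    w1_shard: [hidden, ffn_hidden // world]   -- column shard of the first Linear
    w2_shard: [ffn_hidden // world, hidden]   -- row shard of the second Linear
    """
    h = torch.nn.functional.gelu(x @ w1_shard)             # local, no communication
    y = h @ w2_shard                                       # partial sum over the sharded ffn dim
    dist.all_reduce(y, op=dist.ReduceOp.SUM, group=group)  # one AllReduce per mlp
    return y
```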

Alternatively, it can be implemented as AllGather + column-wise Linear + row-wise Linear + ReduceScatter.
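A corresponding sketch of this second scheme (again illustrative, not xDiT's implementation): the activation stays sharded along the sequence (spatial) dimension between blocks, AllGather restores the full spatial shape before the Linears, and ReduceScatter both sums the partial outputs and re-shards them along the sequence:

```python
import torch
import torch.distributed as dist

def tp_sp_mlp(x_shard, w1_shard, w2_shard, group=None):
    """Scheme (2): AllGather + column-wise Linear + row-wise Linear + ReduceScatter.

    x_shard: [seq // world, batch, hidden] -- sharded along the sequence dim (dim 0).
    w1_shard / w2_shard are the same column / row shards as in the previous sketch.
    """
    world = dist.get_world_size(group)
    seq_shard, batch, hidden = x_shard.shape
    x_full = torch.empty(seq_shard * world, batch, hidden,
                         dtype=x_shard.dtype, device=x_shard.device)
    dist.all_gather_into_tensor(x_full, x_shard, group=group)           # full spatial shape
    y_partial = torch.nn.functional.gelu(x_full @ w1_shard) @ w2_shard  # partial sums
    y_shard = torch.empty_like(x_shard)
    dist.reduce_scatter_tensor(y_shard, y_partial, group=group)         # sum + re-shard seq
    return y_shard
```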

AllReduce and ReduceScatter cannot be made asynchronous, because they sum partial activations of the full spatial shape coming from different machines, and those partial activations cannot be replaced with stale activations.
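To spell this out (my notation, not from the paper): for the row-wise Linear the input is split column-wise as $X = [X_1, \dots, X_N]$ and the weight row-wise as $W = [W_1; \dots; W_N]$, so the AllReduce computes

$$
Y = \sum_{i=1}^{N} X_i W_i .
$$

Each partial product $X_i W_i$ already has the full spatial shape, so every token of $Y$ depends on the partial sums from all ranks. Replacing the remote partials with values cached from the previous diffusion step would make every element of the output stale, unlike patch-wise schemes where staleness stays confined to the other ranks' spatial regions.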

Looking forward to your feedback.

SaudxInu commented 4 months ago

For the first scheme, TP (Tensor Parallelism) with column-wise Linear + row-wise Linear + AllReduce for both the attn and mlp modules in DiT blocks, we can do an async AllGather and then sum, right?

Am I missing some implementation nuances that make this impossible?

feifeibear commented 4 months ago

You can use a feature with partially stale values from the previous diffusion step. But if you use an async AllReduce (you may think it can be implemented as AllGather + sum, which is not right), you end up using entirely stale values.
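A toy example of the difference (not xDiT code; the shapes and the helper `partially_stale_allgather` are made up for illustration):

```python
import torch

def partially_stale_allgather(x_local_fresh, stale_cache, rank):
    """Rebuild the full-spatial-shape feature from a fresh local token shard plus
    the other ranks' shards cached at the previous diffusion step. Only the
    remote spatial regions are stale; the local region is always current."""
    shards = list(stale_cache)        # per-rank shards from step t-1
    shards[rank] = x_local_fresh      # overwrite the local shard with step-t values
    return torch.cat(shards, dim=0)   # full spatial shape, mixed freshness per token

if __name__ == "__main__":
    cache = [torch.randn(4, 8), torch.randn(4, 8)]   # 2 "ranks", 4 tokens each
    fresh = torch.randn(4, 8)                        # this rank's tokens at step t
    print(partially_stale_allgather(fresh, cache, rank=0).shape)  # torch.Size([8, 8])

# By contrast, an "async AllReduce" would add this rank's fresh *partial sum*
# (covering only its slice of the hidden dimension) to the other ranks' partial
# sums from step t-1, so every token of the reduced output already mixes stale
# contributions: no region of the feature stays fully fresh.
```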

feifeibear commented 4 months ago

I close this issue. Feel free to contact me if you need more discussion.

SaudxInu commented 4 months ago

> I close this issue. Feel free to contact me if you need more discussion.

Thanks for the active participation. Much appreciated.