xdit-project / xDiT

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters
Apache License 2.0

# of parameters on each device #135

Closed wonkyoc closed 2 months ago

wonkyoc commented 2 months ago

How does PipeFusion store only 1/N of the parameters? (N is the # of patches)

The paper describes:

Regarding memory efficiency, each device in the PipeFusion setup stores only 1/N of the parameters relevant to its specific stage. Since the use of stale KV for attention computation requires that each device maintains the full spatial KV for the corresponding L/N layers of its stage, this overhead is significantly smaller than that of DistriFusion and diminishes as the number of devices increases.

Could you elaborate on this? My understanding is that each micro-step infers an individual patch as a form of tensor parallelism. For instance, at micro-step 0, device 0 loads the parameters for patch 0; at micro-step 1, device 0 unloads those and loads the parameters for patch 1. This way, at any micro-step, each device stores only 1/N of the parameters.

Am I correct?

Steaunk commented 2 months ago

Hi, N is the number of devices. It works like pipeline parallelism: we divide the parameters by layers across the devices.

feifeibear commented 2 months ago

How does PipeFusion store only 1/N of the parameters? (N is the # of patches)

N is the # of GPUs. PipeFusion splits the model parameters across devices like GPipe and other pipeline-parallel schemes. For example, if you have 48 layers and N is 4, layers 0-11 sit on device 0, layers 12-23 on device 1, and so on.
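The partitioning described above can be sketched in a few lines of Python. Note this is an illustrative sketch, not xDiT's actual API; `partition_layers` is a hypothetical helper name.

```python
# Hypothetical sketch of pipeline-style layer partitioning: assign
# contiguous blocks of transformer layers to devices, so each device
# stores only 1/N of the model parameters.

def partition_layers(num_layers: int, num_devices: int) -> list[range]:
    """Split num_layers contiguous layers into num_devices equal stages."""
    per_stage = num_layers // num_devices
    return [range(d * per_stage, (d + 1) * per_stage)
            for d in range(num_devices)]

# The example from the comment above: 48 layers on N = 4 GPUs.
stages = partition_layers(48, 4)
for device, layers in enumerate(stages):
    print(f"device {device}: layers {layers.start}-{layers.stop - 1}")
# device 0 gets layers 0-11, device 1 gets 12-23, etc.
```

Each device thus holds the weights for only its own L/N layers, which is why parameter memory shrinks as N grows; only the stale KV cache for those layers must additionally be kept in full spatial resolution.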

wonkyoc commented 2 months ago

I see. Thanks a lot!