Closed wonkyoc closed 2 months ago
Hi, N is the number of devices. It works like pipeline parallelism: we divide the parameters by layers across the devices.
How does PipeFusion store only 1/N of the parameters? (N is the number of patches)
The paper describes:
Regarding memory efficiency, each device in the PipeFusion setup stores only 1/N of the parameters relevant to its specific stage. Since the use of stale KV for attention computation requires that each device maintains the full spatial KV for the corresponding L/N layers of its stage, this overhead is significantly smaller than that of DistriFusion and diminishes as the number of devices increases.
Could you elaborate on this? My understanding is that each micro-step infers an individual patch, as a form of tensor parallelism. For instance, at micro-step 0, device 0 loads the parameters for patch 0. At micro-step 1, device 0 unloads the parameters for patch 0 and loads the parameters for patch 1. This way, at each micro-step, every device stores only 1/N of the parameters.
Am I correct?
N is the number of GPUs. PipeFusion splits model parameters by layer, like GPipe and other pipeline-parallel schemes. For example, if you have 48 layers and N is 4, layers 0-11 go on device 0, layers 12-23 on device 1, etc.
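The partitioning described above can be sketched as follows. This is a minimal illustration, not PipeFusion's actual code; the function name and the assumption that the layer count divides evenly are mine.

```python
# Illustrative sketch: contiguous layer partitioning across N devices,
# as in pipeline parallelism. Assumes num_layers is divisible by num_devices.

def partition_layers(num_layers: int, num_devices: int) -> list[range]:
    """Assign a contiguous block of num_layers / num_devices layers to each device."""
    per_device = num_layers // num_devices
    return [
        range(d * per_device, (d + 1) * per_device)
        for d in range(num_devices)
    ]

# With 48 layers and N = 4 devices, as in the example above:
for device, layers in enumerate(partition_layers(48, 4)):
    print(f"device {device}: layers {layers.start}-{layers.stop - 1}")
# device 0: layers 0-11
# device 1: layers 12-23
# device 2: layers 24-35
# device 3: layers 36-47
```

Each device holds only its own block of layers permanently, which is why the per-device parameter memory is 1/N of the full model.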
I see. Thanks a lot!