xrsrke / pipegoose

Large scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still work in progress)*
MIT License

Fused Optimizer #13

Open xrsrke opened 1 year ago

xrsrke commented 1 year ago

Since our DistributedOptimizer takes another optimizer and turns it into ZeRO-1, can we do the same for a fused optimizer, like this? It should take an optimizer and turn it into a fused ZeRO-1 in a generic way.

APIs

from torch.optim import Adam
from pipegoose.optim import FusedOptim

optim = Adam(model.parameters(), lr=1e-3)
optim = FusedOptim(optim).fuse()

loss.backward()
optim.step()
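
For the ZeRO-1 part, the fused optimizer could then be handed to DistributedOptimizer. A rough usage sketch; the DistributedOptimizer import path and its (optimizer, parallel_context) signature are assumed here, and FusedOptim is still just the proposed API from above:

from torch.optim import Adam
from pipegoose.optim import FusedOptim, DistributedOptimizer

optim = Adam(model.parameters(), lr=1e-3)
optim = FusedOptim(optim).fuse()  # swap in fused kernels where available
optim = DistributedOptimizer(optim, parallel_context)  # shard optimizer states -> fused ZeRO-1

loss.backward()
optim.step()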

TODO

isamu-isozaki commented 1 year ago

https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747 kind of does what you say I think
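
For reference, the TorchInductor route from that post looks roughly like this: write the update as plain elementwise PyTorch and let torch.compile fuse the pointwise ops into a single kernel on GPU, with no hand-written fused optimizer needed (adam_like_update below is just an illustration, simplified and without bias correction, not pipegoose code):

import torch

@torch.compile  # TorchInductor can fuse these pointwise ops into one kernel on GPU
def adam_like_update(param, grad, exp_avg, exp_avg_sq, lr, beta1, beta2, eps):
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    param.addcdiv_(exp_avg, exp_avg_sq.sqrt().add_(eps), value=-lr)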

xrsrke commented 1 year ago

@isamu-isozaki I was referring to a fused optimizer like FusedAdam from DeepSpeed (link). We fuse certain operations, such as element-wise operations, since these occupy the majority of the runtime during training.

Our goal is to enable the library to perform 3D parallelism in conjunction with DistributedOptimizer (ZeRO-1). We maintain a list of popular optimizers along with their fused versions. Then we create a mapping between a torch.optim.Optimizer and its corresponding fused version, which we subsequently feed to DistributedOptimizer. This is just one potential solution I have in mind :)
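
A minimal sketch of that mapping idea, assuming a DeepSpeed-style FusedAdam is available; the FUSED_OPTIM_MAP and to_fused names are hypothetical, not existing pipegoose code:

import torch
from torch.optim import Adam

try:
    from deepspeed.ops.adam import FusedAdam  # fused elementwise Adam kernel
except ImportError:
    FusedAdam = None

# torch.optim class -> fused counterpart; extend as more fused kernels get ported
# (note: DeepSpeed's FusedAdam defaults to AdamW-style decoupled weight decay via adam_w_mode)
FUSED_OPTIM_MAP = {Adam: FusedAdam}

def to_fused(optim: torch.optim.Optimizer) -> torch.optim.Optimizer:
    # Rebuild the optimizer with its fused counterpart, reusing the existing param groups.
    fused_cls = FUSED_OPTIM_MAP.get(type(optim))
    if fused_cls is None:
        return optim  # no fused version available; keep the original
    return fused_cls(optim.param_groups)

DistributedOptimizer could then call something like to_fused internally before sharding the optimizer states.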

isamu-isozaki commented 1 year ago

@xrsrke I think this is definitely possible if we were to make a fused ver of each optimizer beforehand yup. For the above link it was mainly for just converting generic pytorch code to fused ver. Then do you think this is pretty much the same issue as the porting cuda kernels issue?(or under it)

xrsrke commented 1 year ago

@isamu-isozaki

"Then do you think this is pretty much the same issue as the porting CUDA kernels issue?"

Yes.

"For the above link, it was mainly for just converting generic PyTorch code to a fused version."

Or maybe we could fuse the entire model after parallelizing it (using TensorParallel, PipelineParallel...).

Would you like to take on both issues (this one and the CUDA kernel port)? I will merge them both for you and assign them to you. Let me know if you need a GPU for testing, although any GPU should work for this, since we will just be testing the correctness of the fused versions.

isamu-isozaki commented 1 year ago

@xrsrke sounds good. I think I can do the initial setup for how we want the CUDA code formatted, plus some examples, and then we can probably start accepting CUDA kernel PR contributions for each optimizer.

xrsrke commented 1 year ago

Thank you. @isamu-isozaki Also, if you look at those fused optimizers, the only thing they do is replace one or a few operations with their fused versions (am I missing something?) and keep everything else the same. So it would be amazing if we could take an arbitrary optimizer, replace only the operations for which we have a fused version available, and keep everything else the same, so that users who have tweaked their optimizer can still use those tweaks. What do you think?
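
A rough sketch of one way this could look; the FUSED_STEP_REGISTRY and PartiallyFusedOptim names are made up for illustration, not existing pipegoose code:

from torch.optim import Optimizer

FUSED_STEP_REGISTRY = {}  # optimizer class -> drop-in fused step function

def register_fused_step(optim_cls):
    def decorator(fn):
        FUSED_STEP_REGISTRY[optim_cls] = fn
        return fn
    return decorator

class PartiallyFusedOptim:
    # Wraps an arbitrary optimizer: uses a registered fused step for the update math when
    # one exists for that class, otherwise falls back to the stock step(), so a user's
    # tweaked or subclassed optimizer still runs unchanged.
    def __init__(self, optim: Optimizer):
        self.optim = optim
        self._fused_step = FUSED_STEP_REGISTRY.get(type(optim))

    def step(self, closure=None):
        if self._fused_step is not None:
            return self._fused_step(self.optim)
        return self.optim.step(closure)

    def zero_grad(self, set_to_none=True):
        self.optim.zero_grad(set_to_none=set_to_none)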

isamu-isozaki commented 1 year ago

@xrsrke I think I kind of get you, but I think that would lead to decreased performance, since the more segmented it is, the more global reads/writes there are, and those are the bottleneck for CUDA performance. So overall, replacing everything with CUDA to minimize reads/writes tends to be the fastest (if the CUDA is optimized). For the design, I'm thinking of something like https://github.com/lucidrains/lion-pytorch, but with CUDA instead of Triton. (I'm mainly familiar with Triton + optimizers, where they pretty much just replace the main chunk with Triton.)
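
To make the "replace the main chunk" point concrete, here is a tiny Triton-style sketch (a fused SGD-with-momentum update; the kernel and launcher names are made up, and it assumes contiguous CUDA tensors). Each element is read once and written once inside a single kernel, instead of launching several elementwise kernels that each round-trip through global memory.

import triton
import triton.language as tl

@triton.jit
def fused_sgd_momentum_kernel(p_ptr, g_ptr, m_ptr, lr, momentum, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    # one global read per tensor ...
    p = tl.load(p_ptr + offsets, mask=mask)
    g = tl.load(g_ptr + offsets, mask=mask)
    m = tl.load(m_ptr + offsets, mask=mask)
    # ... the whole update in registers ...
    m = momentum * m + g
    p = p - lr * m
    # ... and one global write per tensor
    tl.store(m_ptr + offsets, m, mask=mask)
    tl.store(p_ptr + offsets, p, mask=mask)

def fused_sgd_momentum_step(param, grad, momentum_buf, lr=1e-3, momentum=0.9):
    n = param.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    fused_sgd_momentum_kernel[grid](param, grad, momentum_buf, lr, momentum, n, BLOCK_SIZE=1024)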

xrsrke commented 1 year ago

"So overall, replacing everything with CUDA to minimize read-writes tend to be the fastest(if CUDA is optimized)."

@isamu-isozaki
That sounds good. If that yields better results, then go for it. Thank you.