xrsrke opened 1 year ago
https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747 kind of does what you're describing, I think.
@isamu-isozaki I was referring to a fused optimizer like FusedAdam
from DeepSpeed (link). We fuse certain operations, such as element-wise operations, since these occupy the majority of the runtime during training.
Our goal is to enable the library to perform 3D parallelism in conjunction with DistributedOptimizer (ZeRO-1). We maintain a list of popular optimizers along with their fused versions. Then we create a mapping between a torch.optim.Optimizer
and its corresponding fused version, which we subsequently feed to DistributedOptimizer. This is just one potential solution I have in mind :)
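Roughly this kind of mapping (just a sketch; I'm using DeepSpeed's FusedAdam as the fused counterpart here, and the DistributedOptimizer wiring is left out):

```python
import torch
from deepspeed.ops.adam import FusedAdam  # DeepSpeed's fused CUDA Adam

# Map a vanilla torch.optim class to its fused counterpart.
# FusedAdam also covers the AdamW-style update (via its adam_w_mode flag).
FUSED_OPTIMIZER_MAPPING = {
    torch.optim.Adam: FusedAdam,
    torch.optim.AdamW: FusedAdam,
}

def get_fused_optimizer_cls(optim_cls: type) -> type:
    """Return the fused version if we have one, otherwise the original class."""
    return FUSED_OPTIMIZER_MAPPING.get(optim_cls, optim_cls)

# e.g. the class we'd construct and then hand to DistributedOptimizer (ZeRO-1)
fused_cls = get_fused_optimizer_cls(torch.optim.Adam)
```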
@xrsrke I think this is definitely possible if we make a fused version of each optimizer beforehand, yup. The link above was mainly about converting generic PyTorch code to a fused version. Then do you think this is pretty much the same issue as the porting-CUDA-kernels issue (or under it)?
@isamu-isozaki
"Then do you think this is pretty much the same issue as the porting-CUDA-kernels issue (or under it)?"
Yes.
"The link above was mainly about converting generic PyTorch code to a fused version."
Or maybe we could fuse the entire model after parallelizing it (TensorParallel, PipelineParallel, ...).
Would you like to take on both issues (this one and the port-CUDA-kernels one)? I will merge them both for you and assign them. Let me know if you need a GPU for testing, although any GPU should work here, since we will just be testing the correctness of the fused version.
@xrsrke Sounds good. I think I can do the initial setup for how we want the CUDA code formatted, plus some examples, and then we can probably start accepting CUDA kernel PR contributions for each optimizer.
Thank you. @isamu-isozaki Also, if you look at those fused optimizers, the only thing they do is replace one or a few operations with their fused versions (am I missing something?) and keep everything else the same. So it'd be amazing if we could take an arbitrary optimizer, replace only the operations for which we have a fused version available, and keep everything else the same... so that if users have some tweaks in their optimizer, those still work. What do you think?
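To make it concrete, a rough sketch of what I mean (the names here are made up, and bias correction is omitted): the optimizer's Python logic stays untouched, and only the elementwise update is swapped for a fused kernel when one is registered.

```python
import torch

# Hypothetical registry: op name -> fused kernel wrapper (if we have one)
FUSED_OPS = {}

def fused_or_eager(name, eager_fn):
    """Prefer a registered fused kernel, otherwise fall back to plain PyTorch ops."""
    return FUSED_OPS.get(name, eager_fn)

def eager_adam_update(p, grad, exp_avg, exp_avg_sq, lr, beta1, beta2, eps):
    # The plain elementwise ops that a fused kernel would replace
    # (bias correction omitted for brevity).
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    p.addcdiv_(exp_avg, exp_avg_sq.sqrt().add_(eps), value=-lr)

# Any user tweaks elsewhere in the optimizer (schedules, weight decay, etc.)
# keep working; only this one call changes when a fused version exists.
adam_update = fused_or_eager("adam_update", eager_adam_update)
```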
@xrsrke I think I kind of get you, but I think that'll lead to decreased performance: the more segmented it is, the more global reads/writes, and those are the bottleneck for CUDA performance. So overall, replacing everything with CUDA to minimize read-writes tends to be the fastest (if the CUDA is optimized). For the design, I'm thinking of something like https://github.com/lucidrains/lion-pytorch, but with CUDA instead of Triton. (I'm mainly familiar with Triton + optimizers, where they pretty much just replace the main update chunk with Triton.)
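For example, the Triton pattern I'm used to looks roughly like this (a bare-bones SGD update, just to show the shape; a CUDA version would mirror it with a .cu kernel). The whole elementwise update is a single kernel, so each parameter/gradient element gets read and written once:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def sgd_update_kernel(p_ptr, grad_ptr, lr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    p = tl.load(p_ptr + offsets, mask=mask)
    g = tl.load(grad_ptr + offsets, mask=mask)
    # one fused read-modify-write instead of several separate elementwise kernels
    tl.store(p_ptr + offsets, p - lr * g, mask=mask)

def fused_sgd_step(param: torch.Tensor, lr: float):
    assert param.is_cuda and param.is_contiguous() and param.grad is not None
    n = param.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    sgd_update_kernel[grid](param, param.grad, lr, n, BLOCK_SIZE=1024)
```

The rest of the optimizer (state dicts, hyperparameter handling, ZeRO-1 sharding) stays as normal Python, which is basically the lion-pytorch layout.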
"So overall, replacing everything with CUDA to minimize read-writes tend to be the fastest(if CUDA is optimized)."
@isamu-isozaki
That sounds good. If that yields better results, then go for it. Thank you.
Since our DistributedOptimizer takes another optimizer and turns it into ZeRO-1, can we make it handle a fused optimizer the same way? It should take an optimizer and turn it into a fused ZeRO-1 optimizer in a generic way.
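Maybe something with this kind of shape (hypothetical names; it assumes the FUSED_OPTIMIZER_MAPPING sketch from above and that our DistributedOptimizer takes the wrapped optimizer plus a parallel context, which may not match its real signature):

```python
import torch
from torch import nn

def build_fused_zero1_optimizer(model: nn.Module, optim_cls, parallel_context, **optim_kwargs):
    # Swap in the fused counterpart when we have one; otherwise keep the user's class as-is.
    fused_cls = FUSED_OPTIMIZER_MAPPING.get(optim_cls, optim_cls)
    base_optimizer = fused_cls(model.parameters(), **optim_kwargs)
    # DistributedOptimizer shards the optimizer states across data-parallel ranks (ZeRO-1).
    return DistributedOptimizer(base_optimizer, parallel_context)
```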
APIs
TODO