Fused Optimizer #13

Open
3 tasks
xrsrke opened this issue Oct 25, 2023 · 8 comments

xrsrke commented Oct 25, 2023

Since our DistributedOptimizer takes another optimizer and turns it into ZeRO-1, can we make it produce a fused optimizer in the same way, like this? It should take an arbitrary optimizer and turn it into a fused ZeRO-1 optimizer in a generic way.

APIs

from torch.optim import Adam
from pipegoose.optim import FusedOptim

# build a regular torch optimizer, then wrap it;
# fuse() returns an equivalent optimizer whose update step runs fused kernels
optim = Adam(model.parameters(), lr=1e-3)
optim = FusedOptim(optim).fuse()

# the training loop stays unchanged
loss.backward()
optim.step()
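
A rough sketch of the wrapper shape this API implies (hypothetical; nothing below exists in pipegoose yet, and the lookup is left as a stub):

import torch


class FusedOptim:
    """Wraps a regular torch optimizer; fuse() is meant to return an
    equivalent optimizer whose step() runs fused kernels."""

    def __init__(self, optim: torch.optim.Optimizer):
        self.optim = optim

    def fuse(self) -> torch.optim.Optimizer:
        # look up or build the fused counterpart of self.optim here;
        # fall back to the original optimizer if no fused version exists
        return self.optim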

TODO

  • Fused Adam
  • Fused SGD
  • Test all fused optimizers with DataParallel and ZeRO-1
xrsrke converted this from a draft issue Oct 25, 2023
xrsrke added the "help wanted" label Oct 25, 2023

xrsrke commented Oct 26, 2023

@isamu-isozaki I was referring to a fused optimizer like FusedAdam from DeepSpeed (link). We fuse certain operations, such as element-wise operations, since these occupy the majority of the runtime during training.
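
To make the element-wise point concrete, here is roughly what an Adam-style update (bias correction omitted) looks like as eager PyTorch ops; every line is its own kernel with its own pass over global memory, which is what a fused optimizer collapses into a single kernel:

import torch

def adam_update_unfused(param, grad, exp_avg, exp_avg_sq,
                        lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # every op below launches a separate CUDA kernel with its own
    # read/write pass over the tensors in global memory
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    denom = exp_avg_sq.sqrt().add_(eps)
    param.addcdiv_(exp_avg, denom, value=-lr)
    # a fused optimizer (e.g. DeepSpeed's FusedAdam) does the same math in
    # one kernel, touching each tensor in global memory only once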

Our goal is to enable the library to perform 3D parallelism in conjunction with DistributedOptimizer (ZeRO-1). We maintain a list of popular optimizers along with their fused versions. Then we create a mapping between a torch.optim.Optimizer and its corresponding fused version, which we subsequently feed to DistributedOptimizer. This is just one potential solution I have in mind :)
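
One possible shape for that mapping, just to illustrate the idea; the fused classes below come from NVIDIA Apex, and FUSED_VERSIONS / to_fused are made-up names, not existing pipegoose APIs:

import torch
from torch.optim import Adam, SGD
from apex.optimizers import FusedAdam, FusedSGD  # NVIDIA Apex fused optimizers

# map a torch optimizer class to its fused counterpart
FUSED_VERSIONS = {
    Adam: FusedAdam,
    SGD: FusedSGD,
}

def to_fused(optim: torch.optim.Optimizer) -> torch.optim.Optimizer:
    fused_cls = FUSED_VERSIONS.get(type(optim))
    if fused_cls is None:
        return optim  # no fused counterpart known; keep the original
    # simplified: rebuild with the same param groups; real code would also
    # translate hyperparameters explicitly and carry over optimizer state
    return fused_cls(optim.param_groups)

# the result would then be handed to DistributedOptimizer (ZeRO-1) as usual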

isamu-isozaki commented Oct 26, 2023

@xrsrke I think this is definitely possible if we make a fused version of each optimizer beforehand, yup. For the above link, it was mainly for just converting generic PyTorch code to a fused version.
Then do you think this is pretty much the same issue as the porting CUDA kernels issue (or under it)?


xrsrke commented Oct 26, 2023

@isamu-isozaki

"Then do you think this is pretty much the same issue as the porting CUDA kernels issue?"

Yes.

"For the above link, it was mainly for just converting generic PyTorch code to a fused version."

Or maybe we could fuse the entire model after parallelizing it (using TensorParallel, PipelineParallel, ...).

Would you like to take on both issues (this one and the CUDA kernel port)? I will merge and assign them both to you. Let me know if you need a GPU for testing, although any GPU should work here, since we will just be testing the correctness of the fused versions.

isamu-isozaki commented Oct 26, 2023

@xrsrke Sounds good. I think I can do the initial setup for how we want the CUDA code formatted, plus some examples, and then we can probably start accepting CUDA kernel PR contributions for each optimizer.


xrsrke commented Oct 26, 2023

Thank you, @isamu-isozaki. Also, if you look at those fused optimizers, the only thing they do is replace one or a few operations with their fused versions (am I missing something?) and keep everything else the same. So it'd be amazing if we could take an arbitrary optimizer, replace only the operations for which we have fused versions available, and keep everything else the same... so that if users have tweaked their optimizer, their tweaks still work. What do you think?
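
A minimal sketch of that idea, using PyTorch's (private) multi-tensor torch._foreach_* ops as a stand-in for a real fused kernel; the function names are made up for illustration:

import torch

def exp_avg_update_unfused(exp_avgs, grads, beta1=0.9):
    # one kernel launch per tensor and per op
    for m, g in zip(exp_avgs, grads):
        m.mul_(beta1).add_(g, alpha=1 - beta1)

def exp_avg_update_fused(exp_avgs, grads, beta1=0.9):
    # same math, but one multi-tensor launch per op across all tensors;
    # a real fused CUDA/Triton kernel would go further and merge the ops too
    torch._foreach_mul_(exp_avgs, beta1)
    torch._foreach_add_(exp_avgs, grads, alpha=1 - beta1)

# everything else in the user's optimizer (schedules, weight-decay tweaks,
# custom logic) stays plain Python and keeps working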


isamu-isozaki commented Oct 26, 2023

@xrsrke I think I get what you mean, but I think that will lead to decreased performance: the more segmented the update is, the more global reads/writes there are, and those are the bottleneck for CUDA performance. So overall, replacing everything with CUDA to minimize reads/writes tends to be the fastest (if the CUDA is optimized). For the design, I'm thinking of something like https://github.com/lucidrains/lion-pytorch, but with CUDA instead of Triton. (I'm mainly familiar with Triton + optimizers, where they pretty much just replace the main chunk of the update with a Triton kernel.)
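
For reference, a minimal sketch of the "replace the main chunk with Triton" pattern for a plain SGD update (kernel and helper names are illustrative, not pipegoose or lion-pytorch code; assumes contiguous CUDA tensors):

import torch
import triton
import triton.language as tl

@triton.jit
def sgd_kernel(param_ptr, grad_ptr, lr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    p = tl.load(param_ptr + offsets, mask=mask)
    g = tl.load(grad_ptr + offsets, mask=mask)
    # the whole update happens in registers: one global read and one global
    # write per element, no matter how many ops we chain in between
    tl.store(param_ptr + offsets, p - lr * g, mask=mask)

def fused_sgd_step(param: torch.Tensor, grad: torch.Tensor, lr: float = 1e-3):
    n = param.numel()
    grid = (triton.cdiv(n, 1024),)
    sgd_kernel[grid](param, grad, lr, n, BLOCK_SIZE=1024)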

xrsrke moved this from Todo to In Progress in pipegoose v1 Oct 27, 2023

xrsrke commented Oct 27, 2023

"So overall, replacing everything with CUDA to minimize reads/writes tends to be the fastest (if the CUDA is optimized)."

@isamu-isozaki That sounds good. If that yields better results, then go for it. Thank you.

xrsrke removed the "help wanted" label Nov 14, 2023