Fused Optimizer #13
https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747
@isamu-isozaki I was referring to a fused optimizer. Our goal is to enable the library to perform 3D parallelism in conjunction with DistributedOptimizer (ZeRO-1). We maintain a list of popular optimizers along with their fused versions, then create a mapping between an optimizer and its fused counterpart.
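A minimal sketch of what such a mapping could look like (hypothetical; PyTorch's own `fused=True` Adam/AdamW are used only as stand-ins for the library's fused implementations):

```python
import torch

# Hypothetical mapping from a stock optimizer class to a factory building its
# fused counterpart. PyTorch's built-in fused Adam/AdamW (CUDA-only) serve
# here only as placeholders for the library's own fused kernels.
FUSED_OPTIMIZER_MAPPING = {
    torch.optim.Adam: lambda params, **kw: torch.optim.Adam(params, fused=True, **kw),
    torch.optim.AdamW: lambda params, **kw: torch.optim.AdamW(params, fused=True, **kw),
}

def to_fused(optimizer: torch.optim.Optimizer) -> torch.optim.Optimizer:
    """Rebuild `optimizer` as its fused counterpart with the same hyperparameters."""
    factory = FUSED_OPTIMIZER_MAPPING.get(type(optimizer))
    if factory is None:
        return optimizer  # no fused version registered; keep the original
    params = [p for group in optimizer.param_groups for p in group["params"]]
    # Drop flags the factory sets itself to avoid duplicate keyword arguments.
    hyperparams = {k: v for k, v in optimizer.defaults.items() if k not in ("fused", "foreach")}
    return factory(params, **hyperparams)
```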
@xrsrke I think this is definitely possible if we make a fused version of each optimizer beforehand, yup. The link above was mainly about converting generic PyTorch code to a fused version.
Yes. Or maybe we could fuse the entire model after parallelizing it (with TensorParallel, PipelineParallel, ...). Would you like to take on both issues (this one and the CUDA kernel port)? I will merge and assign them both to you. Let me know if you need a GPU for testing, although any GPU should work for this, since we will only be testing the correctness of the fused version.
@xrsrke Sounds good. I think I can do the initial setup for how we want the CUDA code formatted, plus some examples, and then we can probably start accepting CUDA kernel PR contributions for each optimizer.
Thank you. @isamu-isozaki Also, if you look at those fused optimizers, all they do is replace one or a few operations with their fused equivalents (am I missing something?) and keep everything else the same. So it would be amazing if we could take an arbitrary optimizer, replace only the operations for which a fused version is available, and keep everything else the same, so that users who have tweaked their optimizer can still keep their tweaks. What do you think?
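One possible shape for this idea, sketched with hypothetical names: a registry of fused update functions keyed by optimizer class, and a helper that only swaps in the fused math when it is available and otherwise leaves the user's optimizer untouched.

```python
import torch

# Hypothetical registry: optimizer class -> fused implementation of its
# per-parameter-group update. Only the update math is replaced; state
# handling, closures, and any user customizations stay with the original.
_FUSED_UPDATE_REGISTRY = {}

def register_fused_update(optim_cls):
    """Decorator registering a fused update for one optimizer class."""
    def decorator(fn):
        _FUSED_UPDATE_REGISTRY[optim_cls] = fn
        return fn
    return decorator

def maybe_fuse(optimizer: torch.optim.Optimizer) -> torch.optim.Optimizer:
    """Patch `optimizer.step` to call the fused update, if one is registered."""
    fused_update = _FUSED_UPDATE_REGISTRY.get(type(optimizer))
    if fused_update is None:
        return optimizer  # nothing fused available; leave the optimizer alone

    def step(closure=None):
        loss = closure() if closure is not None else None
        for group in optimizer.param_groups:
            fused_update(optimizer, group)  # fused math for this group
        return loss

    optimizer.step = step
    return optimizer
```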
@xrsrke I think I get what you mean, but that would likely hurt performance: the more segmented the update is, the more global reads and writes it incurs, and those are the bottleneck for CUDA performance. So overall, replacing everything with CUDA to minimize read-writes tends to be fastest (if the CUDA code is optimized). For the design, I'm thinking of something like https://github.com/lucidrains/lion-pytorch, but with CUDA instead of Triton. (I'm mainly familiar with Triton optimizers, where they pretty much just replace the main chunk of the update with Triton.)
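For reference, a minimal sketch of that single-kernel style in Triton (plain SGD here, just to show the one-read/one-write pattern; an equivalent CUDA C++ kernel would have the same structure, and all names are illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_sgd_kernel(param_ptr, grad_ptr, lr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance updates one contiguous block: one read of the
    # parameter and the gradient, one write of the parameter.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    p = tl.load(param_ptr + offsets, mask=mask)
    g = tl.load(grad_ptr + offsets, mask=mask)
    tl.store(param_ptr + offsets, p - lr * g, mask=mask)

def fused_sgd_step(param: torch.Tensor, lr: float = 1e-2):
    """Apply a plain SGD update in-place to one contiguous CUDA parameter tensor."""
    assert param.is_cuda and param.is_contiguous() and param.grad is not None
    n = param.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_sgd_kernel[grid](param, param.grad, lr, n, BLOCK_SIZE=1024)
```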
@isamu-isozaki
Since our DistributedOptimizer takes another optimizer and turns it into ZeRO-1, can we do the same for a fused optimizer? It should take an optimizer and turn it into a fused ZeRO-1 optimizer in a generic way.
APIs
TODO
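A rough sketch of what the generic wrapping described above could look like (not a settled API; the helper name, the `fused_mapping` argument, and the constructor signature of the ZeRO-1 wrapper are all assumptions here):

```python
def build_fused_zero1(base_optimizer, parallel_context, fused_mapping, zero1_cls):
    """Hypothetical helper: turn an arbitrary optimizer into a fused ZeRO-1 one.

    `fused_mapping` maps optimizer classes to fused counterparts (as in the
    sketch earlier in the thread); `zero1_cls` is the existing ZeRO-1 wrapper
    (DistributedOptimizer), whose constructor signature is assumed here.
    """
    fused_cls = fused_mapping.get(type(base_optimizer))
    if fused_cls is not None:
        # Rebuild the optimizer with its fused implementation, same hyperparameters.
        params = [p for group in base_optimizer.param_groups for p in group["params"]]
        base_optimizer = fused_cls(params, **base_optimizer.defaults)
    # Shard optimizer states across data-parallel ranks exactly as before (ZeRO-1).
    return zero1_cls(base_optimizer, parallel_context)
```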