[optim] Distributed Adafactor #5484
Merged: ver217 merged 36 commits into hpcaitech:feature/dist-optim from duanjunwen:dist_adafactor on Apr 30, 2024
Conversation
ver217 requested changes on Mar 21, 2024
duanjunwen force-pushed the dist_adafactor branch from 34e20c2 to 4e9a571 on April 7, 2024 07:50
ver217 reviewed on Apr 7, 2024
ver217 reviewed on Apr 7, 2024
duanjunwen force-pushed the dist_adafactor branch from 67e4754 to b75ac58 on April 8, 2024 09:38
…ivisible shape (induction in use_zero and tp is row parallel);
duanjunwen force-pushed the dist_adafactor branch from 606c619 to 020ed54 on April 9, 2024 09:16
duanjunwen force-pushed the dist_adafactor branch from dc3f8dd to ce58adf on April 9, 2024 10:03
duanjunwen force-pushed the dist_adafactor branch from f0b9d6b to 40a5528 on April 10, 2024 07:00
duanjunwen force-pushed the dist_adafactor branch from c3262d9 to 1c9bb93 on April 10, 2024 07:23
duanjunwen force-pushed the dist_adafactor branch from 4e1d2af to 1039f34 on April 10, 2024 07:26
ver217 reviewed on Apr 10, 2024
duanjunwen force-pushed the dist_adafactor branch from f10ae73 to 2ffca49 on April 10, 2024 12:06
duanjunwen force-pushed the dist_adafactor branch from 5673a72 to 0fd62a0 on April 11, 2024 03:14
duanjunwen force-pushed the dist_adafactor branch from 2dff732 to 28c3a40 on April 11, 2024 07:35
ver217 reviewed on Apr 11, 2024
duanjunwen force-pushed the dist_adafactor branch from aae5406 to 02ea83e on April 13, 2024 16:37
duanjunwen force-pushed the dist_adafactor branch from 5753b0d to fb14125 on April 14, 2024 13:07
duanjunwen force-pushed the dist_adafactor branch from 14af45b to 2dc0341 on April 15, 2024 08:00
duanjunwen force-pushed the dist_adafactor branch from 131f9e4 to 3bca491 on April 15, 2024 09:37
duanjunwen force-pushed the dist_adafactor branch from dff7ba3 to 1357dd1 on April 16, 2024 07:58
ver217 reviewed on Apr 16, 2024
…Add Dist Adafactor init Info;
duanjunwen force-pushed the dist_adafactor branch from 0ac3a09 to 1038b23 on April 17, 2024 03:11
duanjunwen force-pushed the dist_adafactor branch from 97bb328 to 87746ec on April 17, 2024 03:34
ver217 reviewed on Apr 17, 2024
duanjunwen force-pushed the dist_adafactor branch from 01d0f95 to 510d4c0 on April 18, 2024 06:46
Issue in review: #5484 (comment) is fixed in 510d4c0;
duanjunwen force-pushed the dist_adafactor branch from 4a28efe to 0a7f682 on April 18, 2024 07:55
ver217 approved these changes on Apr 29, 2024
Edenzzzz added a commit that referenced this pull request on May 14, 2024
…5694)

* [feat] Add distributed lamb; minor fixes in DeviceMesh (#5476)
  * init: add dist lamb; add debiasing for lamb
  * dist lamb tester mostly done
  * all tests passed
  * add comments
  * all tests passed. Removed debugging statements
  * moved setup_distributed inside plugin. Added dist layout caching
  * organize better
  ---------
  Co-authored-by: Edenzzzz <[email protected]>
* [hotfix] Improve tester precision by removing ZeRO on vanilla lamb (#5576)
  Co-authored-by: Edenzzzz <[email protected]>
* [optim] add distributed came (#5526)
  * test CAME under LowLevelZeroOptimizer wrapper
  * test CAME TP row and col pass
  * test CAME zero pass
  * came zero add master and worker param id convert
  * came zero test pass
  * came zero test pass
  * test distributed came passed
  * reform code, Modify some expressions and add comments
  * minor fix of test came
  * minor fix of dist_came and test
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * minor fix of dist_came and test
  * rebase dist-optim
  * rebase dist-optim
  * fix remaining comments
  * add test dist came using booster api
  ---------
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [optim] Distributed Adafactor (#5484)
  * [feature] solve conflict; update optimizer readme;
  * [feature] update optimize readme;
  * [fix] fix testcase;
  * [feature] Add transformer-bert to testcase;solve a bug related to indivisible shape (induction in use_zero and tp is row parallel);
  * [feature] Add transformers_bert model zoo in testcase;
  * [feature] add user documentation to docs/source/feature.
  * [feature] add API Reference & Sample to optimizer Readme; add state check for bert exam;
  * [feature] modify user documentation;
  * [fix] fix readme format issue;
  * [fix] add zero=0 in testcase; cached augment in dict;
  * [fix] fix percision issue;
  * [feature] add distributed rms;
  * [feature] remove useless comment in testcase;
  * [fix] Remove useless test; open zero test; remove fp16 test in bert exam;
  * [feature] Extract distributed rms function;
  * [feature] add booster + lowlevelzeroPlugin in test;
  * [feature] add Start_with_booster_API case in md; add Supporting Information in md;
  * [fix] Also remove state movement in base adafactor;
  * [feature] extract factor function;
  * [feature] add LowLevelZeroPlugin test;
  * [fix] add tp=False and zero=True in logic;
  * [fix] fix use zero logic;
  * [feature] add row residue logic in column parallel factor;
  * [feature] add check optim state func;
  * [feature] Remove duplicate logic;
  * [feature] update optim state check func and percision test bug;
  * [fix] update/fix optim state; Still exist percision issue;
  * [fix] Add use_zero check in _rms; Add plugin support info in Readme; Add Dist Adafactor init Info;
  * [feature] removed print & comments in utils;
  * [feature] uodate Readme;
  * [feature] add LowLevelZeroPlugin test with Bert model zoo;
  * [fix] fix logic in _rms;
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * [fix] remove comments in testcase;
  * [feature] add zh-Han Readme;
  ---------
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Feature] refractor dist came; fix percision error; add low level zero test with bert model zoo; (#5676)
  * [feature] daily update;
  * [fix] fix dist came;
  * [feature] refractor dist came; fix percision error; add low level zero test with bert model zoo;
  * [fix] open rms; fix low level zero test; fix dist came test function name;
  * [fix] remove redundant test;
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  ---------
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Feature] Add Galore (Adam, Adafactor) and distributed GaloreAdamW8bit (#5570)
  * init: add dist lamb; add debiasing for lamb
  * dist lamb tester mostly done
  * all tests passed
  * add comments
  * all tests passed. Removed debugging statements
  * moved setup_distributed inside plugin. Added dist layout caching
  * organize better
  * update comments
  * add initial distributed galore
  * add initial distributed galore
  * add galore set param utils; change setup_distributed interface
  * projected grad precision passed
  * basic precision tests passed
  * tests passed; located svd precision issue in fwd-bwd; banned these tests
  * Plugin DP + TP tests passed
  * move get_shard_dim to d_tensor
  * add comments
  * remove useless files
  * remove useless files
  * fix zero typo
  * improve interface
  * remove moe changes
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * fix import
  * fix deepcopy
  * update came & adafactor to main
  * fix param map
  * fix typo
  ---------
  Co-authored-by: Edenzzzz <[email protected]>
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Hotfix] Remove one buggy test case from dist_adafactor for now (#5692)
  Co-authored-by: Edenzzzz <[email protected]>
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

---------

Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: chongqichuizi875 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: duanjunwen <[email protected]>
Co-authored-by: Hongxin Liu <[email protected]>