
[optim] Distributed Adafactor #5484

Merged
36 commits merged into hpcaitech:feature/dist-optim on Apr 30, 2024

Conversation

duanjunwen
Member

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
If you have any plots/diagrams/screenshots/tables, please attach them here.

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/tables/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@duanjunwen duanjunwen requested a review from a team as a code owner March 21, 2024 06:40
Review threads (all now resolved) covered README.md, colossalai/nn/optimizer/adafactor.py, colossalai/nn/optimizer/distributed_adafactor.py, tests/test_optimizer/test_distributred_adafactor_optim.py, tests/test_optimizer/test_dist_adafactor.py, and docs/source/en/features/distributed_adafactor.md.
@duanjunwen
Member Author

The issue raised in review (#5484 (comment)) is fixed in 510d4c0.

@ver217 ver217 changed the title Distributed Adafactor [optim] Distributed Adafactor Apr 29, 2024
@ver217 ver217 merged commit 375e1b9 into hpcaitech:feature/dist-optim Apr 30, 2024
5 of 7 checks passed
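For readers arriving from the docs, below is a rough usage sketch of the distributed Adafactor added here, modelled on the booster + LowLevelZeroPlugin tests mentioned in the commit history. The class name DistributedAdaFactor and its import path are assumptions inferred from the reviewed file colossalai/nn/optimizer/distributed_adafactor.py; the merged guide docs/source/en/features/distributed_adafactor.md documents the actual API.

```python
# Hypothetical usage sketch (not the PR's verbatim example); run with torchrun.
import torch
import torch.nn as nn

import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import LowLevelZeroPlugin
# Assumed export; the implementation lives in colossalai/nn/optimizer/distributed_adafactor.py.
from colossalai.nn.optimizer.distributed_adafactor import DistributedAdaFactor

colossalai.launch_from_torch(config={})  # recent releases also accept no config dict

model = nn.Linear(1024, 1024).cuda()
optimizer = DistributedAdaFactor(model.parameters())  # Adafactor-style defaults assumed

# Mirrors the "booster + LowLevelZeroPlugin" tests mentioned in the commit history:
# the plugin shards optimizer states; boost() wires the optimizer into the sharded setup.
booster = Booster(plugin=LowLevelZeroPlugin())
model, optimizer, *_ = booster.boost(model, optimizer)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).sum()
booster.backward(loss, optimizer)
optimizer.step()
optimizer.zero_grad()
```

Run it under torchrun so the process group is initialized; multiple devices are assumed, since the optimizer is designed for sharded training.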
Edenzzzz added a commit that referenced this pull request May 14, 2024
…5694)

* [feat] Add distributed lamb; minor fixes in DeviceMesh (#5476)

* init: add dist lamb; add debiasing for lamb

* dist lamb tester mostly done

* all tests passed

* add comments

* all tests passed. Removed debugging statements

* moved setup_distributed inside plugin. Added dist layout caching

* organize better

---------

Co-authored-by: Edenzzzz <[email protected]>

* [hotfix] Improve tester precision by removing ZeRO on vanilla lamb (#5576)

Co-authored-by: Edenzzzz <[email protected]>

* [optim] add distributed came (#5526)

* test CAME under LowLevelZeroOptimizer wrapper

* test CAME TP row and col pass

* test CAME zero pass

* came zero add master and worker param id convert

* came zero test pass

* came zero test pass

* test distributed came passed

* reform code; modify some expressions and add comments

* minor fix of test came

* minor fix of dist_came and test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix of dist_came and test

* rebase dist-optim

* rebase dist-optim

* fix remaining comments

* add test dist came using booster api

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [optim] Distributed Adafactor (#5484)

* [feature] solve conflict; update optimizer readme;

* [feature] update optimize readme;

* [fix] fix testcase;

* [feature] Add transformer-bert to testcase; solve a bug related to indivisible shapes (introduced when use_zero is on and TP is row-parallel);

* [feature] Add transformers_bert model zoo in testcase;

* [feature] add user documentation to docs/source/feature.

* [feature] add API Reference & Sample to optimizer Readme; add state check for the bert example;

* [feature] modify user documentation;

* [fix] fix readme format issue;

* [fix] add zero=0 in testcase; cache arguments in a dict;

* [fix] fix precision issue;

* [feature] add distributed rms;

* [feature] remove useless comment in testcase;

* [fix] Remove useless test; open zero test; remove fp16 test in the bert example;

* [feature] Extract distributed rms function;

* [feature] add booster + LowLevelZeroPlugin in test;

* [feature] add Start_with_booster_API case in md; add Supporting Information in md;

* [fix] Also remove state movement in base adafactor;

* [feature] extract factor function;

* [feature] add LowLevelZeroPlugin test;

* [fix] add tp=False and zero=True in logic;

* [fix] fix use zero logic;

* [feature] add row residue logic in column parallel factor;

* [feature] add check optim state func;

* [feature] Remove duplicate logic;

* [feature] update optim state check func and fix precision test bug;

* [fix] update/fix optim state; precision issue still exists;

* [fix] Add use_zero check in _rms; Add plugin support info in Readme; Add Dist Adafactor init Info;

* [feature] removed print & comments in utils;

* [feature] update Readme;

* [feature] add LowLevelZeroPlugin test with Bert model zoo;

* [fix] fix logic in _rms;

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [fix] remove comments in testcase;

* [feature] add zh-Hans Readme;

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Feature] refactor dist came; fix precision error; add low level zero test with bert model zoo; (#5676)

* [feature] daily update;

* [fix] fix dist came;

* [feature] refactor dist came; fix precision error; add low level zero test with bert model zoo;

* [fix] open rms; fix low level zero test; fix dist came test function name;

* [fix] remove redundant test;

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Feature] Add Galore (Adam, Adafactor) and distributed GaloreAdamW8bit (#5570)

* init: add dist lamb; add debiasing for lamb

* dist lamb tester mostly done

* all tests passed

* add comments

* all tests passed. Removed debugging statements

* moved setup_distributed inside plugin. Added dist layout caching

* organize better

* update comments

* add initial distributed galore

* add initial distributed galore

* add galore set param utils; change setup_distributed interface

* projected grad precision passed

* basic precision tests passed

* tests passed; located svd precision issue in fwd-bwd; banned these tests

* Plugin DP + TP tests passed

* move get_shard_dim to d_tensor

* add comments

* remove useless files

* remove useless files

* fix zero typo

* improve interface

* remove moe changes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix import

* fix deepcopy

* update came & adafactor to main

* fix param map

* fix typo

---------

Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Hotfix] Remove one buggy test case from dist_adafactor for now (#5692)


Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

---------

Co-authored-by: Edenzzzz <[email protected]>
Co-authored-by: chongqichuizi875 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: duanjunwen <[email protected]>
Co-authored-by: Hongxin Liu <[email protected]>
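As an aside on the "distributed rms" / _rms items in the commit list above: Adafactor scales each update by its root mean square, RMS(u) = sqrt(mean(u^2)), and once parameters are sharded across tensor-parallel or ZeRO ranks that mean has to be aggregated with a collective. A minimal, hypothetical illustration of the idea (function name and structure are mine, not the PR's actual code):

```python
from typing import Optional

import torch
import torch.distributed as dist


def distributed_rms(shard: torch.Tensor, group: Optional[dist.ProcessGroup] = None) -> torch.Tensor:
    """RMS of a tensor whose elements are split across the ranks of `group`."""
    local_sq_sum = shard.float().pow(2).sum()  # sum of squares of the local shard
    local_count = torch.tensor(float(shard.numel()), device=shard.device)
    stats = torch.stack([local_sq_sum, local_count])
    # Aggregate both the squared sums and the element counts over all shards,
    # then finish the mean and square root locally.
    dist.all_reduce(stats, op=dist.ReduceOp.SUM, group=group)
    return (stats[0] / stats[1]).sqrt()
```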