
[sync] update from main #4166

Merged: 76 commits merged into hpcaitech:feature/pipeline from sync/main on Jul 4, 2023

Conversation

@ver217 (Member) commented on Jul 4, 2023

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

N/A

📝 What does this PR do?

Summarize your work here.
If you have any plots/diagrams/screenshots/tables, please attach them here.

update from main

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

FrankLeeeee and others added 30 commits June 22, 2023 14:41
* refactor: adapt boost API in base and naive strategies

* fix: initialize plugin after setup_distributed

* fix: fix save_pretrained fn

* refactor: adapt boost API in DDPStrategy

* to: add _post_init check

* to: fix ddp backward, modify ddp dataloader and unwrap

* feat: adapt boost API in ColossalAIStrategy

* fix: call setup_distributed before use get_current_device

* fix: fix save_model and save_optimizer

* test: remove save_sharded_optimizer test

* style: apply formatter

* fix: fix stage check and add comments

* feat: allow dict type arg in strategy.prepare

* to: temporarily remove lr_scheduler for testing

* style: simplify init of ColossalAIStrategy

* fix: fix lr_scheduler in sft and rm

* style: modify comments

* test: add train_prompts tests

* fix: fix inference only case and use in train_prompts

* test: skip failed tests in ci

* style: fix CodeFactor check

* fix: do not use model.to('cpu') with GeminiPlugin

* test: enable colossalai_gemini tests

* test: set CUDA_VISIBLE_DEVICES in ci

* docs: add note

[gemini] Rename arguments in chunk configuration searching
* fix chat eval

* fix utils

* fix utils

* add comment

---------

Co-authored-by: Qianran Ma <qianranm@luchentech.com>
* copy resnet example

* add pytest package

* skip test_ci

* skip test_ci

* skip test_ci
* fix some typos and problems in doc

* fix some typos and problems in doc

* add doc test
* to: add SLTrainer

* refactor: refactor RMTrainer and SFTTrainer

* fix: fix init file

* feat: remove on_learn_epoch fn as not used

* fix: align with modified gemini arguments

* to: add OnPolicyTrainer

* revert: add _on_learn_epoch fn

* refactor: refactor PPOTrainer

* style: rename PPOTrainer argument

* fix: align with modified PPO arguments

* test: align with modified train_prompts arguments

* chore: modify train_prompts

* docs: align with modified arguments

* fix: remove unnecessary output

* fix: move dataloader to fit fn of SLTrainer

* fix: move dataloader to fit fn of OnPolicyTrainer

* fix: modify usage of prompt and pretrain dataloader
(#4094)

* feat: remove on_learn_epoch fn as not used

* revert: add _on_learn_epoch fn

* to: remove the use of NaiveStrategy

* test: remove NaiveStrategy tests

* feat: remove NaiveStrategy

* style: modify comments and params

* feat: split ColossalAIStrategy into LowLevelZeroStrategy and GeminiStrategy

* fix: remove naive

* fix: align with modified colossal strategy

* fix: fix ddp _try_init_dist arg
* feat: remove on_learn_epoch fn as not used

* revert: add _on_learn_epoch fn

* feat: remove NaiveStrategy

* test: update train_prompts tests

* fix: remove prepare_llama_tokenizer_and_embedding

* test: add lora arg

* feat: remove roberta support in train_prompts due to runtime errs

* feat: remove deberta & roberta in rm as not used

* test: remove deberta and roberta tests

* feat: remove deberta and roberta models as not used

* fix: remove calls to roberta

* fix: remove prepare_llama_tokenizer_and_embedding

* chore: update transformers version

* docs: update transformers version

* fix: fix actor inference

* fix: fix ci

* feat: change llama pad token to unk

* revert: revert ddp setup_distributed

* fix: change llama pad token to unk

* revert: undo unnecessary changes

* fix: use pip to install transformers
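
For context on the "change llama pad token to unk" commits above: the LLaMA tokenizer ships without a pad token, and reusing the existing unk token avoids adding a new token (which would force resizing the embedding matrix). A minimal sketch; the checkpoint path is a placeholder:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/llama")  # placeholder path
# LLaMA defines no pad token; reuse unk instead of adding a new token,
# which would require resizing the model's embedding matrix.
tokenizer.pad_token = tokenizer.unk_token
```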
* init shardformer code structure

* add implementation of sharder (inject and replace)

* add implementation of replacing layers with colossal layers

* separate different layer policy, add some notion

* implement 1d and 2d slicer, can tell col or row

* fix bug when slicing and inject model

* fix some bug; add inference test example
(hpcaitech#3816)

* init shardformer code structure

* add implementation of sharder (inject and replace)

* add implementation of replacing layers with colossal layers

* separate different layer policy, add some notion

* implement 1d and 2d slicer, can tell col or row

* fix bug when slicing and inject model

* fix some bug; add inference test example

* add share weight and train example

* add train

* add docstring and readme

* add docstring for other files

* pre-commit
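
For context on the slicer commits above: 1D tensor parallelism slices a `Linear` either along the output features (column parallel, outputs concatenated) or the input features (row parallel, outputs all-reduced). A minimal sketch with a hypothetical helper name; the actual shardformer slicer API differs:

```python
import torch
import torch.nn as nn

def shard_linear(linear: nn.Linear, rank: int, world_size: int, mode: str) -> nn.Linear:
    """Slice an nn.Linear for 1D tensor parallelism (hypothetical helper)."""
    if mode == "col":
        # Column parallel: split output features; per-rank outputs are concatenated.
        weight = linear.weight.chunk(world_size, dim=0)[rank]
        bias = linear.bias.chunk(world_size, dim=0)[rank] if linear.bias is not None else None
    elif mode == "row":
        # Row parallel: split input features; per-rank outputs are summed (all-reduce).
        weight = linear.weight.chunk(world_size, dim=1)[rank]
        bias = linear.bias if rank == 0 else None  # add the bias exactly once
    else:
        raise ValueError(f"unknown mode: {mode}")
    sharded = nn.Linear(weight.shape[1], weight.shape[0], bias=bias is not None)
    with torch.no_grad():
        sharded.weight.copy_(weight)
        if bias is not None:
            sharded.bias.copy_(bias)
    return sharded
```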
* [shardformer] refactored the user api

* polish code
* update readme with modules content

* remove img
(hpcaitech#3856)

* add dropout layer, add dropout test

* modify seed manager as context manager

* add a copy of col_nn.layer

* add dist_crossentropy loss; separate module test

* polish the code

* fix dist crossentropy loss
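
For context on the dist_crossentropy commits above: when the vocabulary dimension of the logits is sharded across tensor-parallel ranks, cross-entropy needs a global max, the target logit from the rank that owns it, and a global log-sum-exp. A forward-only sketch of that data flow; the real loss also implements a custom backward:

```python
import torch
import torch.distributed as dist

def dist_cross_entropy(logits: torch.Tensor, target: torch.Tensor, group=None):
    """logits: [batch, vocab/world] local shard; target: [batch] global ids."""
    rank = dist.get_rank(group)
    part = logits.size(-1)
    start = rank * part

    # Stabilize with the *global* max logit.
    logits_max = logits.max(dim=-1, keepdim=True).values
    dist.all_reduce(logits_max, op=dist.ReduceOp.MAX, group=group)
    logits = logits - logits_max

    # Fetch the target logit from whichever rank owns that vocab slice.
    mask = (target >= start) & (target < start + part)
    local_idx = (target - start).clamp(0, part - 1)
    target_logits = logits.gather(-1, local_idx.unsqueeze(-1)).squeeze(-1)
    target_logits = torch.where(mask, target_logits, torch.zeros_like(target_logits))
    dist.all_reduce(target_logits, op=dist.ReduceOp.SUM, group=group)

    # Global log-sum-exp over the sharded vocabulary.
    sum_exp = logits.exp().sum(dim=-1)
    dist.all_reduce(sum_exp, op=dist.ReduceOp.SUM, group=group)
    return (sum_exp.log() - target_logits).mean()
```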
(hpcaitech#3883)

* add gpt2 policy and modify shard and slicer to support

* remove unused code

* polish code
* add bert align test, fix dist loss bug

* forward and backward align

* add ignore index

* add shardformer CI

* add gather_output optional for user in shardconfig

* update readme with optional gather_output

* add dist crossentropy loss test, remove unused files

* remove unused file

* remove unused file

* rename the file

* polish code
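
For context on `gather_output` above: a column-parallel layer produces only a slice of the output features on each rank, so the config flag decides whether to rebuild the full output or hand the local slice to the next (row-parallel) layer. A forward-only sketch; plain `torch.distributed` collectives are not autograd-aware, so this only illustrates the data flow:

```python
import torch
import torch.distributed as dist
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """Each rank computes a slice of the output features (sketch)."""

    def __init__(self, in_features: int, out_features: int,
                 world_size: int, gather_output: bool = True):
        super().__init__()
        assert out_features % world_size == 0
        self.local = nn.Linear(in_features, out_features // world_size)
        self.world_size = world_size
        self.gather_output = gather_output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.local(x)
        if self.gather_output:
            # Rebuild the full output on every rank.
            parts = [torch.empty_like(y) for _ in range(self.world_size)]
            dist.all_gather(parts, y)
            y = torch.cat(parts, dim=-1)
        return y  # otherwise: the local slice, for a following row-parallel layer
```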
* fix bug in slicer, add slicer unit test

* add dropout test

* use pid as dropout seed

* update dropout test with local pattern

* add todo
(hpcaitech#3949)

* add dist dropout in model

* update docstring and bert policy with dropout

* refactor basepolicy and sharded, update bert

* update format

* update gpt2 policy

* update bert policy

* remove unused code

* update readme for new policy usage
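
For context on the dist dropout commits above (and the earlier "modify seed manager as context manager"): tensor-parallel ranks must agree on when dropout masks should match and when they should differ, which can be done by swapping in a privately managed RNG state around the dropout call. A hypothetical single-GPU sketch; the real seed manager is more involved:

```python
import contextlib
import torch

class SeedManager:
    """Keep a private CUDA RNG state for dropout (hypothetical sketch).

    Same seed on all ranks -> identical masks; seed + rank -> different masks.
    """

    def __init__(self, seed: int):
        original = torch.cuda.get_rng_state()
        torch.cuda.manual_seed(seed)
        self.rng_state = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(original)

    @contextlib.contextmanager
    def dropout_mode(self):
        outside = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(self.rng_state)
        try:
            yield
        finally:
            # Remember where this RNG stream left off, then restore
            # the surrounding state.
            self.rng_state = torch.cuda.get_rng_state()
            torch.cuda.set_rng_state(outside)

# Usage:
#   with seed_manager.dropout_mode():
#       x = torch.nn.functional.dropout(x, p=0.1, training=True)
```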
FrankLeeeee and others added 26 commits July 4, 2023 16:05
* [shardformer] support module saving and loading

* polish code
* add linearconv1d test

* add linearconv1d test
* add layernorm to bert

* add layernorm test

* add layernorm test with load state dict

* add use_mixedfusedLN in shard config

* refactor policy to support fused_layernorm
* [test] fixed tests failed due to dtensor change

* polish code
* [shardformer] shardformer support opt models

* [shardformer] shardformer support opt models, fix

* [shardformer] shardformer support opt models, fix

* [shardformer] shardformer support opt models, fix
* first v of vit shardformer

* keep vit

* update

* vit shard add vitattention vitlayer

* update num head shard para

* finish test for vit

* add new_model_class & postprocess

* add vit readme

* delete old files & fix the conflict

* fix sth
(hpcaitech#4126)

* [shardformer] add benchmark of shardformer

* [shardformer] add benchmark of shardformer
* [shardformer] refactored some doc and api

* polish code
* [shardformer] made tensor parallelism configurable

* polish code
(hpcaitech#4157)

Co-authored-by: github-actions <github-actions@github.com>
* [cluster] add process group mesh

* [test] add process group mesh test

* force sync
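
For context: a process group mesh views the flat ranks as an N-D grid so that tensor, pipeline, and data parallel subgroups can be carved out along different axes. A minimal sketch of the idea with a hypothetical API (not the actual `colossalai.cluster` implementation):

```python
import numpy as np
import torch.distributed as dist

class ProcessGroupMesh:
    """View world ranks as an N-D grid, e.g. sizes=(2, 4) for 8 ranks (sketch)."""

    def __init__(self, *sizes: int):
        assert dist.get_world_size() == int(np.prod(sizes))
        self.sizes = sizes
        self.rank = dist.get_rank()
        self.coord = np.unravel_index(self.rank, sizes)

    def create_group_along_axis(self, axis: int):
        """Return the subgroup containing this rank that varies along `axis`."""
        my_group = None
        # dist.new_group must be executed by every rank for every group,
        # in the same order, so iterate over all slices of the mesh.
        other_axes = [s for i, s in enumerate(self.sizes) if i != axis]
        for idx in np.ndindex(*other_axes):
            coords = list(idx)
            ranks = [
                int(np.ravel_multi_index(coords[:axis] + [k] + coords[axis:], self.sizes))
                for k in range(self.sizes[axis])
            ]
            group = dist.new_group(ranks)
            if self.rank in ranks:
                my_group = group
        return my_group
```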
* [pipeline] add stage manager

* [test] add pipeline stage manager test

* [pipeline] add docstring for stage manager
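
Building on the mesh sketch above, a stage manager can derive this rank's pipeline stage from one mesh axis and expose the predicates the schedules need (again hypothetical, reusing the `ProcessGroupMesh` sketch):

```python
class PipelineStageManager:
    """Map one axis of a ProcessGroupMesh to pipeline stages (sketch)."""

    def __init__(self, mesh: "ProcessGroupMesh", pipeline_axis: int):
        self.stage = int(mesh.coord[pipeline_axis])
        self.num_stages = mesh.sizes[pipeline_axis]

    @property
    def is_first_stage(self) -> bool:
        return self.stage == 0

    @property
    def is_last_stage(self) -> bool:
        return self.stage == self.num_stages - 1
```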
* [pipeline] add p2p communication

* [test] add p2p communication test

* [test] add rerun decorator

* [test] rename to avoid conflict
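
For context: pipeline p2p passes activation tensors between adjacent stages. A simplified sketch using blocking `torch.distributed` send/recv, transmitting the shape first so the receiver can allocate a buffer (the real implementation serializes arbitrary metadata objects):

```python
import torch
import torch.distributed as dist

def p2p_send(tensor: torch.Tensor, dst: int) -> None:
    shape = torch.tensor(tensor.shape, dtype=torch.long)
    ndim = torch.tensor([shape.numel()], dtype=torch.long)
    dist.send(ndim, dst)                 # 1. number of dimensions
    dist.send(shape, dst)                # 2. shape
    dist.send(tensor.contiguous(), dst)  # 3. payload

def p2p_recv(src: int, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    ndim = torch.empty(1, dtype=torch.long)
    dist.recv(ndim, src)
    shape = torch.empty(ndim.item(), dtype=torch.long)
    dist.recv(shape, src)
    buffer = torch.empty(*shape.tolist(), dtype=dtype)
    dist.recv(buffer, src)
    return buffer
```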
* [api] update optimizer wrapper to fit pipeline

* [pipeline] add base schedule

* [pipeline] add 1f1b schedule

* [test] add pipeline schedule utils test

* [pipeline] fix import
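
For context: the 1F1B schedule caps in-flight microbatches by running a few warmup forwards, then alternating one forward with one backward, then draining the remaining backwards. A sketch of the ordering for a single stage, with communication omitted:

```python
def one_f_one_b_order(stage: int, num_stages: int, num_microbatches: int,
                      forward, backward) -> None:
    """Invoke the given callbacks in 1F1B order for one stage (sketch)."""
    # Earlier stages need more warmup forwards; the last stage needs none.
    num_warmup = min(num_stages - stage - 1, num_microbatches)
    num_steady = num_microbatches - num_warmup

    for _ in range(num_warmup):   # warmup: fill the pipeline
        forward()
    for _ in range(num_steady):   # steady state: one forward, one backward
        forward()
        backward()
    for _ in range(num_warmup):   # cooldown: drain outstanding backwards
        backward()
```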
* add pipeline policy and bert forward to be done

* add bertmodel pipeline forward and make tests

* add Bert_Policy and test for policy

* update formatting

* update formatting

* update the code

* fix bugs

* fix name conflict
@ver217 requested a review from FrankLeeeee on July 4, 2023 at 10:05
@FrankLeeeee merged commit ef1f972 into hpcaitech:feature/pipeline on Jul 4, 2023
22 of 26 checks passed
@ver217 deleted the sync/main branch on September 13, 2023