
[sync] update from main #4166

Merged: 76 commits merged into hpcaitech:feature/pipeline from sync/main on Jul 4, 2023

Conversation

@ver217 (Member) commented on Jul 4, 2023

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

N/A

📝 What does this PR do?

Summarize your work here.
If you have any plots/diagrams/screenshots/tables, please attach them here.

update from main

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

FrankLeeeee and others added 30 commits June 22, 2023 14:41
* refactor: adapt boost API in base and naive strategies

* fix: initialize plugin after setup_distributed

* fix: fix save_pretrained fn

* refactor: adapt boost API in DDPStrategy

* to: add _post_init check

* to: fix ddp backward, modify ddp dataloader and unwrap

* feat: adapt boost API in ColossalAIStrategy

* fix: call setup_distributed before use get_current_device

* fix: fix save_model and save_optimizer

* test: remove save_sharded_optimizer test

* style: apply formatter

* fix: fix stage check and add comments

* feat: allow dict type arg in strategy.prepare

* to: temporarily remove lr_scheduler for testing

* style: simplify init of ColossalAIStrategy

* fix: fix lr_scheduler in sft and rm

* style: modify comments

* test: add train_prompts tests

* fix: fix inference only case and use in train_prompts

* test: skip failed tests in ci

* style: fix CodeFactor check

* fix: do not use model.to('cpu') with GeminiPlugin

* test: enable colossalai_gemini tests

* test: set CUDA_VISIBLE_DEVICES in ci

* docs: add note

[gemini] Rename arguments in chunk configuration searching
* fix chat eval

* fix utils

* fix utils

* add comment

---------

Co-authored-by: Qianran Ma <qianranm@luchentech.com>
* copy resnet example

* add pytest package

* skip test_ci

* skip test_ci

* skip test_ci
* fix some typos and problems in doc

* fix some typos and problems in doc

* add doc test
* to: add SLTrainer

* refactor: refactor RMTrainer and SFTTrainer

* fix: fix init file

* feat: remove on_learn_epoch fn as not used

* fix: align with modified gemini arguments

* to: add OnPolicyTrainer

* revert: add _on_learn_epoch fn

* refactor: refactor PPOTrainer

* style: rename PPOTrainer argument

* fix: align with modified PPO arguments

* test: align with modified train_prompts arguments

* chore: modify train_prompts

* docs: align with modified arguments

* fix: remove unnecessary output

* fix: move dataloader to fit fn of SLTrainer

* fix: move dataloader to fit fn of OnPolicyTrainer

* fix: modify usage of prompt and pretrain dataloader
(#4094)

* feat: remove on_learn_epoch fn as not used

* revert: add _on_learn_epoch fn

* to: remove the use of NaiveStrategy

* test: remove NaiveStrategy tests

* feat: remove NaiveStrategy

* style: modify comments and params

* feat: split ColossalAIStrategy into LowLevelZeroStrategy and GeminiStrategy

* fix: remove naive

* fix: align with modified colossal strategy

* fix: fix ddp _try_init_dist arg
* feat: remove on_learn_epoch fn as not used

* revert: add _on_learn_epoch fn

* feat: remove NaiveStrategy

* test: update train_prompts tests

* fix: remove prepare_llama_tokenizer_and_embedding

* test: add lora arg

* feat: remove roberta support in train_prompts due to runtime errs

* feat: remove deberta & roberta in rm as not used

* test: remove deberta and roberta tests

* feat: remove deberta and roberta models as not used

* fix: remove calls to roberta

* fix: remove prepare_llama_tokenizer_and_embedding

* chore: update transformers version

* docs: update transformers version

* fix: fix actor inference

* fix: fix ci

* feat: change llama pad token to unk

* revert: revert ddp setup_distributed

* fix: change llama pad token to unk

* revert: undo unnecessary changes

* fix: use pip to install transformers
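
For context on the "change llama pad token to unk" commits above: the LLaMA tokenizer ships without a pad token, and reusing the existing unk token avoids adding a new token (which would force resizing the embedding matrix). A minimal sketch; the checkpoint path is a placeholder:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/llama")  # placeholder path
# LLaMA defines no pad token; reuse unk instead of adding a new token,
# which would require resizing the model's embedding matrix.
tokenizer.pad_token = tokenizer.unk_token
```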
* init shardformer code structure

* add implementation of sharder (inject and replace)

* add implementation of replacing layers with colossal layers

* separate different layer policy, add some notion

* implement 1d and 2d slicer, can tell col or row

* fix bug when slicing and inject model

* fix some bug; add inference test example
(hpcaitech#3816)

* init shardformer code structure

* add implementation of sharder (inject and replace)

* add implementation of replacing layers with colossal layers

* separate different layer policy, add some notion

* implement 1d and 2d slicer, can tell col or row

* fix bug when slicing and inject model

* fix some bug; add inference test example

* add share weight and train example

* add train

* add docstring and readme

* add docstring for other files

* pre-commit
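
For context on the slicer commits above: 1D tensor parallelism slices a `Linear` either along the output features (column parallel, outputs concatenated) or the input features (row parallel, outputs all-reduced). A minimal sketch with a hypothetical helper name; the actual shardformer slicer API differs:

```python
import torch
import torch.nn as nn

def shard_linear(linear: nn.Linear, rank: int, world_size: int, mode: str) -> nn.Linear:
    """Slice an nn.Linear for 1D tensor parallelism (hypothetical helper)."""
    if mode == "col":
        # Column parallel: split output features; per-rank outputs are concatenated.
        weight = linear.weight.chunk(world_size, dim=0)[rank]
        bias = linear.bias.chunk(world_size, dim=0)[rank] if linear.bias is not None else None
    elif mode == "row":
        # Row parallel: split input features; per-rank outputs are summed (all-reduce).
        weight = linear.weight.chunk(world_size, dim=1)[rank]
        bias = linear.bias if rank == 0 else None  # add the bias exactly once
    else:
        raise ValueError(f"unknown mode: {mode}")
    sharded = nn.Linear(weight.shape[1], weight.shape[0], bias=bias is not None)
    with torch.no_grad():
        sharded.weight.copy_(weight)
        if bias is not None:
            sharded.bias.copy_(bias)
    return sharded
```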
* [shardformer] refactored the user api

* polish code
* update readme with modules content

* remove img
(hpcaitech#3856)

* add dropout layer, add dropout test

* modify seed manager as context manager

* add a copy of col_nn.layer

* add dist_crossentropy loss; separate module test

* polish the code

* fix dist crossentropy loss
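
For context on the dist_crossentropy commits above: when the vocabulary dimension of the logits is sharded across tensor-parallel ranks, cross-entropy needs a global max, the target logit from the rank that owns it, and a global log-sum-exp. A forward-only sketch of that data flow; the real loss also implements a custom backward:

```python
import torch
import torch.distributed as dist

def dist_cross_entropy(logits: torch.Tensor, target: torch.Tensor, group=None):
    """logits: [batch, vocab/world] local shard; target: [batch] global ids."""
    rank = dist.get_rank(group)
    part = logits.size(-1)
    start = rank * part

    # Stabilize with the *global* max logit.
    logits_max = logits.max(dim=-1, keepdim=True).values
    dist.all_reduce(logits_max, op=dist.ReduceOp.MAX, group=group)
    logits = logits - logits_max

    # Fetch the target logit from whichever rank owns that vocab slice.
    mask = (target >= start) & (target < start + part)
    local_idx = (target - start).clamp(0, part - 1)
    target_logits = logits.gather(-1, local_idx.unsqueeze(-1)).squeeze(-1)
    target_logits = torch.where(mask, target_logits, torch.zeros_like(target_logits))
    dist.all_reduce(target_logits, op=dist.ReduceOp.SUM, group=group)

    # Global log-sum-exp over the sharded vocabulary.
    sum_exp = logits.exp().sum(dim=-1)
    dist.all_reduce(sum_exp, op=dist.ReduceOp.SUM, group=group)
    return (sum_exp.log() - target_logits).mean()
```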
(hpcaitech#3883)

* add gpt2 policy and modify shard and slicer to support

* remove unused code

* polish code
* add bert align test, fix dist loss bug

* forward and backward align

* add ignore index

* add shardformer CI

* add gather_output optional for user in shardconfig

* update readme with optional gather_output

* add dist crossentropy loss test, remove unused files

* remove unused file

* remove unused file

* rename the file

* polish code
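
For context on `gather_output` above: a column-parallel layer produces only a slice of the output features on each rank, so the config flag decides whether to rebuild the full output or hand the local slice to the next (row-parallel) layer. A forward-only sketch; plain `torch.distributed` collectives are not autograd-aware, so this only illustrates the data flow:

```python
import torch
import torch.distributed as dist
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """Each rank computes a slice of the output features (sketch)."""

    def __init__(self, in_features: int, out_features: int,
                 world_size: int, gather_output: bool = True):
        super().__init__()
        assert out_features % world_size == 0
        self.local = nn.Linear(in_features, out_features // world_size)
        self.world_size = world_size
        self.gather_output = gather_output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.local(x)
        if self.gather_output:
            # Rebuild the full output on every rank.
            parts = [torch.empty_like(y) for _ in range(self.world_size)]
            dist.all_gather(parts, y)
            y = torch.cat(parts, dim=-1)
        return y  # otherwise: the local slice, for a following row-parallel layer
```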
* fix bug in slicer, add slicer unit test

* add dropout test

* use pid as dropout seed

* update dropout test with local pattern

* add todo
(hpcaitech#3949)

* add dist dropout in model

* update docstring and bert policy with dropout

* refactor basepolicy and sharded, update bert

* update format

* update gpt2 policy

* update bert policy

* remove unused code

* update readme for new policy usage
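
For context on the dist dropout commits above (and the earlier "modify seed manager as context manager"): tensor-parallel ranks must agree on when dropout masks should match and when they should differ, which can be done by swapping in a privately managed RNG state around the dropout call. A hypothetical single-GPU sketch; the real seed manager is more involved:

```python
import contextlib
import torch

class SeedManager:
    """Keep a private CUDA RNG state for dropout (hypothetical sketch).

    Same seed on all ranks -> identical masks; seed + rank -> different masks.
    """

    def __init__(self, seed: int):
        original = torch.cuda.get_rng_state()
        torch.cuda.manual_seed(seed)
        self.rng_state = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(original)

    @contextlib.contextmanager
    def dropout_mode(self):
        outside = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(self.rng_state)
        try:
            yield
        finally:
            # Remember where this RNG stream left off, then restore
            # the surrounding state.
            self.rng_state = torch.cuda.get_rng_state()
            torch.cuda.set_rng_state(outside)

# Usage:
#   with seed_manager.dropout_mode():
#       x = torch.nn.functional.dropout(x, p=0.1, training=True)
```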
FrankLeeeee and others added 26 commits July 4, 2023 16:05
* [shardformer] support module saving and loading

* polish code
* add linearconv1d test

* add linearconv1d test
* add layernorm to bert

* add layernorm test

* add layernorm test with load state dict

* add use_mixedfusedLN in shard config

* refactor policy to support fused_layernorm
* [test] fixed tests failed due to dtensor change

* polish code
* [shardformer] shardformer support opt models

* [shardformer] shardformer support opt models, fix

* [shardformer] shardformer support opt models, fix

* [shardformer] shardformer support opt models, fix
* first v of vit shardformer

* keep vit

* update

* vit shard add vitattention vitlayer

* update num head shard para

* finish test for vit

* add new_model_class & postprocess

* add vit readme

* delete old files & fix the conflict

* fix sth
(hpcaitech#4126)

* [shardformer] add benchmark of shardformer

* [shardformer] add benchmark of shardformer
* [shardformer] refactored some doc and api

* polish code
* [shardformer] made tensor parallelism configurable

* polish code
(hpcaitech#4157)

Co-authored-by: github-actions <github-actions@github.com>
* [cluster] add process group mesh

* [test] add process group mesh test

* force sync
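
For context: a process group mesh views the flat ranks as an N-D grid so that tensor, pipeline, and data parallel subgroups can be carved out along different axes. A minimal sketch of the idea with a hypothetical API (not the actual `colossalai.cluster` implementation):

```python
import numpy as np
import torch.distributed as dist

class ProcessGroupMesh:
    """View world ranks as an N-D grid, e.g. sizes=(2, 4) for 8 ranks (sketch)."""

    def __init__(self, *sizes: int):
        assert dist.get_world_size() == int(np.prod(sizes))
        self.sizes = sizes
        self.rank = dist.get_rank()
        self.coord = np.unravel_index(self.rank, sizes)

    def create_group_along_axis(self, axis: int):
        """Return the subgroup containing this rank that varies along `axis`."""
        my_group = None
        # dist.new_group must be executed by every rank for every group,
        # in the same order, so iterate over all slices of the mesh.
        other_axes = [s for i, s in enumerate(self.sizes) if i != axis]
        for idx in np.ndindex(*other_axes):
            coords = list(idx)
            ranks = [
                int(np.ravel_multi_index(coords[:axis] + [k] + coords[axis:], self.sizes))
                for k in range(self.sizes[axis])
            ]
            group = dist.new_group(ranks)
            if self.rank in ranks:
                my_group = group
        return my_group
```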
* [pipeline] add stage manager

* [test] add pipeline stage manager test

* [pipeline] add docstring for stage manager
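
Building on the mesh sketch above, a stage manager can derive this rank's pipeline stage from one mesh axis and expose the predicates the schedules need (again hypothetical, reusing the `ProcessGroupMesh` sketch):

```python
class PipelineStageManager:
    """Map one axis of a ProcessGroupMesh to pipeline stages (sketch)."""

    def __init__(self, mesh: "ProcessGroupMesh", pipeline_axis: int):
        self.stage = int(mesh.coord[pipeline_axis])
        self.num_stages = mesh.sizes[pipeline_axis]

    @property
    def is_first_stage(self) -> bool:
        return self.stage == 0

    @property
    def is_last_stage(self) -> bool:
        return self.stage == self.num_stages - 1
```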
* [pipeline] add p2p communication

* [test] add p2p communication test

* [test] add rerun decorator

* [test] rename to avoid conflict
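
For context: pipeline p2p passes activation tensors between adjacent stages. A simplified sketch using blocking `torch.distributed` send/recv, transmitting the shape first so the receiver can allocate a buffer (the real implementation serializes arbitrary metadata objects):

```python
import torch
import torch.distributed as dist

def p2p_send(tensor: torch.Tensor, dst: int) -> None:
    shape = torch.tensor(tensor.shape, dtype=torch.long)
    ndim = torch.tensor([shape.numel()], dtype=torch.long)
    dist.send(ndim, dst)                 # 1. number of dimensions
    dist.send(shape, dst)                # 2. shape
    dist.send(tensor.contiguous(), dst)  # 3. payload

def p2p_recv(src: int, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    ndim = torch.empty(1, dtype=torch.long)
    dist.recv(ndim, src)
    shape = torch.empty(ndim.item(), dtype=torch.long)
    dist.recv(shape, src)
    buffer = torch.empty(*shape.tolist(), dtype=dtype)
    dist.recv(buffer, src)
    return buffer
```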
* [api] update optimizer wrapper to fit pipeline

* [pipeline] add base schedule

* [pipeline] add 1f1b schedule

* [test] add pipeline schedule utils test

* [pipeline] fix import
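
For context: the 1F1B schedule caps in-flight microbatches by running a few warmup forwards, then alternating one forward with one backward, then draining the remaining backwards. A sketch of the ordering for a single stage, with communication omitted:

```python
def one_f_one_b_order(stage: int, num_stages: int, num_microbatches: int,
                      forward, backward) -> None:
    """Invoke the given callbacks in 1F1B order for one stage (sketch)."""
    # Earlier stages need more warmup forwards; the last stage needs none.
    num_warmup = min(num_stages - stage - 1, num_microbatches)
    num_steady = num_microbatches - num_warmup

    for _ in range(num_warmup):   # warmup: fill the pipeline
        forward()
    for _ in range(num_steady):   # steady state: one forward, one backward
        forward()
        backward()
    for _ in range(num_warmup):   # cooldown: drain outstanding backwards
        backward()
```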
* add pipeline policy and bert forward to be done

* add bertmodel pipeline forward and make tests

* add Bert_Policy and test for policy

* update formatting

* update formatting

* update the code

* fix bugs

* fix name conflict
@ver217 requested a review from FrankLeeeee on July 4, 2023 at 10:05
@FrankLeeeee merged commit ef1f972 into hpcaitech:feature/pipeline on Jul 4, 2023
22 of 26 checks passed
@ver217 deleted the sync/main branch on September 13, 2023