Pass correct node size for ZeRO++ #4085

Merged 4 commits on Aug 9, 2023

Conversation

@cmikeh2 cmikeh2 commented Aug 3, 2023

No description provided.

@cmikeh2 cmikeh2 enabled auto-merge August 3, 2023 21:18

@samadejacobs samadejacobs left a comment

LGTM

@cmikeh2 cmikeh2 added this pull request to the merge queue Aug 9, 2023
Merged via the queue into master with commit f0463b4 Aug 9, 2023
16 checks passed
hughpu pushed a commit to hughpu/DeepSpeed that referenced this pull request Aug 9, 2023
* Pass correct node size

* formatting

---------

Co-authored-by: Connor Holmes <development@cmikeh2.me>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
github-merge-queue bot pushed a commit that referenced this pull request Aug 29, 2023
…port inputs require no grad (#4118)

* feat: add `non_reentrant_checkpoint`

* feat: add missing output postprocess and change the hook to record leaf forward tensor refs

* fix: make the multi_grad_hook registered after graph construction

* fix: backward compatibility for multi_tensor_hook

* fix: nonlocal reference error of deepspeed_saved_tensors

* fix: reduce repeating hook registration

* test: add test for `activation_checkpointing.checkpointing.non_reentrant_checkpoint`

* Pass correct node size for ZeRO++ (#4085)

* Pass correct node size

* formatting

---------

Co-authored-by: Connor Holmes <development@cmikeh2.me>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* add deepspeed chat arxiv report (#4110)

* add deepspeed chat arxiv report

* add zeroquant v2 and fp

* add selective enhencement

* add ignore for 'Youn' in spell checker

---------

Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* style: change flake8 detected style missmatch

* test: hack to clone the `test_activation_checkpointing` module for reuse and add regression tests

* doc: explain the introduction of `non_reentrant_checkpoint`

* doc: explain the test of `non_reentrant_checkpoint`

---------

Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Connor Holmes <development@cmikeh2.me>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
github-merge-queue bot pushed a commit that referenced this pull request Sep 6, 2023
… true for activation checkpoint layer in pipeline train. (#4224)

* feat: add `non_reentrant_checkpoint`

* feat: add missing output postprocess and change the hook to record leaf forward tensor refs

* fix: make the multi_grad_hook registered after graph construction

* fix: backward compatibility for multi_tensor_hook

* fix: nonlocal reference error of deepspeed_saved_tensors

* fix: reduce repeating hook registration

* test: add test for `activation_checkpointing.checkpointing.non_reentrant_checkpoint`

* Pass correct node size for ZeRO++ (#4085)

* Pass correct node size

* formatting

---------

Co-authored-by: Connor Holmes <development@cmikeh2.me>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* add deepspeed chat arxiv report (#4110)

* add deepspeed chat arxiv report

* add zeroquant v2 and fp

* add selective enhencement

* add ignore for 'Youn' in spell checker

---------

Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* style: change flake8 detected style missmatch

* test: hack to clone the `test_activation_checkpointing` module for reuse and add regression tests

* doc: explain the introduction of `non_reentrant_checkpoint`

* doc: explain the test of `non_reentrant_checkpoint`

* apply non_reentrant_checkpoint in pipeline parallel training

* ut pass

* fix ci

* reduce check level for ci

---------

Co-authored-by: hughpu <hughpu@hotmail.com>
Co-authored-by: Hugh Pu <31498041+hughpu@users.noreply.github.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Connor Holmes <development@cmikeh2.me>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
3 participants