
feat(ckpt): optimize model checkpointing in Volc and Ali #65

Merged (1 commit) on Mar 4, 2024

Conversation

@season0528 (Collaborator) commented Feb 29, 2024

Motivation

We re-submit this PR here since the original PR in the InternLM repo has not been merged yet.

Due to the Python GIL, training throughput (measured in tgs) degrades when we asynchronously upload checkpoints via concurrent.futures.ThreadPoolExecutor.
We therefore switch to concurrent.futures.ProcessPoolExecutor for async checkpoint uploads, which has nearly no overhead on training performance.

Implementation

  1. For Aliyun OSS2 and Volc TOS, we switch to concurrent.futures.ProcessPoolExecutor for uploading checkpoints to object storage. Our comparative test results show that using multiple processes reduces the overhead of async uploading to almost negligible.
  2. Use dill to hijack the default pickle serialization/deserialization implementation in multiprocessing.

BC-breaking (Optional)

None

Use cases (Optional)

We conducted a comparative test on a 65B model.

  1. For Volc, training throughput (measured in tgs) during uploads improves from ~70 to 145, and the affected steps are reduced from several dozen to one.
  • When using ThreadPoolExecutor to asynchronously upload checkpoints, the overhead is huge: the tgs of affected steps drops from 148 to ~70 and at least dozens of steps are affected. (Because of the GIL, Python multithreading provides no real parallelism, so the async upload steals CPU time from training.)

[screenshot: tgs curve during ThreadPoolExecutor upload, Volc]

  • When using ProcessPoolExecutor to asynchronously upload checkpoints, the overhead is very small: the tgs of affected steps drops only from 148 to 145 and just one step is affected (the overhead comes solely from inter-process communication).

[screenshot: tgs curve during ProcessPoolExecutor upload, Volc]

  2. Our implementation also works for Aliyun. Training throughput (measured in tgs) during uploads improves from ~120 to 143, and the affected steps are reduced from several dozen to one.
  • When using ThreadPoolExecutor to asynchronously upload checkpoints, the overhead is huge: the tgs of affected steps drops from 148 to ~120 and at least dozens of steps are affected. (Because of the GIL, Python multithreading provides no real parallelism, so the async upload steals CPU time from training.)

[screenshot: tgs curve during ThreadPoolExecutor upload, Aliyun]

  • When using ProcessPoolExecutor to asynchronously upload checkpoints, the overhead is very small: the tgs of affected steps drops only from 150 to 143 and just one step is affected (the overhead comes solely from inter-process communication).

[screenshot: tgs curve during ProcessPoolExecutor upload, Aliyun]

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  • The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

@season0528 season0528 force-pushed the feat/optimize-async-upload branch from 74c54f0 to dbb7f81 Compare March 1, 2024 09:17
@gaoyang07 gaoyang07 added the enhancement (New feature or request) label Mar 4, 2024
@gaoyang07 gaoyang07 merged commit e465142 into InternLM:develop Mar 4, 2024
15 checks passed