
feat(ckpt): optimize model checkpointing in Volc and Ali #65

Merged (1 commit) on Mar 4, 2024

Conversation

@season0528 (Collaborator) commented Feb 29, 2024

Motivation

We re-submit this PR here since the original PR in the InternLM repo has not been merged yet.

Due to the Python GIL, training throughput (measured in tgs) degrades when we asynchronously upload checkpoints via concurrent.futures.ThreadPoolExecutor.
We therefore switch to concurrent.futures.ProcessPoolExecutor for async checkpoint uploads, which has nearly no overhead on training performance.

Implementation

  1. For Aliyun OSS2 and Volc TOS, we switch to concurrent.futures.ProcessPoolExecutor for uploading checkpoints to object storage. Our comparative test results show that using multiple processes reduces the overhead of async uploading to almost negligible.
  2. Use dill to hijack the default pickle serialization/deserialization implementation in multiprocessing.

BC-breaking (Optional)

None

Use cases (Optional)

We conducted a comparative test on a 65B model.

  1. For Volc, training throughput (measured in tgs) during uploads improves from ~70 to 145, and the affected steps are reduced from several dozen to one.
  • When using ThreadPoolExecutor to asynchronously upload checkpoints, the overhead is huge: the tgs of affected steps drops from 148 to ~70 and at least dozens of steps are affected. (Because of the GIL, Python multithreading provides no real parallelism, so the async upload steals CPU time from training.)

[screenshot: tgs curve during ThreadPoolExecutor upload, Volc]

  • When using ProcessPoolExecutor to asynchronously upload checkpoints, the overhead is very small: the tgs of affected steps drops only from 148 to 145 and just one step is affected (the overhead comes solely from inter-process communication).

[screenshot: tgs curve during ProcessPoolExecutor upload, Volc]

  2. Our implementation also works for Aliyun. Training throughput (measured in tgs) during uploads improves from ~120 to 143, and the affected steps are reduced from several dozen to one.
  • When using ThreadPoolExecutor to asynchronously upload checkpoints, the overhead is huge: the tgs of affected steps drops from 148 to ~120 and at least dozens of steps are affected. (Because of the GIL, Python multithreading provides no real parallelism, so the async upload steals CPU time from training.)

[screenshot: tgs curve during ThreadPoolExecutor upload, Aliyun]

  • When using ProcessPoolExecutor to asynchronously upload checkpoints, the overhead is very small: the tgs of affected steps drops only from 150 to 143 and just one step is affected (the overhead comes solely from inter-process communication).

[screenshot: tgs curve during ProcessPoolExecutor upload, Aliyun]

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  • The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

@season0528 season0528 force-pushed the feat/optimize-async-upload branch from 74c54f0 to dbb7f81 Compare March 1, 2024 09:17
@gaoyang07 gaoyang07 added the enhancement (New feature or request) label Mar 4, 2024
@gaoyang07 gaoyang07 merged commit e465142 into InternLM:develop Mar 4, 2024
15 checks passed