feat(ckpt): optimize model checkpointing in Volc and Ali #65
Motivation
We re-submit this PR since the corresponding PR in the InternLM repo has not been merged yet.
Due to the Python GIL, training throughput (measured by tgs) drops when we asynchronously upload checkpoints via `concurrent.futures.ThreadPoolExecutor`. Therefore, we switch to `concurrent.futures.ProcessPoolExecutor` for async checkpoint uploads, which adds nearly no overhead to training performance.
Implementation
- Use `concurrent.futures.ProcessPoolExecutor` to upload checkpoints to object storage. Our comparative test results show that using multiple processes reduces the overhead of async uploading to an almost negligible level.
- Use `dill` to hijack the default `pickle` serialization/deserialization implementation in `multiprocessing`.
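Below is a minimal, self-contained sketch of the idea (not the code in this PR): `upload_to_object_storage` is a hypothetical stand-in for the real Volc/Ali storage clients, and instead of patching the pickler used by `multiprocessing` globally, the sketch dill-encodes the task at the submit boundary; the effect is the same in that the upload runs in a separate worker process and does not compete with the trainer for the GIL.

```python
import concurrent.futures

import dill  # third-party: pip install dill


def _run_dill_task(payload: bytes):
    """Worker-side entry point: decode the dill payload and run the task.

    Module-level, so the standard pickle used by multiprocessing can ship it
    to the worker; only the payload itself goes through dill.
    """
    fn, args, kwargs = dill.loads(payload)
    return fn(*args, **kwargs)


def submit_with_dill(executor, fn, *args, **kwargs):
    """Submit fn(*args, **kwargs) with dill serialization, so objects that the
    default pickle rejects (e.g. closures, lambdas) can cross the process
    boundary."""
    payload = dill.dumps((fn, args, kwargs))
    return executor.submit(_run_dill_task, payload)


def upload_to_object_storage(local_path: str, remote_uri: str) -> str:
    """Hypothetical uploader standing in for the real Volc/Ali storage client."""
    with open(local_path, "rb") as f:
        data = f.read()
    # ... call the object-storage SDK here (tos / oss2, depending on backend) ...
    return f"uploaded {len(data)} bytes to {remote_uri}"


if __name__ == "__main__":
    # Write a small dummy "checkpoint" so the sketch runs end to end.
    ckpt_path = "/tmp/dummy_ckpt.pt"
    with open(ckpt_path, "wb") as f:
        f.write(b"\x00" * 1024)

    # In training, the pool would live for the whole run and the futures would
    # be collected after later steps, so the upload overlaps with training
    # instead of stealing CPU time from the training process.
    with concurrent.futures.ProcessPoolExecutor(max_workers=1) as pool:
        future = submit_with_dill(
            pool, upload_to_object_storage, ckpt_path, "s3://my-bucket/ckpts/dummy_ckpt.pt"
        )
        print(future.result())
```

With a long-lived pool, the only cost that lands on the training loop is serializing the task and the inter-process communication, which matches the overhead numbers reported below.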
BC-breaking (Optional)
None
Use cases (Optional)
We conducted a comparative test on the 65B model:
- Using `ThreadPoolExecutor` to asynchronously upload checkpoints, the overhead is huge: the tgs of the affected steps decays from 148 to ~70, and at least dozens of steps are affected (because Python multithreading offers no real parallelism under the GIL, CPU time is taken by the async upload).
- Using `ProcessPoolExecutor` to asynchronously upload checkpoints, the overhead is very small: the tgs of the affected steps decays from 148 to 145, and only one step is affected (the overhead comes only from inter-process communication).
- Using `ThreadPoolExecutor` to asynchronously upload checkpoints, the overhead is huge: the tgs of the affected steps decays from 148 to ~120, and at least dozens of steps are affected (again because the GIL lets the async upload occupy CPU time).
- Using `ProcessPoolExecutor` to asynchronously upload checkpoints, the overhead is very small: the tgs of the affected steps decays from 150 to 143, and only one step is affected (the overhead comes only from inter-process communication).
Checklist
Before PR:
After PR: