
Improve parallel process of universal checkpoint conversion #5343

Merged: 7 commits into microsoft:master on Apr 22, 2024

Conversation

@tohtana (Contributor) commented Apr 1, 2024

The conversion script from a regular checkpoint to the universal one runs the following steps in parallel:

  1. extract the ZeRO-sharded optimizer states
  2. merge the shards

However, it passes `map()` only a small batch of tasks at a time (the number specified as workers), so it has to wait for the slowest task in each batch to finish before starting the next batch.
This PR submits all the tasks to the pool up front and waits until the futures are ready, which keeps all workers busy. See the sketch below.
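A minimal sketch of the scheduling change, using `concurrent.futures`. The task function `process_shard` and its arguments are illustrative placeholders, not the conversion script's actual API.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_shard(shard_path):
    # Placeholder for the per-shard work (e.g. extracting or merging
    # ZeRO-sharded optimizer states).
    ...

def run_chunked(shard_paths, num_workers):
    # Old pattern: feed the pool one batch of `num_workers` tasks at a time.
    # Each batch waits for its slowest task before the next batch starts,
    # leaving faster workers idle in the meantime.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        for i in range(0, len(shard_paths), num_workers):
            batch = shard_paths[i:i + num_workers]
            list(pool.map(process_shard, batch))

def run_all_at_once(shard_paths, num_workers):
    # New pattern: submit every task up front and wait on the futures.
    # An idle worker immediately picks up the next pending task.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(process_shard, p) for p in shard_paths]
        for f in as_completed(futures):
            f.result()  # re-raise any exception from the worker
```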

@tohtana tohtana requested a review from tjruwase as a code owner April 1, 2024 08:23
@tohtana tohtana changed the title improve parallel process of universal checkpoint conversion Improve parallel process of universal checkpoint conversion Apr 1, 2024
@tjruwase (Contributor) commented Apr 1, 2024

@tohtana, this is an amazing improvement. Did you observe any conversion speedups that you can share in this PR?

@tohtana (Contributor, Author) commented Apr 1, 2024

@tjruwase I converted data on blob storage. The run was entirely I/O-bound, but I still observed a 10-20% speedup in total.
At the very beginning of the conversion it was about 2x faster, but it slowed down soon after; I believe that was a limitation of the blob storage.

@tohtana tohtana enabled auto-merge April 1, 2024 23:05
@tohtana tohtana added this pull request to the merge queue Apr 22, 2024
Merged via the queue into microsoft:master with commit c292b03 Apr 22, 2024
12 checks passed
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
umchand pushed a commit to umchand/DeepSpeed that referenced this pull request May 20, 2024
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024