
Improve parallel process of universal checkpoint conversion #5343

Merged: 7 commits into microsoft:master on Apr 22, 2024

Conversation

@tohtana (Contributor) commented Apr 1, 2024

The conversion script from a regular checkpoint to the universal one runs the following steps in parallel:

  1. extract the ZeRO-sharded optimizer states
  2. merge the shards

However, it passes `map()` only a small batch of tasks at a time (the number specified as workers), so it has to wait for the slowest task in each batch to finish before starting the next batch.
This PR submits all the tasks to the pool up front and waits until the futures are ready, which keeps all workers busy. See the sketch below.
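A minimal sketch of the scheduling change, using `concurrent.futures`. The task function `process_shard` and its arguments are illustrative placeholders, not the conversion script's actual API.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_shard(shard_path):
    # Placeholder for the per-shard work (e.g. extracting or merging
    # ZeRO-sharded optimizer states).
    ...

def run_chunked(shard_paths, num_workers):
    # Old pattern: feed the pool one batch of `num_workers` tasks at a time.
    # Each batch waits for its slowest task before the next batch starts,
    # leaving faster workers idle in the meantime.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        for i in range(0, len(shard_paths), num_workers):
            batch = shard_paths[i:i + num_workers]
            list(pool.map(process_shard, batch))

def run_all_at_once(shard_paths, num_workers):
    # New pattern: submit every task up front and wait on the futures.
    # An idle worker immediately picks up the next pending task.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(process_shard, p) for p in shard_paths]
        for f in as_completed(futures):
            f.result()  # re-raise any exception from the worker
```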

@tohtana tohtana requested a review from tjruwase as a code owner April 1, 2024 08:23
@tohtana tohtana changed the title improve parallel process of universal checkpoint conversion Improve parallel process of universal checkpoint conversion Apr 1, 2024
@tjruwase (Contributor) commented Apr 1, 2024

@tohtana, this is an amazing improvement. Did you observe any conversion speedups that you can share in this PR?

@tohtana (Contributor, Author) commented Apr 1, 2024

@tjruwase I converted data on blob storage. The run was entirely I/O-bound, but I still observed a 10-20% speedup in total.
At the very beginning of the conversion it was about 2x faster, but it slowed down soon after; I believe that was a limitation of the blob storage.

@tohtana tohtana enabled auto-merge April 1, 2024 23:05
@tohtana tohtana added this pull request to the merge queue Apr 22, 2024
Merged via the queue into microsoft:master with commit c292b03 Apr 22, 2024
12 checks passed
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
umchand pushed a commit to umchand/DeepSpeed that referenced this pull request May 20, 2024
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024