-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use task-based rechunking to prechunk along partial boundaries #8831
Use task-based rechunking to prechunk along partial boundaries #8831
Conversation
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 25 files ± 0 25 suites ±0 10h 20m 20s ⏱️ + 8m 25s For more details on these failures, see this check. Results for commit 245f345. ± Comparison against base commit fe79a36. ♻️ This comment has been updated with latest results. |
d1a204c
to
c23fd8f
Compare
.pre-commit-config.yaml
Outdated
@@ -64,7 +64,7 @@ repos: | |||
- tornado | |||
- pyarrow | |||
- urllib3 | |||
- git+https://github.com/dask/dask | |||
- git+https://github.com/hendrikmakait/dask@prechunk-p2p-rechunk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: Revert before merging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only 2 nits, the logic seems to be fine that you added, I still don't quite understand the logic in _slice_new_chunks_into_partials
but that't out of scope here
distributed/shuffle/_rechunk.py
Outdated
@@ -431,52 +432,108 @@ def _construct_graph(self) -> _T_LowLevelGraph: | |||
return dsk | |||
|
|||
|
|||
def _prechunk_for_partials( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function name implies to me that we apply the rechunking here, could you rename to something that tells us that we are only calculating the chunks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
[((2, 2),), ((2, 2),), ((2, 2),)], | ||
[((2, 2),), ((4,),), ((2, 2),)], | ||
[((2, 2),), ((1, 1, 1, 1),), ((2, 2),)], | ||
[((2, 2, 2),), ((1, 2, 2, 1),), ((1, 1, 1, 1, 2),)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the last 2 in expected a good idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it because the 1 in new at the end means that the input chunk only creates 2 output chunks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've thought about this a little more, and you're right! There was a small bug here. I've adjusted the algorithm accordingly.
looks like there is an env mismatch of some kind because of tags |
This PR faciliates P2P rechunking by offloading pre-chunking along partial boundaries to task-based rechunking. This also functions as a predecessor to pre-concatenation of small chunks which can also be offloaded to P2P rechunking.
pre-commit run --all-files