I'd like to help with this, but I would appreciate input on the approach to increase likelihood of merge.
The gist is that I'd like some way to allow (but not require) stores to be smarter about batching or parallelizing chunk access. Stores may know something that zarr doesn't about how chunks are arranged and may be able to optimize better than the zarr library.
Concretely, #535 will allow Zarr to read remote NetCDF4 / HDF5 files using Range GET operations. In the wild, these files often require many, many chunk reads of (near-)contiguous bytes for common operations. A smarter store could merge contiguous byte ranges into single requests and provide multiple ranges in a single request, as the Range header allows.
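To make the merging idea concrete, here is a minimal sketch of the kind of range coalescing such a store could do. The function name and the `max_gap` knob are hypothetical, not part of any existing Zarr API:

```python
def coalesce_ranges(ranges, max_gap=0):
    """Merge (near-)contiguous (offset, length) byte ranges so a store
    can issue fewer Range GETs. `max_gap` permits merging ranges that
    are separated by up to that many bytes."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset <= merged[-1][0] + merged[-1][1] + max_gap:
            prev_offset, prev_length = merged[-1]
            merged[-1] = (prev_offset,
                          max(prev_length, offset + length - prev_offset))
        else:
            merged.append((offset, length))
    return merged

# Three chunk reads collapse into two requests:
# coalesce_ranges([(0, 10), (10, 10), (30, 5)]) -> [(0, 20), (30, 5)]
```

The merged spans could then be expressed as a single multi-range header, e.g. `Range: bytes=0-19,30-34`.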
The core issue is that Zarr's current implementation asks a store for chunks one key at a time, synchronously. I see three possible ways to fix this, each with pros and cons, and it's possible I'm missing others:
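To illustrate the cost, here is a toy store (purely hypothetical, not a real Zarr store class) that counts round trips; under the current access pattern every chunk key is a separate, blocking round trip:

```python
class CountingStore(dict):
    """Toy in-memory store that counts round trips, for illustration only."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.round_trips = 0

    def __getitem__(self, key):
        self.round_trips += 1  # each key lookup is one blocking round trip
        return super().__getitem__(key)

store = CountingStore({"0.0": b"a", "0.1": b"b", "0.2": b"c"})
chunks = [store[key] for key in ["0.0", "0.1", "0.2"]]  # how Zarr reads today
# store.round_trips is now 3: one round trip per chunk
```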
Option 1: Allow users to request concurrency
PR #534 implements this. A sufficiently smart store could collect concurrent requests before sending them out and merge as appropriate.
Pros:
Already largely implemented
Low-impact to current interfaces
Broad benefits to existing stores with no additional changes to the stores themselves
Cons:
Incurs the overhead and limitations of threads. Imagine an array split into 10,000 chunks: that becomes 10,000 async operations that must be scattered only to be immediately gathered in a thread-safe way.
Requires users to ask for concurrency when there is no downside to the store always providing it
Requires the backing store to do a fair amount of complex thread-aware work, possibly waiting briefly before sending any request to make sure it has gathered all concurrent requests
Loses possibly helpful sequencing information, since chunk keys arrive in a random order
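For a sense of what that thread-aware work looks like, here is a rough sketch (hypothetical names, deliberately simplified, not production code) of a store that buffers concurrent `__getitem__` calls for a brief window before serving them as one batch — the wait-then-gather dance described above:

```python
import threading
import time

class GatheringStore:
    """Sketch of the thread-aware store Option 1 implies (hypothetical):
    concurrent __getitem__ calls are buffered for a short window, then
    served as one batch. A real store would merge the batch into a
    single request; here the point is the locking/waiting machinery."""

    def __init__(self, backend, window=0.01):
        self._backend = backend          # plain mapping of key -> bytes
        self._window = window            # gather window in seconds
        self._lock = threading.Lock()
        self._pending = {}               # key -> Event signalled when ready
        self._results = {}

    def __getitem__(self, key):
        with self._lock:
            leader = not self._pending   # first caller drives the batch
            event = self._pending.setdefault(key, threading.Event())
        if leader:
            time.sleep(self._window)     # wait for other threads to join in
            with self._lock:
                batch, self._pending = self._pending, {}
            for k in batch:              # a real store would merge these reads
                self._results[k] = self._backend[k]
            for ev in batch.values():
                ev.set()
        event.wait()
        return self._results[key]
```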
Option 2: Allow stores to request concurrency
This is almost the same as option 1, but instead of users saying they want concurrency, the store can specify that it wants it via some property, interface, etc.
Pros (vs option 1):
Avoids issues when used on stores that are not thread-safe
Can be automatic / default
Cons (vs option 1):
There may be cases where users want to be explicit about whether, and which kind of, concurrency is used.
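A minimal sketch of how Option 2 could look (the attribute and function names are purely illustrative, not an existing Zarr interface): the store opts in, and the library checks for the flag before using threads:

```python
from concurrent.futures import ThreadPoolExecutor

class ConcurrencyFriendlyStore(dict):
    """A store that declares itself safe for concurrent access; the
    attribute name is hypothetical."""
    prefers_concurrency = True

def fetch_chunks(store, keys):
    """Fetch chunk values, using a thread pool only when the store has
    opted in; otherwise fall back to sequential access."""
    if getattr(store, "prefers_concurrency", False):
        with ThreadPoolExecutor(max_workers=8) as pool:
            return list(pool.map(store.__getitem__, keys))
    return [store[key] for key in keys]
```

Because the opt-in lives on the store, stores that are not thread-safe are never touched concurrently and users need not configure anything.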
Option 3: Zarr optionally requests a list of keys rather than one at a time
The idea here would be to allow stores to optionally implement a method, say get_items that accepts a list of keys and returns an iterable response corresponding to their values. Stores not implementing this interface would get the current implementation.
Pros:
Much simpler implementation for both Zarr and stores implementing the method
More efficient, requiring no pauses or unhelpful threads
Can provide additional efficiencies if Zarr guarantees, say, that it will pass the keys in ascending order by axis.
Cons:
While stores could still just implement MutableMapping, this would introduce an additional interface method that is non-standard
May clash with behavior from option 1. What do you do when you've been simultaneously asked for concurrent execution and the store allows you to pass an array of chunks? Note: does not clash with option 2.
Less benefit to stores without this access need. It would allow those stores an easy way to provide their own concurrency, but it wouldn't be as turnkey as the prior two options.
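As a sketch of the optional-method idea (the `get_items` name comes from the proposal above; the surrounding code is hypothetical): the library probes for the method and falls back to plain MutableMapping access when it is absent:

```python
class MultiGetStore(dict):
    """Store implementing the optional batched interface. A real remote
    store would translate the keys into merged Range requests here;
    this toy version just looks them up in order."""
    def get_items(self, keys):
        return [self[key] for key in keys]

def fetch_chunks_batched(store, keys):
    """Use get_items when the store provides it; otherwise fall back to
    today's one-key-at-a-time access."""
    if hasattr(store, "get_items"):
        return list(store.get_items(keys))
    return [store[key] for key in keys]
```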
Option 4: ???
I'm not an expert on Zarr. Am I missing something?
Please let me know if this is worth pursuing and how you'd like me to proceed.
Just to say thanks for raising this, important discussion that relates to other discussions also happening elsewhere (e.g., #536). Will follow up with more comments asap.