I'd like to help with this, but I would appreciate input on the approach to increase likelihood of merge.
The gist is that I'd like some way to allow (but not require) stores to be smarter about batching or parallelizing chunk access. Stores may know something that zarr doesn't about how chunks are arranged and may be able to optimize better than the zarr library.
Concretely, #535 will allow Zarr to read remote NetCDF4 / HDF5 files using Range GET operations. In the wild, these files often require many, many chunk reads of (near-)contiguous bytes for common operations. A smarter store could merge contiguous byte ranges into single requests and provide multiple ranges in a single request, as the Range header allows.
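To make the merging idea concrete, here is a minimal sketch of the kind of range coalescing such a store could do. The function name and the `max_gap` knob are hypothetical, not part of any existing Zarr API:

```python
def coalesce_ranges(ranges, max_gap=0):
    """Merge (near-)contiguous (offset, length) byte ranges so a store
    can issue fewer Range GETs. `max_gap` permits merging ranges that
    are separated by up to that many bytes."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset <= merged[-1][0] + merged[-1][1] + max_gap:
            prev_offset, prev_length = merged[-1]
            merged[-1] = (prev_offset,
                          max(prev_length, offset + length - prev_offset))
        else:
            merged.append((offset, length))
    return merged

# Three chunk reads collapse into two requests:
# coalesce_ranges([(0, 10), (10, 10), (30, 5)]) -> [(0, 20), (30, 5)]
```

The merged spans could then be expressed as a single multi-range header, e.g. `Range: bytes=0-19,30-34`.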
The core issue is that Zarr's current implementation asks a store for chunks one key at a time, synchronously. I see three possible ways to fix this, each with pros and cons, and it's possible I'm missing others:
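To illustrate the cost, here is a toy store (purely hypothetical, not a real Zarr store class) that counts round trips; under the current access pattern every chunk key is a separate, blocking round trip:

```python
class CountingStore(dict):
    """Toy in-memory store that counts round trips, for illustration only."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.round_trips = 0

    def __getitem__(self, key):
        self.round_trips += 1  # each key lookup is one blocking round trip
        return super().__getitem__(key)

store = CountingStore({"0.0": b"a", "0.1": b"b", "0.2": b"c"})
chunks = [store[key] for key in ["0.0", "0.1", "0.2"]]  # how Zarr reads today
# store.round_trips is now 3: one round trip per chunk
```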
Option 1: Allow users to request concurrency
PR #534 implements this. A sufficiently smart store could collect concurrent requests before sending them out and merge as appropriate.
Pros:
Already largely implemented
Low-impact to current interfaces
Broad benefits to existing stores with no additional changes to the stores themselves
Cons:
Incurs the overhead and limitations of threads. Imagine an array split into 10,000 chunks: that becomes 10,000 async operations that must be scattered only to be immediately gathered in a thread-safe way.
Requires users to ask for concurrency when there is no downside to the store always providing it
Requires the backing store to do a fair amount of complex thread-aware work, possibly waiting briefly before sending any request to make sure it has gathered all concurrent requests
Loses possibly helpful sequencing information, since chunk keys arrive in a random order
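For a sense of what that thread-aware work looks like, here is a rough sketch (hypothetical names, deliberately simplified, not production code) of a store that buffers concurrent `__getitem__` calls for a brief window before serving them as one batch — the wait-then-gather dance described above:

```python
import threading
import time

class GatheringStore:
    """Sketch of the thread-aware store Option 1 implies (hypothetical):
    concurrent __getitem__ calls are buffered for a short window, then
    served as one batch. A real store would merge the batch into a
    single request; here the point is the locking/waiting machinery."""

    def __init__(self, backend, window=0.01):
        self._backend = backend          # plain mapping of key -> bytes
        self._window = window            # gather window in seconds
        self._lock = threading.Lock()
        self._pending = {}               # key -> Event signalled when ready
        self._results = {}

    def __getitem__(self, key):
        with self._lock:
            leader = not self._pending   # first caller drives the batch
            event = self._pending.setdefault(key, threading.Event())
        if leader:
            time.sleep(self._window)     # wait for other threads to join in
            with self._lock:
                batch, self._pending = self._pending, {}
            for k in batch:              # a real store would merge these reads
                self._results[k] = self._backend[k]
            for ev in batch.values():
                ev.set()
        event.wait()
        return self._results[key]
```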
Option 2: Allow stores to request concurrency
This is almost the same as option 1, but instead of users saying they want concurrency, the store can specify that it wants it via some property, interface, etc.
Pros (vs option 1):
Avoids issues when used on stores that are not thread-safe
Can be automatic / default
Cons (vs option 1):
There may be cases where users want to be explicit about whether, and which kind of, concurrency is used.
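A minimal sketch of how Option 2 could look (the attribute and function names are purely illustrative, not an existing Zarr interface): the store opts in, and the library checks for the flag before using threads:

```python
from concurrent.futures import ThreadPoolExecutor

class ConcurrencyFriendlyStore(dict):
    """A store that declares itself safe for concurrent access; the
    attribute name is hypothetical."""
    prefers_concurrency = True

def fetch_chunks(store, keys):
    """Fetch chunk values, using a thread pool only when the store has
    opted in; otherwise fall back to sequential access."""
    if getattr(store, "prefers_concurrency", False):
        with ThreadPoolExecutor(max_workers=8) as pool:
            return list(pool.map(store.__getitem__, keys))
    return [store[key] for key in keys]
```

Because the opt-in lives on the store, stores that are not thread-safe are never touched concurrently and users need not configure anything.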
Option 3: Zarr optionally requests a list of keys rather than one at a time
The idea here would be to allow stores to optionally implement a method, say get_items that accepts a list of keys and returns an iterable response corresponding to their values. Stores not implementing this interface would get the current implementation.
Pros:
Much simpler implementation for both Zarr and stores implementing the method
More efficient, requiring no pauses or unhelpful threads
Can provide additional efficiencies if Zarr guarantees, say, that it will pass the keys in ascending order by axis.
Cons:
While stores could still just implement MutableMapping, this would introduce an additional interface method that is non-standard
May clash with behavior from option 1. What do you do when you've been simultaneously asked for concurrent execution and the store allows you to pass an array of chunks? Note: does not clash with option 2.
Less benefit to stores without this access need. It would allow those stores an easy way to provide their own concurrency, but it wouldn't be as turnkey as the prior two options.
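As a sketch of the optional-method idea (the `get_items` name comes from the proposal above; the surrounding code is hypothetical): the library probes for the method and falls back to plain MutableMapping access when it is absent:

```python
class MultiGetStore(dict):
    """Store implementing the optional batched interface. A real remote
    store would translate the keys into merged Range requests here;
    this toy version just looks them up in order."""
    def get_items(self, keys):
        return [self[key] for key in keys]

def fetch_chunks_batched(store, keys):
    """Use get_items when the store provides it; otherwise fall back to
    today's one-key-at-a-time access."""
    if hasattr(store, "get_items"):
        return list(store.get_items(keys))
    return [store[key] for key in keys]
```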
Option 4: ???
I'm not an expert on Zarr. Am I missing something?
Please let me know if this is worth pursuing and how you'd like me to proceed.
Just to say thanks for raising this, important discussion that relates to other discussions also happening elsewhere (e.g., #536). Will follow up with more comments asap.