Add __dask_tokenize__ methods to Zarr objects #202
In the simplest case, this could be something like dask.base.normalize_token.register(zarr.Array)(lambda v: v.hexdigest("sha1")).
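For reference, a minimal sketch of that registration approach, assuming current zarr and dask APIs (normalize_token.register and Array.hexdigest exist; the example array is just for illustration):

```python
# Sketch only: register a tokenizer for zarr.Array with Dask so that
# repeatedly-loaded arrays hash deterministically. hexdigest() reads and
# hashes every chunk, so this can be expensive for large arrays.
import numpy as np
import zarr
from dask.base import normalize_token, tokenize

@normalize_token.register(zarr.Array)
def _tokenize_zarr_array(arr):
    return arr.hexdigest("sha1")

z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype="f8")
z[:] = np.random.random((1000, 1000))
assert tokenize(z) == tokenize(z)  # now stable across calls
```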
An alternative might be to use a weaker hash: one that is not cryptographically secure, but sufficient for distinguishing values and significantly faster. Some techniques Dask tries are CityHash, xxHash, and MurmurHash. More details in PR ( dask/dask#2377 ). Of these, only xxHash appears able to handle hashing incrementally via its Python library. Could also look at CRC algorithms as well.

Finally, it is worth looking at hash trees like Merkle trees (what Dask uses 😉). These compute hashes for leaf nodes and then compute hashes for all other nodes from the concatenated hashes of their children. As can be imagined, this can be much faster than simply computing the hash of everything, for two reasons. First, parallelism is now an option. Second, hashes typically pass a rolling window over the data, but breaking the data into chunks and hashing the chunks separately effectively removes some of the regions this window would roll over. This could be combined with any of the techniques above. We could even get Dask to do this work for us by loading the chunks directly into Dask.
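A rough sketch of the chunk-wise idea, assuming the third-party xxhash package; this is not how Dask computes its tokens internally, just an illustration of hashing chunks independently and then hashing the concatenated digests:

```python
# Hash each chunk's bytes separately, then hash the concatenation of the
# per-chunk digests (a two-level hash tree). The per-chunk hashes could be
# computed in parallel, e.g. by loading the chunks into Dask first.
import itertools

import numpy as np
import xxhash
import zarr

def chunkwise_digest(arr: zarr.Array) -> str:
    root = xxhash.xxh64()
    grid = [range(0, size, chunk) for size, chunk in zip(arr.shape, arr.chunks)]
    for origin in itertools.product(*grid):
        sel = tuple(slice(o, o + c) for o, c in zip(origin, arr.chunks))
        block = np.ascontiguousarray(arr[sel])
        root.update(xxhash.xxh64(block.tobytes()).digest())
    return root.hexdigest()
```

Loading the chunks into a dask array (e.g. da.from_array(z, chunks=z.chunks)) would give the parallelism mentioned above essentially for free.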
Any thoughts on how best to tackle this one, @mrocklin?
In cases where zarr arrays point to very large datasets, hashing all of the data is probably not feasible. In such cases you would probably want to defer to the underlying storage mechanism, which often has hashes pre-computed (S3 and GS do this). This would probably require that we defer this operation to the underlying MutableMapping, which would require changes in a number of upstream libraries.
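A hypothetical sketch of deferring to the storage layer, assuming an s3fs-backed store: S3 attaches an ETag to every object at upload time, so no data needs to be re-read. The listing field names vary across s3fs versions, so treat the details as illustrative rather than an existing API:

```python
# Tokenize a zarr prefix on S3 from object metadata (key + ETag) rather
# than from the data itself. ETag / md5Hash values are computed server-side
# by S3 / GCS when the objects are written.
import s3fs
from dask.base import tokenize

def tokenize_s3_prefix(fs: s3fs.S3FileSystem, root: str) -> str:
    listing = fs.ls(root, detail=True)
    meta = sorted(
        (entry.get("name") or entry.get("Key"), entry.get("ETag"))
        for entry in listing
    )
    return tokenize(meta)
```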
That's a good point! I currently lack a clear picture of the scale at which other people use Zarr. Getting precomputed hashes from cloud storage services (when they are used) sounds like a great idea. Fixing multiple upstream libraries sounds tricky; how would you propose tackling that? Do you have an API in mind? And how do we handle hashing for stores that don't come with precomputed hashes?

On a related note, I had been thinking we might be able to hash chunks as they are written, delaying the hashing of the full array until it is read from, but using the precomputed chunk hashes to do so. This avoids having to hash everything from scratch each time. Not sure how well this scales offhand.
I would implement __dask_tokenize__ on the underlying store. Each of the upstream libraries (s3fs, gcsfs, hdfs3) would implement something similar for their mappings. For DirectoryStore I would probably tokenize by the filenames contained within the directory and their modification times. For ZipStore I would probably tokenize by the filename and the modification time.

I imagine you could store hashes within Zarr metadata itself. This would presumably be an extension of the protocol.
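A minimal sketch of the DirectoryStore suggestion, assuming a __dask_tokenize__ hook on a store subclass (zarr's DirectoryStore does not define this today): tokenize by relative file names and modification times instead of file contents.

```python
import os

import zarr
from dask.base import tokenize

class TokenizableDirectoryStore(zarr.DirectoryStore):
    def __dask_tokenize__(self):
        # Walk the directory and collect (relative path, mtime) pairs;
        # touching any file changes its mtime and therefore the token.
        entries = []
        for dirpath, _, filenames in os.walk(self.path):
            for name in filenames:
                full = os.path.join(dirpath, name)
                entries.append(
                    (os.path.relpath(full, self.path), os.path.getmtime(full))
                )
        return tokenize(sorted(entries))
```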
Alright, thinking about how to consolidate these cases a bit. What if we simply add the modified time to the metadata? Side note: we could also add the modified time to…
We might also want to give some thought to how…
Does this mean that all Zarr operations would be responsible for updating the modification time metadata in the zarr dataset?
Any writing operation, yes. It shouldn't be too expensive since we would be writing anyway and this is just the time. Though if we do this in…
Yeah, I can imagine that this would be difficult to accomplish safely and efficiently at scale.
Would be good to have __dask_tokenize__ methods added to the Array and possibly Group classes. These methods are used to generate unique identifiers for different Dask objects. By default, the identifiers will be totally random if the objects in question don't define this method or register some function for making this determination with Dask. Having these methods will allow reuse of Zarr Arrays that are loaded repeatedly. Also this will be helpful for some applications, like caching, where deterministic hashes are needed.

ref: http://dask.pydata.org/en/latest/custom-collections.html#deterministic-hashing
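For concreteness, a small illustration of the behaviour described above, assuming a Dask version that falls back to a random token for objects it does not know how to normalize:

```python
# Without __dask_tokenize__ or a registered normalizer, Dask may fall back
# to a random token for zarr.Array, so the same array tokenizes differently
# each time and graphs built from it cannot be shared or cached.
import dask.array as da
import zarr
from dask.base import tokenize

z = zarr.zeros((100, 100), chunks=(10, 10), dtype="f8")
print(tokenize(z) == tokenize(z))  # False when the fallback token is random

a = da.from_array(z, chunks=z.chunks)
b = da.from_array(z, chunks=z.chunks)
print(a.name == b.name)  # likewise False: no reuse between the two wrappers
```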