Why does concatenating zarr.ZipStores with xarray and dask use so much unmanaged memory? #5746
dougiesquire asked this question in Q&A · Unanswered
I’m not sure of the best place to ask my question since it intersects xarray, dask and zarr. Please direct me elsewhere if there is somewhere more appropriate.
I often want to concatenate sets of zarr collections (stored on a file system) and write them back out as a single dataset. Often these zarr collections are zipped to keep my HPC storage admins happy (fewer files = good).
When concatenating many or large zarr.ZipStores, I find that the unmanaged memory (as shown by the dask dashboard) climbs rapidly and often brings down my cluster. This does not happen when the data are not zipped (i.e. zarr.DirectoryStores).
I've included an example below that demonstrates the problem, although note that it is even more pronounced when working with real/larger datasets. Note too that this example creates many zarr files in the current directory, so be careful running it:
Note that in the above example, the total size of all the files being concatenated is 3.3 GB.
My question: Is this expected behaviour? Am I just seeing the additional overhead associated with having to open the zip file? Or is there potentially a `close` or `copy` or something else needed somewhere? Any insight/advice would be greatly appreciated!

Update:
I came across this open PR about closing zarr.ZipStores: #4395. But explicitly closing the ZipStores as follows doesn't make any difference: