
Some queries #1173 (Closed)

JoyMonteiro opened this issue Dec 19, 2016 · 11 comments

@JoyMonteiro

Hello @shoyer @pwolfram @mrocklin @rabernat,

I was trying to write a design/requirements doc with reference to the Columbia meetup,
and I had a few queries on which I wanted your inputs (basically, to ask whether
they make sense or not!):

  1. If you serialize a labeled n-d data array using netCDF or HDF5, it gets written into
    a single file, which is not really a good option if you want to eventually do distributed
    processing of the data. Things like HDFS/Lustre can split files, but that is not really
    what we want. How do you think this issue could be solved within the xarray+dask
    framework?
    • Is it a matter of adding some code to the dataset.to_netcdf() method, or
      adding a new method that would split the DataArray (based on some user guidelines) into multiple files?
    • Or does it make more sense to add a new serialization format like Zarr?
  2. Continuing along similar lines, how does xarray+dask currently decide how to distribute the workload between dask workers? Are there any heuristics to handle data locality, or does experience say that network I/O is fast enough that this is not an issue? I'm asking because of this article by Matt: http://blaze.pydata.org/blog/2015/10/28/distributed-hdfs/
    • If this is desirable, how would one go about implementing it?
@shoyer
Member

shoyer commented Dec 19, 2016

For (1), take a look at save_mfdataset for saving to multiple files.
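For readers following the thread, here is a minimal sketch of how save_mfdataset splits a dataset across files; the variable name, grouping by year, and output paths are illustrative, not from the thread:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small example dataset spanning two years (names and sizes are illustrative).
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), np.random.rand(730, 4, 8))},
    coords={"time": pd.date_range("2015-01-01", periods=730)},
)

# Split along time into one dataset per year and write each piece to its own file.
years, datasets = zip(*ds.groupby("time.year"))
paths = [f"temperature_{year}.nc" for year in years]
xr.save_mfdataset(datasets, paths)
```

The pieces can then be read back lazily with xr.open_mfdataset("temperature_*.nc").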

@JoyMonteiro
Author

I did not know about that, thanks!!

@JoyMonteiro
Author

@shoyer: does this also work with dask.distributed? The docs seem to mention only a thread pool.

@shoyer
Member

shoyer commented Dec 19, 2016

I don't know if anyone has tested writing netCDFs with dask.distributed yet. I suspect the only immediate issue would be errors from dask because threading.Lock() can't be pickled. We need to switch to using dask's SerializableLock() to make dask.distributed work smoothly.
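A small sketch of the pickling issue described above (not xarray code, just an illustration of why a plain threading.Lock trips up dask.distributed while dask's SerializableLock does not):

```python
import pickle
import threading

from dask.utils import SerializableLock

# A plain threading.Lock cannot be serialized, so a task graph that closes
# over one cannot be shipped to dask.distributed workers.
try:
    pickle.dumps(threading.Lock())
except TypeError as err:
    print("threading.Lock:", err)

# SerializableLock pickles cleanly; each process ends up with its own lock,
# which is enough to keep HDF5 access thread-safe within that process.
lock = SerializableLock()
restored = pickle.loads(pickle.dumps(lock))
with restored:
    pass  # behaves like an ordinary lock inside a process
```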

@JoyMonteiro
Author

Thanks. How big of an endeavour is this? I see some free time from the 2nd-3rd week of Jan, and
I could maybe contribute towards making this happen.

@mrocklin
Contributor

Some related issues:

#798
dask/distributed#629

Will write more shortly.

@mrocklin
Contributor

There have been some efforts and progress in using many NetCDF files on a distributed POSIX filesystem (NFS, Gluster; not HDFS), but there is still some pain here. We should probably circle back up and figure out what still needs to be done (do you have a firm understanding of this, @shoyer?)

HDF5 on HDFS is, I suspect, sufficiently painful that I would be more tempted either to avoid HDFS or to try other formats like Zarr (which I'm somewhat biased towards) (cc @alimanfoo). However, my experience has been that most climate data lives on a POSIX file system, so experimentation here may not be a high priority.

@JoyMonteiro if you have time, then the first thing to do is probably to start using things and report where they're broken. I'm confident that small things will present themselves quickly :)
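To make the Zarr suggestion above concrete, here is a minimal sketch of writing a chunked dask array straight into a Zarr store (xarray's own Zarr backend came later); the shapes, chunk sizes, and path are illustrative:

```python
import dask.array as da
import zarr

# A chunked array standing in for data pulled out of an xarray object.
arr = da.random.random((1000, 1000), chunks=(250, 250))

# Create a Zarr array on disk with matching chunking, then store into it.
# Each chunk lands in its own object on disk, so writes need no global
# HDF5-style lock.
target = zarr.open(
    "example.zarr", mode="w",
    shape=arr.shape, chunks=(250, 250), dtype=arr.dtype,
)
da.store(arr, target)
```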

@JoyMonteiro
Author

Playing around with things sounds like much more fun :) I can see how this will be useful; I will start thinking of some test cases to code.

@pwolfram
Contributor

@JoyMonteiro and @shoyer, as I've been thinking about this more, especially regarding #463, I was planning on building on opener from #1128 to essentially open, read, and then close a file each time a get (read) operation was needed on a netCDF file. My initial view was that output would fundamentally be serial, but as @JoyMonteiro points out, there may be a benefit to making a provision for parallel output. However, we will probably run into the same netCDF limitation on the number of open files. Would we want similar functionality in opener for the set as well as the get methods? I'm not sure how something like sync would work in this context and suspect this could lead to problems. Presumably we would require writing each dimension, attribute, variable, etc. at each call, with its own associated open, write, and close. I obviously need to find the time to dig into this further...
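A rough sketch of the per-access open/read/close pattern described above, assuming a plain netCDF4-python backend; the helper name is hypothetical and not taken from #1128:

```python
import netCDF4

def read_variable_slice(path, name, key):
    """Open the file only for the duration of one read, then close it again,
    so the process never exceeds the limit on simultaneously open files."""
    with netCDF4.Dataset(path, mode="r") as nc:
        return nc.variables[name][key]

# usage (file and variable names are illustrative):
# chunk = read_variable_slice("temperature_2015.nc", "temperature", slice(0, 10))
```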

shoyer added a commit to shoyer/xarray that referenced this issue Dec 22, 2016
…writing

Fixes pydata#1172

The serializable lock will be useful for dask.distributed or multi-processing
(xref pydata#798, pydata#1173, among others).
@shoyer
Member

shoyer commented Dec 22, 2016

#1179 will make use of SerializableLock() for our default netCDF4/HDF5 lock.
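The shape of that change, roughly (an illustrative sketch, not the exact xarray source): prefer a picklable lock when dask is importable and fall back to a plain threading.Lock otherwise.

```python
import threading

# Illustrative only: the name HDF5_LOCK is hypothetical here, not necessarily
# what #1179 uses in the xarray backends.
try:
    from dask.utils import SerializableLock
    HDF5_LOCK = SerializableLock()
except ImportError:
    HDF5_LOCK = threading.Lock()
```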

shoyer added a commit that referenced this issue Jan 4, 2017
…ing (#1179)

* Switch to shared Lock (SerializableLock if possible) for reading and writing

Fixes #1172

The serializable lock will be useful for dask.distributed or multi-processing
(xref #798, #1173, among others).

* Test serializable lock

* Use conda-forge for builds

* remove broken/fragile .test_lock
@jhamman
Member

jhamman commented Jan 13, 2019

Closing this old issue. We've taken care of most of the issues discussed here through the various backend updates over the past two years.
