
Some queries #1173 (Closed)

JoyMonteiro opened this issue Dec 19, 2016 · 11 comments

@JoyMonteiro

Hello @shoyer @pwolfram @mrocklin @rabernat,

I was trying to write a design/requirements doc with reference to the Columbia meetup,
and I had a few queries on which I wanted your inputs (basically, to ask whether
they make sense or not!):

  1. If you serialize a labeled n-d data array using netCDF or HDF5, it gets written into
    a single file, which is not really a good option if you want to eventually do distributed
    processing of the data. Things like HDFS/Lustre can split files, but that is not really
    what we want. How do you think this issue could be solved within the xarray+dask
    framework?
    • Is it a matter of adding some code to the dataset.to_netcdf() method, or
      adding a new method that would split the DataArray (based on some user guidelines) into multiple files?
    • Or does it make more sense to add a new serialization format like Zarr?
  2. Continuing along similar lines, how does xarray+dask currently decide how to distribute the workload between dask workers? Are there any heuristics to handle data locality, or does experience say that network I/O is fast enough that this is not an issue? I'm asking because of this article by Matt: http://blaze.pydata.org/blog/2015/10/28/distributed-hdfs/
    • If this is desirable, how would one go about implementing it?
@shoyer
Member

shoyer commented Dec 19, 2016

For (1), take a look at save_mfdataset for saving to multiple files.
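For readers following the thread, here is a minimal sketch of how save_mfdataset splits a dataset across files; the variable name, grouping by year, and output paths are illustrative, not from the thread:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small example dataset spanning two years (names and sizes are illustrative).
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), np.random.rand(730, 4, 8))},
    coords={"time": pd.date_range("2015-01-01", periods=730)},
)

# Split along time into one dataset per year and write each piece to its own file.
years, datasets = zip(*ds.groupby("time.year"))
paths = [f"temperature_{year}.nc" for year in years]
xr.save_mfdataset(datasets, paths)
```

The pieces can then be read back lazily with xr.open_mfdataset("temperature_*.nc").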

@JoyMonteiro
Author

I did not know about that, thanks!!

@JoyMonteiro
Author

@shoyer: does this also work with dask.distributed? The docs seem to mention only a thread pool.

@shoyer
Member

shoyer commented Dec 19, 2016

I don't know if anyone has tested writing netCDFs with dask.distributed yet. I suspect the only immediate issue would be errors from dask because threading.Lock() can't be pickled. We need to switch to using dask's SerializableLock() to make dask.distributed work smoothly.
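A small sketch of the pickling issue described above (not xarray code, just an illustration of why a plain threading.Lock trips up dask.distributed while dask's SerializableLock does not):

```python
import pickle
import threading

from dask.utils import SerializableLock

# A plain threading.Lock cannot be serialized, so a task graph that closes
# over one cannot be shipped to dask.distributed workers.
try:
    pickle.dumps(threading.Lock())
except TypeError as err:
    print("threading.Lock:", err)

# SerializableLock pickles cleanly; each process ends up with its own lock,
# which is enough to keep HDF5 access thread-safe within that process.
lock = SerializableLock()
restored = pickle.loads(pickle.dumps(lock))
with restored:
    pass  # behaves like an ordinary lock inside a process
```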

@JoyMonteiro
Author

Thanks. How big of an endeavour is this? I see some free time from the 2nd-3rd week of Jan, and
I could maybe contribute towards making this happen.

@mrocklin
Contributor

Some related issues:

#798
dask/distributed#629

Will write more shortly.

@mrocklin
Contributor

There have been some efforts and progress in using many NetCDF files on a distributed POSIX filesystem (NFS, Gluster; not HDFS), but there is still some pain here. We should probably circle back up and figure out what still needs to be done (do you have a firm understanding of this, @shoyer?)

HDF5 on HDFS is, I suspect, sufficiently painful that I would be more tempted either to avoid HDFS or to try other formats like Zarr (which I'm somewhat biased towards) (cc @alimanfoo). However, my experience has been that most climate data lives on a POSIX file system, so experimentation here may not be a high priority.

@JoyMonteiro if you have time, then the first thing to do is probably to start using things and report where they're broken. I'm confident that small things will present themselves quickly :)
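To make the Zarr suggestion above concrete, here is a minimal sketch of writing a chunked dask array straight into a Zarr store (xarray's own Zarr backend came later); the shapes, chunk sizes, and path are illustrative:

```python
import dask.array as da
import zarr

# A chunked array standing in for data pulled out of an xarray object.
arr = da.random.random((1000, 1000), chunks=(250, 250))

# Create a Zarr array on disk with matching chunking, then store into it.
# Each chunk lands in its own object on disk, so writes need no global
# HDF5-style lock.
target = zarr.open(
    "example.zarr", mode="w",
    shape=arr.shape, chunks=(250, 250), dtype=arr.dtype,
)
da.store(arr, target)
```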

@JoyMonteiro
Author

Playing around with things sounds like much more fun :) I can see how this will be useful; I will start thinking of some test cases to code.

@pwolfram
Contributor

@JoyMonteiro and @shoyer, as I've been thinking about this more, especially regarding #463, I was planning on building on opener from #1128 to essentially open, read, and then close a file each time a get (read) operation was needed on a netCDF file. My initial view was that output would fundamentally be serial, but as @JoyMonteiro points out, there may be a benefit to making a provision for parallel output. However, we will probably run into the same netCDF limitation on the number of open files. Would we want similar functionality in opener for the set as well as the get methods? I'm not sure how something like sync would work in this context and suspect this could lead to problems. Presumably we would require writing each dimension, attribute, variable, etc. at each call, with its own associated open, write, and close. I obviously need to find the time to dig into this further...
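A rough sketch of the per-access open/read/close pattern described above, assuming a plain netCDF4-python backend; the helper name is hypothetical and not taken from #1128:

```python
import netCDF4

def read_variable_slice(path, name, key):
    """Open the file only for the duration of one read, then close it again,
    so the process never exceeds the limit on simultaneously open files."""
    with netCDF4.Dataset(path, mode="r") as nc:
        return nc.variables[name][key]

# usage (file and variable names are illustrative):
# chunk = read_variable_slice("temperature_2015.nc", "temperature", slice(0, 10))
```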

shoyer added a commit to shoyer/xarray that referenced this issue Dec 22, 2016
…writing

Fixes pydata#1172

The serializable lock will be useful for dask.distributed or multi-processing
(xref pydata#798, pydata#1173, among others).
@shoyer
Member

shoyer commented Dec 22, 2016

#1179 will make use of SerializableLock() for our default netCDF4/HDF5 lock.
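The shape of that change, roughly (an illustrative sketch, not the exact xarray source): prefer a picklable lock when dask is importable and fall back to a plain threading.Lock otherwise.

```python
import threading

# Illustrative only: the name HDF5_LOCK is hypothetical here, not necessarily
# what #1179 uses in the xarray backends.
try:
    from dask.utils import SerializableLock
    HDF5_LOCK = SerializableLock()
except ImportError:
    HDF5_LOCK = threading.Lock()
```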

shoyer added a commit that referenced this issue Jan 4, 2017
…ing (#1179)

* Switch to shared Lock (SerializableLock if possible) for reading and writing

Fixes #1172

The serializable lock will be useful for dask.distributed or multi-processing
(xref #798, #1173, among others).

* Test serializable lock

* Use conda-forge for builds

* remove broken/fragile .test_lock
@jhamman
Member

jhamman commented Jan 13, 2019

Closing this old issue. We've taken care of most of the issues discussed here through the various backend updates over the past two years.
