
DataArray.to_csv() #2289

Closed · crusaderky opened this issue Jul 15, 2018 · 6 comments · Fixed by #2746

Comments


crusaderky commented Jul 15, 2018

I'm using xarray to aggregate 38 GB worth of NetCDF data into a bunch of CSV reports.
I have two problems:

  1. The reports are 500,000 rows by 2,000 columns. Before somebody says "if you're using CSV for this size of data you're doing it wrong" - yes, I know, but it was the only way to make the data accessible to a bunch of people that only know how to use Excel and VBA. 😫
    The sheer size of the reports means that (a) it's unsavory to keep the whole thing in RAM, and (b) pandas.DataFrame.to_csv() takes ages to complete, as it's single-threaded. The slowness is compounded by the fact that I have to compress everything with gzip.
  2. I have to produce up to 40 reports from the exact same NetCDF files. I use dask to perform the computation, and different reports share a large amount of intermediate graph nodes. So I need to do everything in a single invocation to dask.compute() to allow the dask scheduler to de-duplicate the nodes.

To solve both problems, I wrote a new function:
http://xarray-extras.readthedocs.io/en/latest/api/csv.html
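The core trick can be sketched with the standard library alone (names and chunk sizes below are illustrative, not xarray_extras' actual implementation): serialize each chunk of rows to CSV independently, gzip-compress the chunks in parallel, and concatenate the resulting gzip members into a single file, since a sequence of gzip members decompresses as one continuous stream.

```python
import csv
import gzip
import io
from concurrent.futures import ThreadPoolExecutor


def chunk_to_gzip_member(rows):
    """Render one chunk of rows as CSV text and gzip-compress it."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return gzip.compress(buf.getvalue().encode())


def to_csv_gz(rows, path, chunk_size=2):
    """Write rows to a single .csv.gz, compressing chunks in parallel."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor() as pool:
        # zlib releases the GIL during compression, so threads help here;
        # the real library goes further and uses subprocesses for the
        # GIL-holding pandas serialization step
        members = pool.map(chunk_to_gzip_member, chunks)
        with open(path, "wb") as f:
            for member in members:
                f.write(member)


to_csv_gz([[1, 2], [3, 4], [5, 6]], "report.csv.gz")
```

Reading the file back with any gzip tool yields the rows in order, as if they had been compressed in one pass.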

And now my high level wrapper code looks like this:

```python
# Dataset from 200 .nc files, with a total of 500000 points on the 'row' dimension
nc = xarray.open_mfdataset('inputs.*.nc')
reports = [
    # DataArrays with shape (500000, 2000), with the rows split in 200 chunks
    gen_report0(nc),
    gen_report1(nc),
    ...
    gen_report39(nc),
]
futures = [
    # dask.delayed objects
    to_csv(reports[0], 'report0.csv.gz', compression='gzip'),
    to_csv(reports[1], 'report1.csv.gz', compression='gzip'),
    ...
    to_csv(reports[39], 'report39.csv.gz', compression='gzip'),
]
dask.compute(*futures)
```

The function is currently production quality in xarray-extras, but it would be very easy to refactor it as a method of xarray.DataArray in the main library.

Opinions?


shoyer commented Jul 16, 2018

Interesting. Would it be equivalent to export to a dask dataframe and write that to CSVs, e.g., xarray.concat(reports, dim='col').to_dask_dataframe().to_csv(...)? Or is there some reason why that would be slower/less efficient?

crusaderky commented

I assume you mean report.to_dataset('columns').to_dask_dataframe().to_csv(...)?

There are several problems with that:

  1. it doesn't support a MultiIndex on the first dimension, which I need. It could be worked around but only at the cost of a lot of ugly hacking.
  2. it doesn't support writing to a single file, which means I'd need to manually reassemble the file afterwards, which translates to both more code and either I/O ops or RAM sacrificed to /dev/shm.
  3. from my benchmarks, it's 12 to 20 times slower than my implementation. I did not analyse it and I'm completely unfamiliar with dask.dataframe, so I'm not sure where the bottleneck is, but the fact that it doesn't fork into subprocesses (while pandas.DataFrame.to_csv() does not release the GIL) makes me suspicious.

benchmarks: https://gist.github.com/crusaderky/89819258ff960d06136d45526f7d05db


shoyer commented Jul 17, 2018

> I assume you mean report.to_dataset('columns').to_dask_dataframe().to_csv(...)?

Yes, something like this :).

> it doesn't support a MultiIndex on the first dimension, which I need. It could be worked around but only at the cost of a lot of ugly hacking.

By default (if set_index=False), xarray will put variables in separate columns rather than a MultiIndex when converting into a dask dataframe. So this should work fine for exporting to CSV. I'm pretty sure you don't actually need a MultiIndex on each CSV chunk, since you could just pass index=False in to_csv() instead.
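The index=False point can be seen with plain pandas (a toy frame, not data from the thread): the MultiIndex levels are simply omitted from the output.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1, 2], "y": [3, 4]},
    index=pd.MultiIndex.from_tuples([("a", 1), ("a", 2)], names=["k", "n"]),
)

# index=True (the default) writes the MultiIndex levels as leading columns;
# index=False drops the index entirely
with_index = df.to_csv()
without_index = df.to_csv(index=False)
```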

We could also potentially add a dask equivalent to the DataArray.to_pandas() method, which would preserve the dimensionality of the argument (e.g., a 2D DataArray directly to a 2D dask DataFrame).

>   1. it doesn't support writing to a single file, which means I'd need to manually reassemble the file afterwards, which translates to both more code and either I/O ops or RAM sacrificed to /dev/shm.
>   2. from my benchmarks, it's 12 to 20 times slower than my implementation. I did not analyse it and I'm completely unfamiliar with dask.dataframe, so I'm not sure where the bottleneck is, but the fact that it doesn't fork into subprocesses (while pandas.DataFrame.to_csv() does not release the GIL) makes me suspicious.

Both of these look like improvements that would be welcome in dask.dataframe, and benefit far more users there than downstream in xarray.

I have been intentionally trying to push more complex code related to distributed computing (e.g., queues and subprocesses) upstream to dask. So far, we have avoided all uses of explicit task graphs in xarray, and have only used dask.delayed in a few places.

crusaderky commented

Thing is, I don't know if performance on dask.dataframe is fixable without drastically changing its design. Also while I think dask.array is an amazing building block of xarray, dask.dataframe does feel quite redundant to me...


shoyer commented Jul 17, 2018

> Thing is, I don't know if performance on dask.dataframe is fixable without drastically changing its design.

I suppose we could at least ask?

> Also while I think dask.array is an amazing building block of xarray, dask.dataframe does feel quite redundant to me...

I agree somewhat, but I hope you also understand my reluctance to grow CSV export and distributed computing logic directly in xarray :). Distributed CSV writing is very clearly in scope for dask.dataframe.

If we can push this core logic into dask somewhere, I would welcome a thin to_csv() method in xarray that simply calls the underlying dask method.


shoyer commented Jul 17, 2018

I would also be very happy to reference xarray_extras specifically (even including an example) for parallel CSV export in the relevant section of our docs, which could be renamed "CSV and other tabular formats".

dcherian mentioned this issue Feb 5, 2019
dcherian pushed a commit to dcherian/xarray that referenced this issue Feb 5, 2019
dcherian pushed a commit to dcherian/xarray that referenced this issue Mar 7, 2019
dcherian added a commit that referenced this issue Mar 12, 2019
* Friendlier io title.

* Fix lists.

* Fix *args, **kwargs

"inline emphasis..."

* misc

* Reference xarray_extras for csv writing. Closes #2289

* Add metpy accessor. Closes #461

* fix transpose docstring. Closes #2576

* Revert "Fix lists."

This reverts commit 39983a5.

* Revert "Fix *args, **kwargs"

This reverts commit 1b9da35.

* Add MetPy to related projects.

* Add Weather and Climate specific page.

* Add hvplot.

* Note open_dataset, mfdataset open files as read-only (closes #2345).

* Update metpy 1

Co-Authored-By: dcherian <dcherian@users.noreply.github.com>

* Update doc/weather-climate.rst

Co-Authored-By: dcherian <dcherian@users.noreply.github.com>
pletchm pushed a commit to pletchm/xarray that referenced this issue Mar 21, 2019
pletchm pushed a commit to pletchm/xarray that referenced this issue Mar 21, 2019
shoyer pushed a commit that referenced this issue Mar 26, 2019