API for N-dimensional combine (#2616)
* concatenates along a single dimension

* Wrote function to find correct tile_IDs from nested list of datasets

* Wrote function to check that combined_tile_ids structure is valid

* Added test of 2d-concatenation

* Tests now check that dataset ordering is correct

* Test concatenation along a new dimension

* Started generalising auto_combine to N-D by integrating the N-D concatenation algorithm

* All unit tests now passing

* Fixed a failing test which I didn't notice because I don't have pseudoNetCDF

* Began updating open_mfdataset to handle N-D input

* Refactored to remove duplicate logic in open_mfdataset & auto_combine

* Implemented Shoyer's suggestion in #2553 to rewrite the recursive nested list traverser as an iterator

* --amend

* Now raises ValueError if input not ordered correctly before concatenation

* Added some more prototype tests defining desired behaviour more clearly

* Now raises informative errors on invalid forms of input

* Refactoring to also merge along each dimension

* Refactored to literally just apply the old auto_combine along each dimension

* Added unit tests for open_mfdataset

* Removed TODOs

* Removed format strings

* test_get_new_tile_ids now doesn't assume dicts are ordered

* Fixed failing tests on python3.5 caused by accidentally assuming dict was ordered

* Test for getting new tile id

* Fixed itertoolz import so that it's compatible with older versions

* Increased test coverage

* Added toolz as an explicit dependency to pass tests on python2.7

* Updated 'what's new'

* No longer attempts to shortcut all concatenation at once if concat_dims=None

* Rewrote using itertools.groupby instead of toolz.itertoolz.groupby to remove hidden dependency on toolz

* Fixed erroneous removal of utils import

* Updated docstrings to include an example of multidimensional concatenation

* Clarified auto_combine docstring for N-D behaviour

* Added unit test for nested list of Datasets with different variables

* Minor spelling and pep8 fixes

* Started working on a new api with both auto_combine and manual_combine

* Wrote basic function to infer concatenation order from coords.

Needs better error handling though.

* Attempt at finalised version of public-facing API.

All the internals still need to be redone to match though.

* No longer uses entire old auto_combine internally, only concat or merge

* Updated what's new

* Removed unneeded addition to what's new for old release

* Fixed incomplete merge in docstring for open_mfdataset

* Tests for manual combine passing

* Tests for auto_combine now passing

* xfailed weird behaviour with manual_combine trying to determine concat_dim

* Add auto_combine and manual_combine to API page of docs

* Tests now passing for open_mfdataset

* Completed merge so that #2648 is respected, and added tests.

Also moved concat to its own file to avoid a circular dependency

* Separated the tests for concat and both combines

* Some PEP8 fixes

* Pre-empting a test which will fail with opening uamiv format

* Satisfy pep8speaks bot

* Python 3.5 compatible after changing some error string formatting

* Order coords using pandas.Index objects

* Fixed performance bug from GH #2662

* Removed ToDos about natural sorting of string coords

* Generalized auto_combine to handle monotonically-decreasing coords too

* Added more examples to docstring for manual_combine

* Added note about globbing aspect of open_mfdataset

* Removed auto-inferring of concatenation dimension in manual_combine

* Added example to docstring for auto_combine

* Minor correction to docstring

* Another very minor docstring correction

* Added test to guard against issue #2777

* Started deprecation cycle for auto_combine

* Fully reverted open_mfdataset tests

* Updated what's new to match deprecation cycle

* Reverted uamiv test

* Removed dependency on itertools

* Deprecation tests fixed

* Satisfy pycodestyle

* Started deprecation cycle of auto_combine

* Added specific error for edge case combine_manual can't handle

* Check that global coordinates are monotonic

* Highlighted weird behaviour when concatenating with no data variables

* Added test for impossible-to-auto-combine coordinates

* Removed unneeded test

* Satisfy linter

* Added airspeedvelocity benchmark for combining functions

* Benchmark will take longer now

* Updated version numbers in deprecation warnings to fit with recent release of 0.12

* Updated api docs for new function names

* Fixed docs build failure

* Revert "Fixed docs build failure"

This reverts commit ddfc6dd.

* Updated documentation with section explaining new functions

* Suppressed deprecation warnings in test suite

* Resolved ToDo by pointing to issue with concat, see #2975

* Various docs fixes

* Slightly renamed tests to match new name of tested function

* Included minor suggestions from shoyer

* Removed trailing whitespace

* Simplified error message for case combine_manual can't handle

* Removed filter for deprecation warnings, and added test for if user doesn't supply concat_dim

* Simple fixes suggested by shoyer

* Change deprecation warning behaviour

* linting
TomNicholas authored and shoyer committed Jun 25, 2019
1 parent 76adf13 commit 6b33ad8
Showing 13 changed files with 2,066 additions and 1,077 deletions.
37 changes: 37 additions & 0 deletions asv_bench/benchmarks/combine.py
@@ -0,0 +1,37 @@
import numpy as np
import xarray as xr


class Combine:
    """Benchmark concatenating and merging large datasets"""

    def setup(self):
        """Create 4 datasets with two different variables"""

        t_size, x_size, y_size = 100, 900, 800
        t = np.arange(t_size)
        data = np.random.randn(t_size, x_size, y_size)

        self.dsA0 = xr.Dataset(
            {'A': xr.DataArray(data, coords={'T': t},
                               dims=('T', 'X', 'Y'))})
        self.dsA1 = xr.Dataset(
            {'A': xr.DataArray(data, coords={'T': t + t_size},
                               dims=('T', 'X', 'Y'))})
        self.dsB0 = xr.Dataset(
            {'B': xr.DataArray(data, coords={'T': t},
                               dims=('T', 'X', 'Y'))})
        self.dsB1 = xr.Dataset(
            {'B': xr.DataArray(data, coords={'T': t + t_size},
                               dims=('T', 'X', 'Y'))})

    def time_combine_manual(self):
        datasets = [[self.dsA0, self.dsA1], [self.dsB0, self.dsB1]]

        xr.combine_manual(datasets, concat_dim=[None, 't'])

    def time_auto_combine(self):
        """Also has to load and arrange t coordinate"""
        datasets = [self.dsA0, self.dsA1, self.dsB0, self.dsB1]

        xr.combine_auto(datasets)
3 changes: 3 additions & 0 deletions doc/api.rst
@@ -19,6 +19,9 @@ Top-level functions
broadcast
concat
merge
auto_combine
combine_auto
combine_manual
where
set_options
full_like
78 changes: 76 additions & 2 deletions doc/combining.rst
@@ -11,9 +11,10 @@ Combining data
import xarray as xr
np.random.seed(123456)
* For combining datasets or data arrays along a dimension, see concatenate_.
* For combining datasets or data arrays along a single dimension, see concatenate_.
* For combining datasets with different variables, see merge_.
* For combining datasets or data arrays with different indexes or missing values, see combine_.
* For combining datasets or data arrays along multiple dimensions, see combining.multi_.

.. _concatenate:

@@ -77,7 +78,7 @@ Merge
~~~~~

To combine variables and coordinates between multiple ``DataArray`` and/or
``Dataset`` object, use :py:func:`~xarray.merge`. It can merge a list of
``Dataset`` objects, use :py:func:`~xarray.merge`. It can merge a list of
``Dataset``, ``DataArray`` or dictionaries of objects convertible to
``DataArray`` objects:

@@ -237,3 +238,76 @@ coordinates as long as any non-missing values agree or are disjoint:
Note that due to the underlying representation of missing values as floating
point numbers (``NaN``), variable data type is not always preserved when merging
in this manner.

.. _combining.multi:

Combining along multiple dimensions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

    There are currently three combining functions with similar names:
    :py:func:`~xarray.auto_combine`, :py:func:`~xarray.combine_auto`, and
    :py:func:`~xarray.combine_manual`. This is because
    ``auto_combine`` is in the process of being deprecated in favour of the other
    two functions, which are more general. If your code currently relies on
    ``auto_combine``, then you will be able to get similar functionality by using
    ``combine_manual``.

For combining many objects along multiple dimensions, xarray provides
:py:func:`~xarray.combine_manual` and :py:func:`~xarray.combine_auto`. These
functions use a combination of ``concat`` and ``merge`` across different
variables to combine many objects into one.

:py:func:`~xarray.combine_manual` requires specifying the order in which the
objects should be combined, while :py:func:`~xarray.combine_auto` attempts to
infer this ordering automatically from the coordinates in the data.

:py:func:`~xarray.combine_manual` is useful when you know the spatial
relationship between each object in advance. The datasets must be provided in
the form of a nested list, which specifies their relative position and
ordering. A common task is collecting data from a parallelized simulation where
each processor wrote out data to a separate file. A domain which was decomposed
into 4 parts, 2 each along both the x and y axes, requires organising the
datasets into a doubly-nested list, e.g.:

.. ipython:: python

    arr = xr.DataArray(name='temperature', data=np.random.randint(5, size=(2, 2)), dims=['x', 'y'])
    arr
    ds_grid = [[arr, arr], [arr, arr]]
    xr.combine_manual(ds_grid, concat_dim=['x', 'y'])

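
The position of each object in the nested list is what tells ``combine_manual`` where that object sits in the overall grid. As a rough illustration only (plain Python, not xarray's internal code; the helper name ``iter_tile_ids`` is hypothetical), a nested list can be flattened into a mapping from index tuples ("tile IDs") to the objects stored at those positions:

```python
def iter_tile_ids(nested, prefix=()):
    """Yield (tile_id, item) pairs from an arbitrarily nested list.

    Hypothetical helper: tile_id is the tuple of list indices
    needed to reach the item.
    """
    if isinstance(nested, list):
        for index, sub in enumerate(nested):
            yield from iter_tile_ids(sub, prefix + (index,))
    else:
        yield prefix, nested


# Strings stand in for datasets; names encode the intended grid position.
grid = [["x0y0", "x0y1"], ["x1y0", "x1y1"]]
tile_ids = dict(iter_tile_ids(grid))
```

Here the first index of each tuple corresponds to the outer list (``'x'`` in the example above) and the second to the inner lists (``'y'``).
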
:py:func:`~xarray.combine_manual` can also be used to explicitly merge datasets
with different variables. For example if we have 4 datasets, which are divided
along two times, and contain two different variables, we can pass ``None``
to ``'concat_dim'`` to specify the dimension of the nested list over which
we wish to use ``merge`` instead of ``concat``:

.. ipython:: python

    temp = xr.DataArray(name='temperature', data=np.random.randn(2), dims=['t'])
    precip = xr.DataArray(name='precipitation', data=np.random.randn(2), dims=['t'])
    ds_grid = [[temp, precip], [temp, precip]]
    xr.combine_manual(ds_grid, concat_dim=['t', None])

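
To make the role of ``None`` concrete, the following is a simplified sketch of the idea of combining one nesting level at a time, concatenating where a dimension name is given and merging where ``None`` is given. It is plain Python under stated assumptions, not xarray's implementation: "datasets" are dicts mapping variable names to lists, and ``concat``, ``merge`` and ``combine_nd`` are hypothetical stand-ins.

```python
from itertools import groupby


def concat(datasets):
    """Stand-in for concatenation: join same-named variables end to end."""
    out = {}
    for ds in datasets:
        for name, values in ds.items():
            out.setdefault(name, []).extend(values)
    return out


def merge(datasets):
    """Stand-in for merging: union of variables, rejecting conflicts."""
    out = {}
    for ds in datasets:
        for name, values in ds.items():
            if name in out and out[name] != values:
                raise ValueError("conflicting values for " + name)
            out[name] = list(values)
    return out


def combine_nd(tiles, concat_dims):
    """Combine a {tile_id: dataset} mapping, one nesting level per entry."""
    for dim in concat_dims:
        operation = merge if dim is None else concat
        # Sort so tiles sharing the same trailing indices are adjacent
        # and ordered by their leading index.
        ordered = sorted(tiles.items(),
                         key=lambda item: (item[0][1:], item[0][0]))
        tiles = {
            rest: operation([ds for _, ds in group])
            for rest, group in groupby(ordered, key=lambda item: item[0][1:])
        }
    (result,) = tiles.values()
    return result


# Same layout as ds_grid above: outer list along 't', inner list merged.
tiles = {
    (0, 0): {"temperature": [1, 2]},
    (0, 1): {"precipitation": [5, 6]},
    (1, 0): {"temperature": [3, 4]},
    (1, 1): {"precipitation": [7, 8]},
}
combined = combine_nd(tiles, concat_dims=["t", None])
```

In this sketch the first entry of ``concat_dims`` applies to the outermost list, matching the ``['t', None]`` call above.
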
:py:func:`~xarray.combine_auto` is for combining objects which have dimension
coordinates which specify their relationship to and order relative to one
another, for example a linearly-increasing 'time' dimension coordinate.

Here we combine two datasets using their common dimension coordinates. Notice
they are concatenated in order based on the values in their dimension
coordinates, not on their position in the list passed to ``combine_auto``.

.. ipython:: python
    :okwarning:

    x1 = xr.DataArray(name='foo', data=np.random.randn(3), coords=[('x', [0, 1, 2])])
    x2 = xr.DataArray(name='foo', data=np.random.randn(3), coords=[('x', [3, 4, 5])])
    xr.combine_auto([x2, x1])

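
The ordering step can be pictured with a small pure-Python sketch. This is an illustration under simplifying assumptions, not xarray's algorithm: each piece is a ``(coords, data)`` pair of lists, only increasing coordinates are handled, and the function name ``order_by_first_coord`` is hypothetical.

```python
def order_by_first_coord(pieces):
    """Concatenate (coords, data) pieces in order of their first coordinate.

    Raises ValueError if the joined coordinate is not strictly increasing,
    which is how overlapping pieces show up in this sketch.
    """
    ordered = sorted(pieces, key=lambda piece: piece[0][0])
    coords = [c for piece_coords, _ in ordered for c in piece_coords]
    data = [d for _, piece_data in ordered for d in piece_data]
    if any(a >= b for a, b in zip(coords, coords[1:])):
        raise ValueError("combined coordinate is not monotonically increasing")
    return coords, data


x1 = ([0, 1, 2], ["a", "b", "c"])
x2 = ([3, 4, 5], ["d", "e", "f"])
# List order is [x2, x1], but the result follows the coordinate values.
coords, data = order_by_first_coord([x2, x1])
```
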
These functions can be used by :py:func:`~xarray.open_mfdataset` to open many
files as one dataset. The particular function used is specified by setting the
argument ``'combine'`` to ``'auto'`` or ``'manual'``. This is useful for
situations where your data is split across many files in multiple locations,
which have some known relationship between one another.
8 changes: 6 additions & 2 deletions doc/io.rst
@@ -766,7 +766,10 @@ Combining multiple files

NetCDF files are often encountered in collections, e.g., with different files
corresponding to different model runs. xarray can straightforwardly combine such
files into a single Dataset by making use of :py:func:`~xarray.concat`.
files into a single Dataset by making use of :py:func:`~xarray.concat`,
:py:func:`~xarray.merge`, :py:func:`~xarray.combine_manual` and
:py:func:`~xarray.combine_auto`. For details on the difference between these
functions see :ref:`combining data`.

.. note::

Expand All @@ -779,7 +782,8 @@ files into a single Dataset by making use of :py:func:`~xarray.concat`.
This function automatically concatenates and merges multiple files into a
single xarray dataset.
It is the recommended way to open multiple files with xarray.
For more details, see :ref:`dask.io` and a `blog post`_ by Stephan Hoyer.
For more details, see :ref:`combining.multi`, :ref:`dask.io` and a
`blog post`_ by Stephan Hoyer.

.. _dask: http://dask.pydata.org
.. _blog post: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/
21 changes: 21 additions & 0 deletions doc/whats-new.rst
@@ -56,6 +56,23 @@ Enhancements
helpful for avoiding file-lock errors when trying to write to files opened
using ``open_dataset()`` or ``open_dataarray()``. (:issue:`2887`)
By `Dan Nowacki <https://github.com/dnowacki-usgs>`_.
- Combining datasets along N dimensions:
Datasets can now be combined along any number of dimensions,
instead of just a one-dimensional list of datasets.

The new ``combine_manual`` will accept the datasets as a nested
list-of-lists, and combine by applying a series of concat and merge
operations. The new ``combine_auto`` will instead use the dimension
coordinates of the datasets to order them.

``open_mfdataset`` can use either ``combine_manual`` or ``combine_auto`` to
combine datasets along multiple dimensions, by specifying the argument
``combine='manual'`` or ``combine='auto'``.

This means that the original function ``auto_combine`` is being deprecated.
To avoid FutureWarnings, switch to using ``combine_manual`` or ``combine_auto``
(or set the ``combine`` argument in ``open_mfdataset``). (:issue:`2159`)
By `Tom Nicholas <http://github.com/TomNicholas>`_.
- Better warning message when supplying invalid objects to ``xr.merge``
(:issue:`2948`). By `Mathias Hauser <https://github.com/mathause>`_.
- Added ``strftime`` method to ``.dt`` accessor, making it simpler to hand a
@@ -203,6 +220,10 @@ Other enhancements
report showing what exactly differs between the two objects (dimensions /
coordinates / variables / attributes) (:issue:`1507`).
By `Benoit Bovy <https://github.com/benbovy>`_.
- Resampling of standard and non-standard calendars indexed by
:py:class:`~xarray.CFTimeIndex` is now possible. (:issue:`2191`).
By `Jwen Fai Low <https://github.com/jwenfai>`_ and
`Spencer Clark <https://github.com/spencerkclark>`_.
- Add ``tolerance`` option to ``resample()`` methods ``bfill``, ``pad``,
``nearest``. (:issue:`2695`)
By `Hauke Schulz <https://github.com/observingClouds>`_.
3 changes: 2 additions & 1 deletion xarray/__init__.py
@@ -7,7 +7,8 @@

from .core.alignment import align, broadcast, broadcast_arrays
from .core.common import full_like, zeros_like, ones_like
from .core.combine import concat, auto_combine
from .core.concat import concat
from .core.combine import combine_auto, combine_manual, auto_combine
from .core.computation import apply_ufunc, dot, where
from .core.extensions import (register_dataarray_accessor,
register_dataset_accessor)