Make passing a DataArray for the xarray.concat dim argument equivalent to passing a pandas Index #1646

Open
ceridwen opened this issue Oct 23, 2017 · 8 comments


@ceridwen

Extending from #839, if I'm concatenating some DataArrays using concat,

print(xarray.concat(data, xarray.DataArray(['foo1', 'foo2', 'foo3', 'foo4', 'foo5'], name='stat')))

I get an unnamed dimension without coordinates.

<xarray.DataArray (dim_0: 5, index: 2)>
array([[ 24.841064,   0.750451],
       [ 24.841064,   0.750451],
       [ 19.062874,   0.796722],
       [ 14.9631  ,   0.354273],
       [ 14.9631  ,   0.354273]])
Coordinates:
  * index    (index) object 'Intercept' 'Lvl'
         (dim_0) <U3 'foo1' 'foo2' 'foo3' 'foo4' 'foo5'
Dimensions without coordinates: dim_0

Using a pandas.Index,

print(xarray.concat(data, pandas.Index(['foo1', 'foo2', 'foo3', 'foo4', 'foo5'], name='stat')))
<xarray.DataArray (stat: 5, index: 2)>
array([[ 14.9631  ,   0.354273],
       [ 19.982272,   0.555708],
       [ 14.974026,   0.60658 ],
       [ 24.841064,   0.750451],
       [ 24.841064,   0.750451]])
Coordinates:
  * index    (index) object 'Intercept' 'Lvl'
  * stat     (stat) object 'foo1' 'foo2' 'foo3' 'foo4' 'foo5'

I want the latter result, and I expected to get it when passing a DataArray as well.
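Until the two forms behave the same, one possible workaround is to convert the named DataArray into a named pandas.Index before calling concat, since the Index form is the one shown above to produce the named dimension with coordinates. The sketch below is only an illustration: as_named_index is a hypothetical helper, not part of xarray, and the sample data is made up.

```python
import pandas as pd
import xarray as xr


def as_named_index(labels):
    """Hypothetical helper: turn a named 1-D DataArray into a named pandas.Index."""
    return pd.Index(labels.values, name=labels.name)


# Made-up stand-in for the `data` list used in the report above.
data = [
    xr.DataArray(
        [float(i), float(i) + 0.5],
        dims="index",
        coords={"index": ["Intercept", "Lvl"]},
    )
    for i in range(3)
]

labels = xr.DataArray(["foo1", "foo2", "foo3"], name="stat")

# Passing the Index form: its name becomes the new dimension,
# and its values become the coordinate along that dimension.
combined = xr.concat(data, as_named_index(labels))
print(combined)
```

This relies only on the pandas.Index behaviour demonstrated in the report; it does not change what concat does with a DataArray.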

@shoyer
Member

shoyer commented Oct 23, 2017

Agreed, this should definitely work! (I think the fact that it doesn't is probably related to the relatively recent change that made coordinate labels along dimensions optional.)

@shoyer
Member

shoyer commented Oct 23, 2017

This is the location of the helper function that parses the dim argument:

def _calc_concat_dim_coord(dim):

@dcherian
Contributor

dcherian commented Jan 9, 2018

I have a simple fix in #1812.

With that change, this works.

print(xarray.concat(data, xarray.DataArray(['foo1', 'foo2', 'foo3', 'foo4', 'foo5'], name='stat')))

But if you provide a dimension name, as in

print(xarray.concat(data, xarray.DataArray(['foo1', 'foo2', 'foo3', 'foo4', 'foo5'], dims=['new_dim'], name='stat')))

then new_dim is renamed to stat. What is the desired behaviour in this case: do we preserve the dimension name new_dim or assign the new name stat?

@shoyer
Member

shoyer commented Jan 10, 2018

What is the desired behaviour in this case: do we preserve the dimension name new_dim or assign the new name stat?

Oh, this is trickier than I thought!

The challenge is that once you make the DataArray, there is no good way to know if a default dimension name like 'dim_0' was intentional or not.

The way to handle this currently is to pass a 1-dimensional xarray.Variable object for the dim argument. These don't have separate names, so there's no ambiguity:

In [5]: xarray.concat([xarray.DataArray(1), xarray.DataArray(2)], dim=xarray.Variable('x', [3, 4]))
Out[5]:
<xarray.DataArray (x: 2)>
array([1, 2])
Coordinates:
  * x        (x) int64 3 4

But this is a little verbose. Potentially we could call xarray.as_variable() on tuple inputs (like in the Dataset constructor) so dim=('x', [3, 4]) works.
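As a rough sketch of how the tuple form could look from user code today, the wrapper below converts a (name, values) tuple with xarray.as_variable() and passes the resulting Variable to concat. concat_with_tuple_dim is a hypothetical name used only for this illustration; concat itself is assumed not to accept a tuple for dim.

```python
import xarray as xr


def concat_with_tuple_dim(objs, dim):
    """Hypothetical wrapper: accept dim as a (name, values) tuple."""
    if isinstance(dim, tuple):
        # as_variable turns (dims, data) into a Variable, which carries
        # a dimension name but no separate `name` attribute.
        dim = xr.as_variable(dim)
    return xr.concat(objs, dim=dim)


result = concat_with_tuple_dim(
    [xr.DataArray(1), xr.DataArray(2)],
    ("x", [3, 4]),
)
print(result)
```

Because the tuple is converted to a Variable, this should give the same result as the explicit xarray.Variable('x', [3, 4]) call above.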

@stale

stale bot commented Dec 11, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

@max-sixty
Collaborator

One option here would be to take the name of the DataArray iff the dim name is dim_0. While I'm not a great fan of breaking that abstraction and hardcoding exceptions, it does solve the case quite well:

In [7]: xr.concat([da,da], xr.DataArray(['x', 'y'], name='stat'))
Out[7]:
<xarray.DataArray 'air' (dim_0: 2, time: 2920, lat: 25, lon: 53)>
array([[[[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    ,
          238.59999],
         [243.79999, 244.5    , 244.7    , ..., 232.79999, 235.29999,
          295.19   ],
         [297.69   , 298.09   , 298.09   , ..., 296.49   , 296.19   ,
          295.69   ]]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
  * dim_0    (dim_0) <U1 'x' 'y'  # Very unlikely we even want `dim_0` here

It's very unlikely we even want dim_0 there rather than stat.
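The proposed rule can be sketched as a small name-selection function. This is an illustration of the idea only, not xarray's implementation: pick_concat_dim_name and DEFAULT_DIM are hypothetical names, and the function is duck-typed so it reads nothing but .dims and .name.

```python
from types import SimpleNamespace

DEFAULT_DIM = "dim_0"  # name xarray gives an unnamed first dimension


def pick_concat_dim_name(dim):
    """Hypothetical rule: use `dim.name` only when the dimension looks auto-named."""
    (dim_name,) = dim.dims  # require exactly one dimension
    if dim_name == DEFAULT_DIM and dim.name is not None:
        return dim.name
    return dim_name


# Stand-ins exposing only the two attributes the rule reads.
auto_named = SimpleNamespace(dims=("dim_0",), name="stat")
explicit = SimpleNamespace(dims=("new_dim",), name="stat")

print(pick_concat_dim_name(auto_named))
print(pick_concat_dim_name(explicit))
```

A real DataArray exposes the same two attributes, so it could be passed in place of the stand-ins. The rule still cannot tell an automatically assigned dim_0 from one the user typed deliberately, which is the ambiguity raised earlier in the thread.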

@dcherian
Contributor

dcherian commented Apr 9, 2022

We could also make it only succeed if DataArray.name is None and raise an error otherwise.

Supporting the tuple form also seems like a good idea.
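The stricter variant could look roughly like the check below, which accepts a DataArray-like dim only when its name is None and raises otherwise. strict_concat_dim_name is a hypothetical name and the error message is invented for this sketch; it is duck-typed on .dims and .name rather than tied to xarray.

```python
from types import SimpleNamespace


def strict_concat_dim_name(dim):
    """Hypothetical check: refuse a dim whose `name` could conflict with its dimension."""
    (dim_name,) = dim.dims  # require exactly one dimension
    if dim.name is not None:
        raise ValueError(
            f"ambiguous concat dim: name={dim.name!r} but dimension is {dim_name!r}; "
            "pass an unnamed array or a pandas.Index instead"
        )
    return dim_name


unnamed = SimpleNamespace(dims=("new_dim",), name=None)
print(strict_concat_dim_name(unnamed))
```

Raising instead of guessing avoids silently renaming a dimension, at the cost of rejecting the call from the original report.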

@simonkeys

Just pointing out that the documentation says "If dimension is provided as a DataArray or Index, its name is used as the dimension to concatenate along and the values are added as a coordinate." This does not currently seem to be true for a DataArray.


5 participants