Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataArray.argsort should be deleted #1635

Open
seth-p opened this issue Oct 17, 2017 · 7 comments
Open

DataArray.argsort should be deleted #1635

seth-p opened this issue Oct 17, 2017 · 7 comments

Comments

@seth-p
Copy link
Contributor

seth-p commented Oct 17, 2017

Originally posted to https://groups.google.com/forum/#!topic/xarray/wsxeiIPLhgM

DataArray.argsort() appears to simply wrap the result of DataArray.values.argsort() in a same-shape DataArray. This is semantically nonsensical. If anything the index on the resulting argsort() values should simply be range(len(da)), but that adds little to the underlying numpy structure. And there's not much reason to have da.argsort() simply return the (raw) result of da.values.argsort(). So really DataArray.argsort() should simply be deleted.

On the other hand a new function DataArray.rank() that wraps da.values.argsort().argsort() (note repeated call to ndarray.argsort()) in the structure of the original DataArray would make sense, and perhaps even be useful... (Note that I'm not claiming that .argsort().argsort() is the fastest way to calculate this, but it's probably good enough, at least for an initial implementation.)

@shoyer
Copy link
Member

shoyer commented Oct 17, 2017

It seems like another reasonable choice would be for DataArray.argsort() to keep its current value but drop coordinate labels along the sorting dimension. This would give a consistent result for use with broadcasting indexing (#1473).

I think this internal utility function is equivalent to your second argsort:

def inverse_permutation(indices):
"""Return indices for an inverse permutation.
Parameters
----------
indices : 1D np.ndarray with dtype=int
Integer positions to assign elements to.
Returns
-------
inverse_permutation : 1D np.ndarray with dtype=int
Integer indices to take from the original array to create the
permutation.
"""
# use intp instead of int64 because of windows :(
inverse_permutation = np.empty(len(indices), dtype=np.intp)
inverse_permutation[indices] = np.arange(len(indices), dtype=np.intp)
return inverse_permutation

In practice, we might use bottleneck.rankdata() or nanrankdata():
https://kwgoodman.github.io/bottleneck-doc/reference.html#non-reduce-with-axis

@seth-p
Copy link
Contributor Author

seth-p commented Oct 18, 2017

I think that makes sense, though I don't quite understand what would go in its place. Another possibility -- perhaps a bad one -- is to permute the values in the sorted dimension so that they match the resulting values (i.e. something like result.coords[dim] = np.take(da.coords[dim].values, result.values, axis=axis)).

Note that ndarray.argsort(axis=None) sorts the flattened array, so the returned DataArray should respect this

Alternative suggestion: have DataArray.argsort() return an ndarray filled with labels from the sorted dimension, i.e. something like:

class DataArray:
    def argsort(self, **kwargs):
        # TODO: update kwargs['axis'] based 'axis' and 'dim', and remove 'dim'
        if kwargs['axis'] is None:
            kwargs['axis'] = -1
            return self.stack(dim=self.dims).argsort(**kwargs)
        return np.take(self.coords[self.dims[kwargs['axis']].values, self.values.argsort(**kwargs))

BTW, I'm just thinking in terms of ndarrays. Someone more knowledgeable than me may want to consider how to make it work intelligently with dask.

@shoyer
Copy link
Member

shoyer commented Oct 18, 2017

Note that ndarray.argsort(axis=None) sorts the flattened array, so the returned DataArray should respect this

I'm not a huge fan of auto-flattening for xarray, but I can see this logic.

Alternative suggestion: have DataArray.argsort() return an ndarray filled with labels from the sorted dimension

I would probably implement a new method for this, maybe idxsort for symmetry with idxmax (#60). Though that name could be read several different ways. Maybe sortby_labels()?

@seth-p
Copy link
Contributor Author

seth-p commented Oct 18, 2017

I'm not a fan of auto-flattening either, but that's what nd.argsort() does...

One option is to have DataArray.arg{min,max,sort}() all take an optional flag argument specifying whether to return integer indices or index labels. But I think my preference would be be to have six separate functions: DataArray.{idx,}arg{min,max,sort}() (or some such nomenclature that includes arg in all six functions).

@stale
Copy link

stale bot commented Oct 22, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

@stale stale bot added the stale label Oct 22, 2019
@seth-p
Copy link
Contributor Author

seth-p commented Oct 23, 2019

I think this issue is still relevant.

@stale stale bot removed the stale label Oct 23, 2019
@anton-seaice
Copy link

I agree that at least the index should show the reflect the order of the ranking found using argsort.

import xarray as xr
import numpy as np

np.random.rand(10)

ds = xr.DataArray(np.random.rand(10) )

ds.argsort().dim_0

returns
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

however it should return an array with the order of the elements ranked using argsort
e.g.

array([3, 8, 5, 1, 7, 2, 9, 4, 6, 0])

Where ds[3] is the smallest value in ds and ds[0] is the biggest etc.

The current behaviour is confusing / nonsenical. Return the index unchanged implies that the initial array is already ordered.


INSTALLED VERSIONS
------------------
commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-425.3.1.el8.nci.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.9.3
netCDF4: 1.6.0
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: None
iris: 3.3.1
bottleneck: None
dask: 2022.11.0
distributed: 2022.11.0
matplotlib: 3.4.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2022.11.0
cupy: None
pint: None
sparse: 0.13.0
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 23.0.1
conda: None
pytest: None
mypy: None
IPython: 8.6.0
sphinx: None

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants