DataArray.argsort should be deleted #1635

seth-p · 2017-10-17T13:52:54Z

Originally posted to https://groups.google.com/forum/#!topic/xarray/wsxeiIPLhgM

DataArray.argsort() appears to simply wrap the result of DataArray.values.argsort() in a same-shape DataArray. This is semantically nonsensical. If anything the index on the resulting argsort() values should simply be range(len(da)), but that adds little to the underlying numpy structure. And there's not much reason to have da.argsort() simply return the (raw) result of da.values.argsort(). So really DataArray.argsort() should simply be deleted.

On the other hand a new function DataArray.rank() that wraps da.values.argsort().argsort() (note repeated call to ndarray.argsort()) in the structure of the original DataArray would make sense, and perhaps even be useful... (Note that I'm not claiming that .argsort().argsort() is the fastest way to calculate this, but it's probably good enough, at least for an initial implementation.)

The text was updated successfully, but these errors were encountered:

shoyer · 2017-10-17T22:04:26Z

It seems like another reasonable choice would be for DataArray.argsort() to keep its current value but drop coordinate labels along the sorting dimension. This would give a consistent result for use with broadcasting indexing (#1473).

I think this internal utility function is equivalent to your second argsort:

xarray/xarray/core/nputils.py

Lines 38 to 55 in 2949558

    
           def inverse_permutation(indices): 
        
               """Return indices for an inverse permutation. 
        
               Parameters 
        
               ---------- 
        
               indices : 1D np.ndarray with dtype=int 
        
                   Integer positions to assign elements to. 
        
               Returns 
        
               ------- 
        
               inverse_permutation : 1D np.ndarray with dtype=int 
        
                   Integer indices to take from the original array to create the 
        
                   permutation. 
        
               """ 
        
               # use intp instead of int64 because of windows :( 
        
               inverse_permutation = np.empty(len(indices), dtype=np.intp) 
        
               inverse_permutation[indices] = np.arange(len(indices), dtype=np.intp) 
        
               return inverse_permutation

In practice, we might use bottleneck.rankdata() or nanrankdata():
https://kwgoodman.github.io/bottleneck-doc/reference.html#non-reduce-with-axis

seth-p · 2017-10-18T15:08:57Z

I think that makes sense, though I don't quite understand what would go in its place. Another possibility -- perhaps a bad one -- is to permute the values in the sorted dimension so that they match the resulting values (i.e. something like result.coords[dim] = np.take(da.coords[dim].values, result.values, axis=axis)).

Note that ndarray.argsort(axis=None) sorts the flattened array, so the returned DataArray should respect this

Alternative suggestion: have DataArray.argsort() return an ndarray filled with labels from the sorted dimension, i.e. something like:

class DataArray:
    def argsort(self, **kwargs):
        # TODO: update kwargs['axis'] based 'axis' and 'dim', and remove 'dim'
        if kwargs['axis'] is None:
            kwargs['axis'] = -1
            return self.stack(dim=self.dims).argsort(**kwargs)
        return np.take(self.coords[self.dims[kwargs['axis']].values, self.values.argsort(**kwargs))

BTW, I'm just thinking in terms of ndarrays. Someone more knowledgeable than me may want to consider how to make it work intelligently with dask.

shoyer · 2017-10-18T15:31:56Z

Note that ndarray.argsort(axis=None) sorts the flattened array, so the returned DataArray should respect this

I'm not a huge fan of auto-flattening for xarray, but I can see this logic.

Alternative suggestion: have DataArray.argsort() return an ndarray filled with labels from the sorted dimension

I would probably implement a new method for this, maybe idxsort for symmetry with idxmax (#60). Though that name could be read several different ways. Maybe sortby_labels()?

seth-p · 2017-10-18T17:48:28Z

I'm not a fan of auto-flattening either, but that's what nd.argsort() does...

One option is to have DataArray.arg{min,max,sort}() all take an optional flag argument specifying whether to return integer indices or index labels. But I think my preference would be be to have six separate functions: DataArray.{idx,}arg{min,max,sort}() (or some such nomenclature that includes arg in all six functions).

stale · 2019-10-22T06:58:59Z

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

seth-p · 2019-10-23T04:35:37Z

I think this issue is still relevant.

anton-seaice · 2023-03-10T02:31:27Z

I agree that at least the index should show the reflect the order of the ranking found using argsort.

import xarray as xr
import numpy as np

np.random.rand(10)

ds = xr.DataArray(np.random.rand(10) )

ds.argsort().dim_0

returns
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

however it should return an array with the order of the elements ranked using argsort
e.g.

array([3, 8, 5, 1, 7, 2, 9, 4, 6, 0])

Where ds[3] is the smallest value in ds and ds[0] is the biggest etc.

The current behaviour is confusing / nonsenical. Return the index unchanged implies that the initial array is already ordered.


INSTALLED VERSIONS
------------------
commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-425.3.1.el8.nci.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.9.3
netCDF4: 1.6.0
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: None
iris: 3.3.1
bottleneck: None
dask: 2022.11.0
distributed: 2022.11.0
matplotlib: 3.4.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2022.11.0
cupy: None
pint: None
sparse: 0.13.0
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 23.0.1
conda: None
pytest: None
mypy: None
IPython: 8.6.0
sphinx: None

0x0L mentioned this issue Nov 20, 2017

Rank function #1731

Closed

jhamman added the design question label Nov 21, 2017

stale bot added the stale label Oct 22, 2019

stale bot removed the stale label Oct 23, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataArray.argsort should be deleted #1635

DataArray.argsort should be deleted #1635

seth-p commented Oct 17, 2017

shoyer commented Oct 17, 2017 •

edited

Loading

seth-p commented Oct 18, 2017

shoyer commented Oct 18, 2017

seth-p commented Oct 18, 2017 •

edited

Loading

stale bot commented Oct 22, 2019

seth-p commented Oct 23, 2019

anton-seaice commented Mar 10, 2023

DataArray.argsort should be deleted #1635

DataArray.argsort should be deleted #1635

Comments

seth-p commented Oct 17, 2017

shoyer commented Oct 17, 2017 • edited Loading

seth-p commented Oct 18, 2017

shoyer commented Oct 18, 2017

seth-p commented Oct 18, 2017 • edited Loading

stale bot commented Oct 22, 2019

seth-p commented Oct 23, 2019

anton-seaice commented Mar 10, 2023

shoyer commented Oct 17, 2017 •

edited

Loading

seth-p commented Oct 18, 2017 •

edited

Loading