[FEATURE]: Add a replace method #6377
Comments
I agree this would be useful, and I've had to do similar things. It's the sort of area where pandas is stronger than xarray. We might want a more specific name than

@Huite thanks for the great proposal. Did you look at
Yeah I think maybe I don't think you'd want to this with Having said that, there is a faster and much cleaner implementation using

    def custom_replace2(da, to_replace, value):
        flat = da.values.ravel()
        sorter = np.argsort(to_replace)
        insertion = np.searchsorted(to_replace, flat, sorter=sorter)
        indices = np.take(sorter, insertion, mode="clip")
        replaceable = (to_replace[indices] == flat)
        out = flat.copy()
        out[replaceable] = value[indices[replaceable]]
        return da.copy(data=out.reshape(da.shape))

    # For small example: 4.1 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    # For the larger example: 14.4 ms ± 592 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    %timeit custom_replace2(da, to_replace, value)

This is equal to the implementation of The key trick is the same, relying on sorting. See e.g. also: https://stackoverflow.com/questions/16992713/translate-every-element-in-numpy-array-according-to-key
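To make the sorting trick easier to follow, here is a small step-by-step trace of the same sequence of NumPy calls on plain arrays. The input values are hypothetical and chosen only for illustration; `to_replace` is deliberately unsorted to show what `sorter` is for.

```python
import numpy as np

# Hypothetical inputs, chosen only to illustrate the steps.
flat = np.array([5, 2, 1, 7, 3])
to_replace = np.array([5, 1, 3])   # deliberately unsorted
value = np.array([50, 10, 30])

# Indices that would sort to_replace: [1, 2, 0] (sorted keys are 1, 3, 5).
sorter = np.argsort(to_replace)

# Position of each element of flat within the *sorted* keys: [2, 1, 0, 3, 1].
insertion = np.searchsorted(to_replace, flat, sorter=sorter)

# Map back to positions in the unsorted keys; "clip" guards the
# out-of-range position produced by 7, which is larger than every key.
indices = np.take(sorter, insertion, mode="clip")

# searchsorted only gives an insertion point, so check for an exact match.
replaceable = to_replace[indices] == flat

out = flat.copy()
out[replaceable] = value[indices[replaceable]]
print(out)  # [50  2 10  7 30]
```

The exact-match check is what keeps 2 and 7 untouched: both receive a valid insertion index, but the key found there differs from the element itself.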
See also #5048, though the discussion here is more thorough.
Thanks @dcherian for linking the other issue because that led me here. I'm all for this! Though I would like to add the consideration for doing this replacement in a coordinate, not just the data (parts of the suggested code like returning
The coordinate is a DataArray as well, so the following would work:

    # Example DataArray
    da = xr.DataArray(np.ones((3, 3)), {"y": [50.0, 60.0, 70.0], "x": [1.0, 2.0, 3.0]}, ("y", "x"))
    # Replace 50.0 and 60.0 by 5.0 and 6.0 in the y coordinate
    da["y"] = da["y"].replace_values([50.0, 60.0], [5.0, 6.0])

Your example in the other issue mentions one of the ways you'd replace in pandas, but for a dataframe. With a dataframe, there's quite some flexibility:

    df.replace({0: 10, 1: 100})
    df.replace({'A': 0, 'B': 5}, 100)
    df.replace({'A': {0: 100, 4: 400}})

I'd say the xarray counterpart of a DataFrame is a Dataset; the counterpart of a DataArray is a Series.

Other thoughts: some complexity comes in when implementing a

I think the question is what the signature should be. You could compare to reindex (https://xarray.pydata.org/en/stable/generated/xarray.Dataset.reindex.html) and have a "replacer" argument:

    da = da.replace({"y": ([50.0, 60.0], [5.0, 6.0])})
    da["y"] = da["y"].replace([50.0, 60.0], [5.0, 6.0])

The first one would also work for Datasets, but I personally prefer the second one for its simplicity (and which is maybe closer to
@Huite Indeed, you are right that working with a coordinate is easy if it works for DataArrays ... this is a good example of my pandas-oriented brain not quite being used to xarray just yet (though I do love it).

Regarding signature options for a Dataset ... given the two examples you state, I also personally prefer the look of the second one. However, the first one can be extremely useful for more complicated replacement needs because the input dict can be assembled programmatically prior to the replace call, for doing replaces in several subset DataArrays. I think the second version would require looping of some sort, or multiple calls at the very least. For me, in my context of renaming on coordinates (the index or columns in a DataFrame context), I often have to modify many things in both axes, which I do using one dictionary.

I suppose it's a matter of preference and of ease of implementation ... since I'm not the one doing the coding, I shall definitely defer to others on the latter point!
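As a rough illustration of the dict-style signature discussed here, the sketch below applies a programmatically assembled mapping to several coordinates in one call. Both `replace_array` and `replace_in_coords` are hypothetical helpers written for this example, not existing xarray API, and the sketch only handles coordinates, not data variables.

```python
import numpy as np
import xarray as xr


def replace_array(values, to_replace, value):
    """Hypothetical NumPy-level helper based on np.searchsorted."""
    flat = values.ravel()
    sorter = np.argsort(to_replace)
    insertion = np.searchsorted(to_replace, flat, sorter=sorter)
    indices = np.take(sorter, insertion, mode="clip")
    match = to_replace[indices] == flat
    out = flat.copy()
    out[match] = value[indices[match]]
    return out.reshape(values.shape)


def replace_in_coords(obj, replacer):
    """Hypothetical wrapper: replacer maps coordinate name -> (to_replace, value)."""
    new_coords = {}
    for name, (to_replace, value) in replacer.items():
        coord = obj[name]
        new_values = replace_array(
            coord.values, np.asarray(to_replace), np.asarray(value)
        )
        new_coords[name] = (coord.dims, new_values)
    # assign_coords returns a new object and leaves obj untouched.
    return obj.assign_coords(new_coords)


da = xr.DataArray(
    np.ones((3, 3)),
    {"y": [50.0, 60.0, 70.0], "x": [1.0, 2.0, 3.0]},
    ("y", "x"),
)
# One dictionary covering both axes, assembled before the call.
replacer = {"y": ([50.0, 60.0], [5.0, 6.0]), "x": ([3.0], [30.0])}
result = replace_in_coords(da, replacer)
```

The loop lives inside the helper, so the caller makes a single call regardless of how many coordinates are involved.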
Thanks @Huite
Is your feature request related to a problem?
If I have a DataArray of values:
And I'd like to replace to_replace=[1, 3, 5] by value=[10, 30, 50], there's no method da.replace(to_replace, value) to do this.
There's no easy way like pandas (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) to do this.
(Apologies if I've missed related issues, searching for "replace" gives many hits as the word is obviously used quite often.)
Describe the solution you'd like
I've had a try at a relatively efficient implementation below. I'm wondering whether it's a worthwhile addition to xarray?
Describe alternatives you've considered
Ignoring issues such as dealing with NaNs, chunks, etc., a simple dict lookup:
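The original snippet is not preserved in this copy of the issue. A minimal sketch of what a dict lookup could look like, operating on a plain NumPy array (the function name `dict_replace` is an assumption made for this example):

```python
import numpy as np


def dict_replace(values, to_replace, value):
    """Replace elements through a plain Python dict; unmatched elements pass through."""
    lookup = dict(zip(to_replace, value))
    flat = values.ravel()
    # One Python-level lookup per element: simple, but slow for large arrays.
    out = np.array([lookup.get(v, v) for v in flat.tolist()])
    return out.reshape(values.shape)


arr = np.array([[1, 2], [3, 4]])
replaced = dict_replace(arr, [1, 3, 5], [10, 30, 50])
```

This only needs the values to be hashable, not sortable, which is why it can serve as a fallback.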
Alternatively, leveraging pandas:
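The pandas snippet is also missing from this copy. One way it could be done, assuming a round trip through `pandas.Series.replace` on the flattened values (the wrapper name `pandas_replace` is hypothetical):

```python
import numpy as np
import pandas as pd


def pandas_replace(values, to_replace, value):
    """Flatten, let pandas.Series.replace do the work, then restore the shape."""
    flat = pd.Series(values.ravel())
    replaced = flat.replace(list(to_replace), list(value))
    return replaced.to_numpy().reshape(values.shape)


arr = np.array([[1, 2], [3, 4]])
replaced = pandas_replace(arr, [1, 3, 5], [10, 30, 50])
```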
But I also tried my hand at a custom implementation, letting np.unique do the heavy lifting:
Such an approach seems like it's consistently the fastest:
With the advantage growing with the number of values involved:
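The custom implementation and its timings are not preserved in this copy. The sketch below is not the author's code; it is one possible way to let `np.unique` do the heavy lifting, by remapping the (usually much smaller) array of unique values and then expanding it back with the inverse index. The name `unique_replace` is an assumption.

```python
import numpy as np


def unique_replace(values, to_replace, value):
    """Sketch: remap the unique values, then expand with the inverse index."""
    to_replace = np.asarray(to_replace)
    value = np.asarray(value)
    flat = values.ravel()
    if flat.size == 0:
        return values.copy()

    # uniques is sorted; uniques[inverse] reconstructs flat.
    uniques, inverse = np.unique(flat, return_inverse=True)

    # Start from the identity mapping, in a dtype that can hold the new values.
    mapped = uniques.astype(np.result_type(uniques, value))

    # Locate each key in the sorted uniques and keep only exact matches.
    pos = np.clip(np.searchsorted(uniques, to_replace), 0, uniques.size - 1)
    present = uniques[pos] == to_replace
    mapped[pos[present]] = value[present]

    return mapped[inverse].reshape(values.shape)


arr = np.array([[1, 2], [3, 1]])
replaced = unique_replace(arr, [1, 3, 5], [10, 30, 50])
```

Keys that do not occur in the data (5 here) are simply ignored.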
In my real-life example, with a DataArray of approx 110 000 elements, with 60 000 values to replace, the custom one takes 33 ms, the dict one takes 135 ms, while pandas takes 26 s (!).
Additional context
In all cases, we need to deal with NaNs, check the input, etc.:
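A rough sketch of what such handling might involve, assuming the sort-based lookup from the discussion above. The specific checks (equal lengths, no duplicate keys) and the name `checked_replace` are assumptions for illustration, not a proposed final design.

```python
import numpy as np


def checked_replace(values, to_replace, value):
    """Sort-based replace with basic input checks and explicit NaN handling."""
    to_replace = np.atleast_1d(np.asarray(to_replace))
    value = np.atleast_1d(np.asarray(value))
    if to_replace.ndim != 1 or to_replace.shape != value.shape:
        raise ValueError("to_replace and value must be 1-D and of equal length")
    if np.unique(to_replace).size != to_replace.size:
        raise ValueError("to_replace must not contain duplicates")

    flat = values.ravel()
    # astype copies, and widens the dtype if the new values require it.
    out = flat.astype(np.result_type(flat, value))

    # NaN never compares equal to itself, so a NaN key needs its own path.
    if to_replace.dtype.kind == "f":
        key_is_nan = np.isnan(to_replace)
    else:
        key_is_nan = np.zeros(to_replace.shape, dtype=bool)
    keys = to_replace[~key_is_nan]
    vals = value[~key_is_nan]

    if keys.size:
        sorter = np.argsort(keys)
        insertion = np.searchsorted(keys, flat, sorter=sorter)
        indices = np.take(sorter, insertion, mode="clip")
        match = keys[indices] == flat
        out[match] = vals[indices[match]]
    if key_is_nan.any():
        out[np.isnan(flat)] = value[key_is_nan][0]
    return out.reshape(values.shape)


data = np.array([1.0, np.nan, 3.0, 4.0])
replaced = checked_replace(data, [1.0, np.nan], [10.0, -1.0])
```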
I think it should be easy to use, e.g. by letting it operate on the numpy arrays so that e.g. apply_ufunc will work.
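For instance, a function that works on plain NumPy arrays can be wrapped with `xr.apply_ufunc`, which takes care of dimensions and coordinates. This is only a sketch: `replace_array` and `replace_values_sketch` are hypothetical names, and dask-backed arrays would need additional arguments that are not covered here.

```python
import numpy as np
import xarray as xr


def replace_array(values, to_replace, value):
    """NumPy-level replace: same shape in, same shape out."""
    flat = values.ravel()
    sorter = np.argsort(to_replace)
    insertion = np.searchsorted(to_replace, flat, sorter=sorter)
    indices = np.take(sorter, insertion, mode="clip")
    match = to_replace[indices] == flat
    out = flat.copy()
    out[match] = value[indices[match]]
    return out.reshape(values.shape)


def replace_values_sketch(da, to_replace, value):
    """Hypothetical wrapper that hands the underlying array to replace_array."""
    return xr.apply_ufunc(
        replace_array,
        da,
        kwargs={"to_replace": np.asarray(to_replace), "value": np.asarray(value)},
    )


da = xr.DataArray(np.array([[1, 2], [3, 4]]), dims=("y", "x"))
result = replace_values_sketch(da, [1, 3, 5], [10, 30, 50])
```

Because the function is elementwise in shape, no core dimensions need to be declared.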
The primary issue is whether values can be sorted; if they cannot, the dict lookup might be an okay fallback?
I've had a peek at the pandas implementation, but didn't become much wiser.
Anyway, for your consideration! I'd be happy to submit a PR.