polyval: Use Horner's algorithm + support chunked inputs #6548
Conversation
Several open points still:
for more information, see https://pre-commit.ci
I noticed that broadcasting Datasets behaves weirdly (see #6549), so I used a "hack" of adding a 0-valued DataArray/Dataset.
This looks like excellent code @headtr1ck , especially so for a first PR. Welcome!
I think it's fine to do the hack and reference that issue; we can clean it up then (though if others have ideas then great).
IIRC we generally leave this up to the caller (and we generally discourage mutation). Someone who knows the algorithm better should review it and opine on the datetime issue; I agree, I'm not sure whether we need to support them.
Some performance comparisons:

With a 5th-order polynomial and 10,000 x-values:

With a 5th-order polynomial and 1 million x-values:

As expected, for small arrays the new method adds some overhead, but for larger arrays the speedup is quite nice.
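For reference, Horner's scheme rewrites `a_n*x**n + ... + a_1*x + a_0` as nested multiply-adds, so evaluating an n-th order polynomial costs only n multiplications and n additions per point. A minimal NumPy sketch of the idea (the `horner` function name is illustrative, not the xarray implementation):

```python
import numpy as np

def horner(x, coeffs):
    # coeffs ordered from the highest degree down to the constant term
    result = np.zeros_like(x, dtype=float)
    for c in coeffs:
        result = result * x + c  # one multiply-add per degree
    return result

x = np.linspace(-1.0, 1.0, 5)
coeffs = [2.0, -3.0, 1.0]  # 2*x**2 - 3*x + 1
np.testing.assert_allclose(horner(x, coeffs), 2 * x**2 - 3 * x + 1)
```

Because each step is a single elementwise multiply-add, the loop also maps naturally onto chunked (dask) arrays.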
I added rough support for datetime values. Someone with more knowledge of handling them should take a look; the code seems too complicated and I am sure there is a more clever solution (I could not use
I think keeping support is nice, since datetimes are commonly occurring coordinates and we do not want to break anything if possible.
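A common way to handle datetime coordinates (a sketch of the general idea, not necessarily what this PR does) is to convert them to numeric offsets relative to the first value before evaluating the polynomial:

```python
import numpy as np

x = np.array(["2000-01-01", "2000-01-02", "2000-01-04"], dtype="datetime64[ns]")
# offsets from the first timestamp, as float seconds
x_numeric = (x - x[0]) / np.timedelta64(1, "s")
# x_numeric is now [0.0, 86400.0, 259200.0]
```

The choice of epoch and unit changes the fitted coefficients, so evaluation must use the same conversion as the fit.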
Thanks @headtr1ck . What a great PR! This looks like a great improvement; the tests are certainly more readable.
I've left some minor comments.
One optional request: since you seem to have benchmark code, can you add it to benchmarks/polyfit.py (new file)? If this is too much, just add the code in a comment here and I'll send in a follow-up PR. We have some documentation on using asv here.
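For context, an asv benchmark is a plain class whose `setup` runs before each timed `time_*` method. A minimal sketch of the shape such a file could take (class and parameter names are illustrative, and `np.polyval` stands in for the xarray function to keep the sketch self-contained):

```python
import numpy as np

class Polyval:
    # asv calls setup() before timing each time_* method, so array
    # allocation is excluded from the measurement
    params = [10_000, 1_000_000]
    param_names = ["nx"]

    def setup(self, nx):
        self.x = np.random.randn(nx)
        self.coeffs = np.random.randn(6)  # 5th-order polynomial

    def time_polyval(self, nx):
        np.polyval(self.coeffs, self.x)
```

Keeping all array construction in `setup` is exactly the refactor discussed below: otherwise allocation cost and memory use leak into the timings.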
If you're confident this is faster and it's not a trivial amount of work to adjust them, I would leave it...
Edit: never mind, it was only confusing output while the benchmark was failing. Now the benchmark looks good :) First time working with asv... We should refactor all DataArrays into the setup functions, otherwise O(n)-memory algorithms will show wrong numbers, and adding new tests will show regressions on other tests.
Nice, this looks like an improvement for everything other than dask arrays with only 100 elements, which is not a good use-case for dask. I was slightly concerned that the recursive algorithm wouldn't work well with dask, but it does seem to work better.

```python
import dask.array
import numpy as np
import xarray as xr


def other_polyval(coord, coeffs, degree_dim="degree"):
    x = coord.data
    deg_coord = coeffs[degree_dim]
    N = int(deg_coord.max()) + 1
    lhs = xr.DataArray(
        np.stack([x ** (N - 1 - i) for i in range(N)], axis=1),
        dims=(coord.name, degree_dim),
        coords={
            coord.name: coord.data,
            degree_dim: np.arange(deg_coord.max() + 1)[::-1],
        },
    )
    return xr.dot(lhs, coeffs, dims=degree_dim)


# give coeffs an explicit integer "degree" coordinate so coeffs[degree_dim] resolves
coeffs = xr.DataArray(np.random.randn(2), dims="degree", coords={"degree": [1, 0]})
da = xr.DataArray(dask.array.random.random(10**6, chunks=10000), dims="x")
# compare task-graph sizes of the two implementations
print(len(da.data.dask))
print(len(xr.polyval(da, coeffs).data.dask))
print(len(other_polyval(da, coeffs).data.dask))
```
One minor open point: what to do with a non-integer "degree" index?
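For illustration, integer degrees matter because they act as positional exponents: a sparse integer "degree" index can be densified by scattering the coefficients into a zero array, but a non-integer degree (e.g. 1.5) has no such positional meaning. A generic NumPy sketch (variable names are made up):

```python
import numpy as np

degrees = np.array([0, 2])      # sparse but integer "degree" index
values = np.array([1.0, 2.0])   # represents 1 + 2*x**2
dense = np.zeros(degrees.max() + 1)
dense[degrees] = values         # -> [1.0, 0.0, 2.0]
```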
Personally I would allow coeffs without an explicit index, since I am a lazy person and would like to do ... But I am happy with this code and look forward to using it in my projects :)
Forcing the user to be explicit reduces bugs and user-support requests :) so we like to do that. Thanks again @headtr1ck, this is a great PR!
Thanks @headtr1ck !
* main: (24 commits)
  - Fix overflow issue in decode_cf_datetime for dtypes <= np.uint32 (pydata#6598)
  - Enable flox in GroupBy and resample (pydata#5734)
  - Add setuptools as dependency in ASV benchmark CI (pydata#6609)
  - change polyval dim ordering (pydata#6601)
  - re-add timedelta support for polyval (pydata#6599)
  - Minor Dataset.map docstr clarification (pydata#6595)
  - New inline_array kwarg for open_dataset (pydata#6566)
  - Fix polyval overloads (pydata#6593)
  - Restore old MultiIndex dropping behaviour (pydata#6592)
  - [docs] add Dataset.assign_coords example (pydata#6336) (pydata#6558)
  - Fix zarr append dtype checks (pydata#6476)
  - Add missing space in exception message (pydata#6590)
  - Doc Link to accessors list in extending-xarray.rst (pydata#6587)
  - Fix Dataset/DataArray.isel with drop=True and scalar DataArray indexes (pydata#6579)
  - Add some warnings about rechunking to the docs (pydata#6569)
  - [pre-commit.ci] pre-commit autoupdate (pydata#6584)
  - terminology.rst: fix link to Unidata's "netcdf_dataset_components" (pydata#6583)
  - Allow string formatting of scalar DataArrays (pydata#5981)
  - Fix mypy issues & reenable in tests (pydata#6581)
  - polyval: Use Horner's algorithm + support chunked inputs (pydata#6548)
  - ...
commit 398f1b6
Author: dcherian <deepak@cherian.net>
Date: Fri May 20 08:47:56 2022 -0600

    Backward compatibility dask

commit bde40e4
Merge: 0783df3 4cae8d0
Author: dcherian <deepak@cherian.net>
Date: Fri May 20 07:54:48 2022 -0600

    Merge branch 'main' into dask-datetime-to-numeric

    * main:
      concatenate docs style (pydata#6621)
      Typing for open_dataset/array/mfdataset and to_netcdf/zarr (pydata#6612)
      {full,zeros,ones}_like typing (pydata#6611)

commit 0783df3
Merge: 5cff4f1 8de7061
Author: dcherian <deepak@cherian.net>
Date: Sun May 15 21:03:50 2022 -0600

    Merge branch 'main' into dask-datetime-to-numeric

    * main: (24 commits, same list as quoted above) ...

commit 5cff4f1
Merge: dfe200d 6144c61
Author: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
Date: Sun May 1 15:16:33 2022 -0700

    Merge branch 'main' into dask-datetime-to-numeric

commit dfe200d
Author: dcherian <deepak@cherian.net>
Date: Sun May 1 11:04:03 2022 -0600

    Minor cleanup

commit 35ed378
Author: dcherian <deepak@cherian.net>
Date: Sun May 1 10:57:36 2022 -0600

    Support dask arrays in datetime_to_numeric
whats-new.rst