
Use numbagg for ffill by default #8389

Merged: 16 commits, Nov 25, 2023
Conversation

max-sixty (Collaborator) commented Oct 28, 2023

The main perf advantage here is that the array doesn't need to be unstacked & stacked, which is a huge win for large multi-dimensional arrays. (I was actually hitting a memory issue running an ffill of my own, and so thought I'd get this done!)

We could move these methods to DataWithCoords, since they're almost the same implementation between a DataArray & a Dataset, and exactly the same for numbagg's implementation.

For transparency — I'm not at my most confident about the "check for numbagg, then check for bottleneck" logic. But I'm more confident that just installing numbagg will work. And if that works well enough, we could consider supporting only numbagg for some of these in the future.

I also haven't done the benchmarks here — though the functions are relatively well benchmarked in numbagg. I'm somewhat trading off getting through these (rolling functions are coming up too) vs. doing fewer of them more slowly, and leaning towards the former, but feedback welcome...

  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst
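Forward fill itself is a cheap one-pass scan along a single axis, which is why avoiding the unstack/stack round-trip dominates the cost. As a plain-NumPy sketch of what `ffill`/`push` computes (the name `ffill_numpy` is hypothetical; xarray dispatches to numbagg or bottleneck rather than running this):

```python
import numpy as np

def ffill_numpy(a, axis=-1):
    """Forward-fill NaNs along `axis` (what `ffill`/`push` computes).

    Plain-NumPy sketch for illustration only; xarray dispatches to
    numbagg or bottleneck for the real implementation.
    """
    a = np.asarray(a, dtype=float)
    a = np.moveaxis(a, axis, -1)                   # scan along the last axis
    idx = np.where(np.isnan(a), 0, np.arange(a.shape[-1]))
    np.maximum.accumulate(idx, axis=-1, out=idx)   # index of last valid value so far
    filled = np.take_along_axis(a, idx, axis=-1)
    return np.moveaxis(filled, -1, axis)

# Leading NaNs stay NaN; later NaNs take the most recent valid value.
print(ffill_numpy([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan]))
```

Each series along the fill axis is handled independently, with no reshaping of the other dimensions.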

github-actions bot added the topic-backends, topic-zarr, and io labels on Oct 28, 2023
(Resolved review thread on xarray/core/nputils.py)
Illviljan added the run-benchmark label on Oct 29, 2023
dcherian (Contributor) commented

> the array doesn't need to be unstacked & stacked

Where does this happen?

We should be able to switch between bottleneck and numbagg here:

    def push(array, n, axis):

and here:

    def push(array, n, axis):

then you'll get dask-aware ffill for free :)
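The switch being suggested could look roughly like this. A hedged sketch, not the PR's actual code: the availability checks and the plain-NumPy fallback `_ffill_last_axis` are illustrative assumptions.

```python
import importlib.util
import numpy as np

def _ffill_last_axis(array):
    # Plain-NumPy forward fill along the last axis (illustrative fallback;
    # ignores the `limit`/`n` argument for brevity).
    array = np.asarray(array, dtype=float)
    idx = np.where(np.isnan(array), 0, np.arange(array.shape[-1]))
    np.maximum.accumulate(idx, axis=-1, out=idx)
    return np.take_along_axis(array, idx, axis=-1)

def push(array, n, axis):
    """Prefer numbagg, fall back to bottleneck, then plain NumPy."""
    if importlib.util.find_spec("numbagg") is not None:
        import numbagg
        return numbagg.ffill(array, limit=n, axis=axis)
    if importlib.util.find_spec("bottleneck") is not None:
        import bottleneck
        return bottleneck.push(array, n=n, axis=axis)
    assert axis in (-1, np.ndim(array) - 1), "sketch handles the last axis only"
    return _ffill_last_axis(array)
```

Routing both call sites through one function like this is what makes the dask case work: the dask wrapper only ever sees `push`, whichever backend is underneath.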

max-sixty (Collaborator, Author) commented Oct 29, 2023

> the array doesn't need to be unstacked & stacked
>
> Where does this happen?

Super weird — I have a stack trace from memray that has:

    da.ffill("t")
    ds = self._to_temp_dataset().unstack(dim, fill_value, sparse)
    result = result._unstack_once(d, stacked_indexes[d], fill_value, sparse)

It's a sampling profiler, so it misses frames. But after 10 mins of looking, I can't see how that would possibly happen — I can't see where this would get called.

Though IIRC bottleneck doesn't deal well with arrays with lots of dimensions, and that array has lots of dimensions; so I think it's still possible that something is unstacking to reduce the dimensions, even if I can't find it atm.

Anyway!!


> We should be able to switch between bottleneck and numbagg here:

OK great!

max-sixty (Collaborator, Author) commented

I refactored this to use the duck_array_ops; that's much better. Regrettably, all the previous code on this PR was just a worse version of that and was discarded.

I need to work through the version checks...

max-sixty (Collaborator, Author) commented

Unfortunately I'm seeing some segfaults when running with:

  • dask
  • target="parallel", which is the default for numbagg functions

This would not be a great experience. So I'll look at these before we consider merging; I suspect it'll be something upstream...

max-sixty (Collaborator, Author) commented

OK, this is all done I think:

  • Using the duck array module, since
        def module_available(module: str) -> bool:
    doesn't get the version. (Open to anyone making changes there; I don't think it's perfect, and it's now oddly named, though I probably won't be the person to make more refinements there...)
  • numbagg is fixed to switch to target="cpu" when running under multithreading
  • I also raised that issue with numba

max-sixty added the plan to merge label on Nov 24, 2023
dcherian (Contributor) left a review:

Just some nits

(Resolved review threads on xarray/core/duck_array_ops.py, xarray/core/nputils.py, and xarray/core/rolling_exp.py)
        raise ImportError(
            "numbagg >= 0.2.1 is required for rolling_exp but currently numbagg is not installed"
        )
-   elif _NUMBAGG_VERSION < Version("0.2.1"):
+   elif pycompat.mod_version("numbagg") < Version("0.2.1"):
dcherian (Contributor) commented:

shall we boost the min supported numbagg version?

max-sixty (Collaborator, Author) replied:

I think we basically keep to the 12-month cycle — it's some good infra, and I would like to be a good citizen for that process rather than deviate to save myself some boilerplate...

(Resolved review thread on xarray/tests/test_missing.py)
@@ -50,6 +50,10 @@ Documentation
Internal Changes
~~~~~~~~~~~~~~~~

- :py:meth:`DataArray.bfill` & :py:meth:`DataArray.ffill` now use numbagg by
  default, which is up to 5x faster on wide arrays on multi-core machines. (:pull:`8389`)
dcherian (Contributor) commented:

what are "wide" arrays?

max-sixty (Collaborator, Author) replied:

What's a better term? I agree it's not great now. A "wide" table has lots of columns that it can parallelize along. That is, the benchmarks at https://github.com/numbagg/numbagg/#nd are 5x better than bottleneck's, but just above, on a single column, they're identical, because the benefit comes from the parallelism.
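For intuition about why width matters here, a toy illustration (not a benchmark):

```python
import numpy as np

# "Wide" = many independent series. Forward-filling along the last axis
# treats each row separately, so a threaded kernel (numbagg's
# target="parallel") can split the 1000 rows across cores. A single
# 1-D series of the same total size is one inherently serial scan,
# so there is no axis to parallelize over.
wide = np.full((1000, 500), np.nan)
wide[:, 0] = 1.0                      # 1000 series, each filled independently
tall = np.full(500_000, np.nan)       # one long series: serial fill only
print(f"{wide.shape[0]} independent fills vs 1")
```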

max-sixty and others added 6 commits November 24, 2023 15:12
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
max-sixty merged commit a683790 into pydata:main on Nov 25, 2023 (28 checks passed)
max-sixty deleted the ffill-numbagg branch on November 25, 2023 21:06
dcherian added a commit to rabernat/xarray that referenced this pull request Nov 29, 2023
* upstream/main:
  Raise an informative error message when object array has mixed types (pydata#4700)
  Start renaming `dims` to `dim` (pydata#8487)
  Reduce redundancy between namedarray and variable tests (pydata#8405)
  Fix Zarr region transpose (pydata#8484)
  Refine rolling_exp error messages (pydata#8485)
  Use numbagg for `ffill` by default (pydata#8389)
  Fix bug for categorical pandas index with categories with EA dtype (pydata#8481)
  Improve "variable not found" error message (pydata#8474)
  Add whatsnew for pydata#8475 (pydata#8478)
  Allow `rank` to run on dask arrays (pydata#8475)
  Fix mypy tests (pydata#8476)
  Use concise date format when plotting (pydata#8449)
  Fix `map_blocks` docs' formatting (pydata#8464)
  Consolidate `_get_alpha` func (pydata#8465)
Labels: io · plan to merge · run-benchmark · topic-arrays · topic-backends · topic-dask · topic-rolling · topic-zarr
3 participants