Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for dask-awkward arrays in vector constructors #429

Merged
merged 14 commits into from
Mar 5, 2024

Conversation

Saransh-cpp
Copy link
Member

@Saransh-cpp Saransh-cpp commented Feb 28, 2024

Description

Fixes #301

Kindly take a look at CONTRIBUTING.md.

Please describe the purpose of this pull request. Reference and link to any relevant issues or pull requests.

Checklist

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't any other open Pull Requests for the required change?
  • Does your submission pass pre-commit? ($ pre-commit run --all-files or $ nox -s lint)
  • Does your submission pass tests? ($ pytest or $ nox -s tests)
  • Does the documentation build with your changes? ($ cd docs; make clean; make html or $ nox -s docs)
  • Does your submission pass the doctests? ($ pytest --doctest-plus src/vector/ or $ nox -s doctests)

Before Merging

  • Summarize the commit messages into a brief review of the Pull request.

@Saransh-cpp Saransh-cpp marked this pull request as draft February 28, 2024 12:59
Copy link

codecov bot commented Feb 28, 2024

Codecov Report

Attention: Patch coverage is 84.61538% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 82.81%. Comparing base (6e1f79f) to head (57706a8).
Report is 2 commits behind head on main.

Files Patch % Lines
src/vector/backends/awkward_constructors.py 84.61% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #429      +/-   ##
==========================================
- Coverage   82.92%   82.81%   -0.11%     
==========================================
  Files          96       96              
  Lines       11533    11559      +26     
==========================================
+ Hits         9564     9573       +9     
- Misses       1969     1986      +17     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Saransh-cpp Saransh-cpp marked this pull request as ready for review February 28, 2024 13:24
Comment on lines 323 to 334
needs_behavior = not vector._awkward_registered
for x in arrays:
if needs_behavior:
if x.behavior is None:
x.behavior = vector.backends.awkward.behavior
else:
x.behavior = dict(x.behavior)
x.behavior.update(vector.backends.awkward.behavior)
else:
x.behavior = None
needs_behavior = False

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this piece of code is still required. If it is, is there a way to set behavior for dask_awkward arrays?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code block looks important. We don't want to add a giant behavior dict to every array if the relevant behaviors have already been installed globally.

All of the ak.* functions can be used on dask-awkward arrays1, so the code block that you have below, which builds the vector by calling ak.with_name passing the behavior gets it into the new dask-awkward array.

Footnotes

  1. Any ak.* function that takes ak.Arrays as arguments and returns an ak.Array can also be a function that takes dak.Arrays as arguments and returns a dak.Array. Awkward now has an overload mechanism like NumPy's.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Given that behavior of a dask_awkward.Array has no setter, should I use conditional statements here to preserve the code block (subsequently using if while passing behavior in awkward.with_name) or is there a better way to do this?

if vector._is_awkward_v2 and not isinstance(args[0], dask_awkward.Array):
    needs_behavior = not vector._awkward_registered
    for x in arrays:
        if needs_behavior:
            if x.behavior is None:
                x.behavior = vector.backends.awkward.behavior
            else:
                x.behavior = dict(x.behavior)
                x.behavior.update(vector.backends.awkward.behavior)
        else:
            x.behavior = None
        needs_behavior = False

assert 2 <= dimension <= 4, f"Dimension must be between 2-4, not {dimension}"

return awkward.with_name(
    awkward.zip(
        dict(__builtins__["zip"](names, arrays)),  # type: ignore[index]
        depth_limit=akarray.layout.purelist_depth,
    ),
    _recname(is_momentum, dimension),
    behavior=vector.backends.awkward.behavior
    if vector._is_awkward_v2 and isinstance(args[0], dask_awkward.Array))
    else None,
)

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good, but I think you need the block that checks to see if the Vector behaviors have been installed globally before deciding whether to attach them all to the output array.

Comment on lines 323 to 334
needs_behavior = not vector._awkward_registered
for x in arrays:
if needs_behavior:
if x.behavior is None:
x.behavior = vector.backends.awkward.behavior
else:
x.behavior = dict(x.behavior)
x.behavior.update(vector.backends.awkward.behavior)
else:
x.behavior = None
needs_behavior = False

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code block looks important. We don't want to add a giant behavior dict to every array if the relevant behaviors have already been installed globally.

All of the ak.* functions can be used on dask-awkward arrays1, so the code block that you have below, which builds the vector by calling ak.with_name passing the behavior gets it into the new dask-awkward array.

Footnotes

  1. Any ak.* function that takes ak.Arrays as arguments and returns an ak.Array can also be a function that takes dak.Arrays as arguments and returns a dak.Array. Awkward now has an overload mechanism like NumPy's.

@Saransh-cpp Saransh-cpp force-pushed the dask-awkward-supprt branch from f50e2da to 4c9b5e8 Compare March 5, 2024 14:06
@Saransh-cpp Saransh-cpp requested a review from jpivarski March 5, 2024 14:37
@Saransh-cpp
Copy link
Member Author

@jpivarski, could you please take a quick look at this again? I'll merge this today. Thanks!

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation is right: you just use ak.with_name and it does the right thing for dask_awkward.Arrays.

I'm a little unsure of the typing, and this goal of having mypy be able to type-check the library and user code might be competing with other goals of simplicity and not getting tangled in dependencies that may be unnecessary at runtime.

Both goals should be attainable. I'd like someone with more Python typing experience to check it, maybe @henryiii? (If you're not too busy?)

Comment on lines 314 to 319
ak_array_types: tuple[typing.Any, ...] = (awkward.Array,)

if vector._is_awkward_v2:
import dask_awkward # type: ignore[import-not-found]

ak_array_types += (dask_awkward.Array,)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of why I think we should drop Awkward 1 support soon. Generally, the typing code should be declarative.

By the way, I think you can declare a dask_awkward.Array dependency without importing dask_awkward by using a string: "dask_awkward.Array". I'm not 100% sure about typing syntax; maybe this only works past a particular Python version. However, we want to be able to allow some types without necessarily importing them (a start-up time cost) or even having them installed (a dependency-complexity cost).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, yes, that makes sense. I could remove awkward v1 support in this PR itself, but it might be better to do that in the v1.3.1/v1.4.0 release. PEP 484 mentions forward referencing to deal with such cases, but that will only be helpful while adding type hints.

I have pushed a commit rewriting the logic for handling dask_awkward arrays. The implementation does not require importing dask_awkward. Please let me know if that works!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping a major dependency version should happen at some significant version threshold in Vector. Maybe 1.3.x → 1.4.0, or maybe 1.x → 2.0.0.

I like the idea of 2.0.0 (Uproot increases major version when Awkward increases major version), but this has scared users in the past: seeing Uproot 4 → 5 made some people think that the changes in Uproot were major, rather than the changes in its dependencies.

So maybe 1.4.0, yeah.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'll drop awkward v1 in the next release.

@Saransh-cpp Saransh-cpp merged commit e219431 into scikit-hep:main Mar 5, 2024
22 of 23 checks passed
@Saransh-cpp Saransh-cpp deleted the dask-awkward-supprt branch March 5, 2024 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for dask-awkward arrays
2 participants