Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: structure removal order for unions #3397

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

pfackeldey
Copy link
Collaborator

Fixes: #3180

Copy link
Collaborator

@ianna ianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pfackeldey - Great! Thanks for providing the fix so quickly! Indeed, iterating over the unique index fixes the issue.

@@ -1574,7 +1574,8 @@ def _remove_structure(
self, backend: Backend, options: RemoveStructureOptions
) -> list[Content]:
out = []
for i in range(len(self._contents)):
_, unique_index, *_ = numpy.unique_all(self._tags.data)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pfackeldey - should it be Numpy-like? To cover the cupy as a backend?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is to cover all backends (also typetracer).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, wait, you're right. I'm using the global numpy nplike. Let me have a look again to use the nplike of the content class... 🤦

@pfackeldey
Copy link
Collaborator Author

@pfackeldey - Great! Thanks for providing the fix so quickly! Indeed, iterating over the unique index fixes the issue.

this PR does however make the assumption that the number of unique tags corresponds to the len(self._contents), which I think should be always true. Could you confirm this is true? If yes, then I'm confident that this is the right fix 👍

@agoose77
Copy link
Collaborator

@pfackeldey just reading the comments and not the diff -- len(contents) may not equal unique(tags) if a content is not indexed. I think we have code-paths that can break this invariant, e.g.

import awkward as ak
import numpy as np
x = ak.Array([1, "hi", 2, "bye"])
y = x[[0,2]]
assert np.unique(y.layout.tags).size == len(y.layout.contents)

@pfackeldey
Copy link
Collaborator Author

@pfackeldey just reading the comments and not the diff -- len(contents) may not equal unique(tags) if a content is not indexed. I think we have code-paths that can break this invariant, e.g.

import awkward as ak
import numpy as np
x = ak.Array([1, "hi", 2, "bye"])
y = x[[0,2]]
assert np.unique(y.layout.tags).size == len(y.layout.contents)

Ah I see. In this case this implementation is still working correct though:

# with this PR
x = ak.Array([1, "hi", 2, "bye"])
y = x[[0,2]]
ak.ravel(y)
# <Array [1, 2] type='2 * int64'>

on main this actually fails, which I would consider wrong aswell (so this PR might fix another bug aswell?):

x = ak.Array([1, "hi", 2, "bye"])
y = x[[0,2]]
ak.ravel(y)
# ... > AssertionError: cannot merge NumpyArray with ListOffsetArray

The loop over the unique ordered tags will pick (in order) the contents where tags point to. This is always less or equal to the length of contents. Do you see any problem with that @agoose77 ?

@agoose77
Copy link
Collaborator

agoose77 commented Feb 10, 2025

@pfackeldey the original code has two bugs (which are the same kind of bug)! This PR fixes one, but not the other I think.

The intention of remove_structure is to allow the caller to obtain a collection of in-order zero-copy (where possible) 1D buffers such that (approximately) A_ij -> B_k (in Einstein notation). That's only partially possible here. I think of unions a bit like a flattened matrix, i.e. B_k where B_k1 and B_k2 may come from different contents. As such, a contiguous array of B_k will require copying each union item into a contiguous buffer.

For trivial-unions (e.g. a union of integer and bool), the proper way to flatten the layout in an order-preserving way is to write a kernel that fills a new buffer with result = [contents[tags[i]][i] for i in self.length]. But, for non-trivial unions that's a lot harder to figure out. For each content contents[j], there may be N associated values that end up in the structureless result.

Moreover, as we are dealing with a union, to remove the structure probably means we want to simplify a union over the results such that the union disappears and the result is simplified to a common type.

So, maybe the proper way to do this is to build a new union via .simplified over [content[i]._carry(...)._remove_structure() ...] and throw an error if the union doesn't vanish.

But, I also feel compelled to say -- _remove_structure does a few different things. It might be that there are some contexts where we don't care about the order invariant in the final result (e.g. non-positional reducers). In that case, perhaps we want to accept a flag that lets us take a fast path involving less carrying. Although, even then we would need to ensure that the type of the result is correct (i.e. figure out the result type and cast each content[i]._carry(...)._remove_structure() ...] to that type).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fill_none followed by ravel reorders arrays
3 participants