fix: structure removal order for unions #3397

pfackeldey · 2025-02-07T19:20:17Z

ianna

@pfackeldey - Great! Thanks for providing the fix so quickly! Indeed, iterating over the unique index fixes the issue.

ianna · 2025-02-10T13:31:32Z

src/awkward/contents/unionarray.py

@@ -1574,7 +1574,8 @@ def _remove_structure(
        self, backend: Backend, options: RemoveStructureOptions
    ) -> list[Content]:
        out = []
-        for i in range(len(self._contents)):
+        _, unique_index, *_ = numpy.unique_all(self._tags.data)


@pfackeldey - should it be Numpy-like? To cover the cupy as a backend?

yes, this is to cover all backends (also typetracer).

oh, wait, you're right. I'm using the global numpy nplike. Let me have a look again to use the nplike of the content class... 🤦

pfackeldey · 2025-02-10T14:05:06Z

@pfackeldey - Great! Thanks for providing the fix so quickly! Indeed, iterating over the unique index fixes the issue.

this PR does however make the assumption that the number of unique tags corresponds to the len(self._contents), which I think should be always true. Could you confirm this is true? If yes, then I'm confident that this is the right fix 👍

agoose77 · 2025-02-10T14:34:24Z

@pfackeldey just reading the comments and not the diff -- len(contents) may not equal unique(tags) if a content is not indexed. I think we have code-paths that can break this invariant, e.g.

import awkward as ak
import numpy as np
x = ak.Array([1, "hi", 2, "bye"])
y = x[[0,2]]
assert np.unique(y.layout.tags).size == len(y.layout.contents)

pfackeldey · 2025-02-10T14:47:33Z

@pfackeldey just reading the comments and not the diff -- len(contents) may not equal unique(tags) if a content is not indexed. I think we have code-paths that can break this invariant, e.g.
import awkward as ak
import numpy as np
x = ak.Array([1, "hi", 2, "bye"])
y = x[[0,2]]
assert np.unique(y.layout.tags).size == len(y.layout.contents)

Ah I see. In this case this implementation is still working correct though:

# with this PR
x = ak.Array([1, "hi", 2, "bye"])
y = x[[0,2]]
ak.ravel(y)
# <Array [1, 2] type='2 * int64'>

on main this actually fails, which I would consider wrong aswell (so this PR might fix another bug aswell?):

x = ak.Array([1, "hi", 2, "bye"])
y = x[[0,2]]
ak.ravel(y)
# ... > AssertionError: cannot merge NumpyArray with ListOffsetArray

The loop over the unique ordered tags will pick (in order) the contents where tags point to. This is always less or equal to the length of contents. Do you see any problem with that @agoose77 ?

agoose77 · 2025-02-10T15:15:30Z

@pfackeldey the original code has two bugs (which are the same kind of bug)! This PR fixes one, but not the other I think.

The intention of remove_structure is to allow the caller to obtain a collection of in-order zero-copy (where possible) 1D buffers such that (approximately) A_ij -> B_k (in Einstein notation). That's only partially possible here. I think of unions a bit like a flattened matrix, i.e. B_k where B_k1 and B_k2 may come from different contents. As such, a contiguous array of B_k will require copying each union item into a contiguous buffer.

For trivial-unions (e.g. a union of integer and bool), the proper way to flatten the layout in an order-preserving way is to write a kernel that fills a new buffer with result = [contents[tags[i]][i] for i in self.length]. But, for non-trivial unions that's a lot harder to figure out. For each content contents[j], there may be N associated values that end up in the structureless result.

Moreover, as we are dealing with a union, to remove the structure probably means we want to simplify a union over the results such that the union disappears and the result is simplified to a common type.

So, maybe the proper way to do this is to build a new union via .simplified over [content[i]._carry(...)._remove_structure() ...] and throw an error if the union doesn't vanish.

But, I also feel compelled to say -- _remove_structure does a few different things. It might be that there are some contexts where we don't care about the order invariant in the final result (e.g. non-positional reducers). In that case, perhaps we want to accept a flag that lets us take a fast path involving less carrying. Although, even then we would need to ensure that the type of the result is correct (i.e. figure out the result type and cast each content[i]._carry(...)._remove_structure() ...] to that type).

pfackeldey added 2 commits February 7, 2025 14:15

fix: order of structure removal for unions

e4dc991

add test

8e2375c

pfackeldey requested a review from ianna February 7, 2025 19:20

pfackeldey temporarily deployed to docs February 7, 2025 19:31 — with GitHub Actions Inactive

ianna approved these changes Feb 10, 2025

View reviewed changes

ianna reviewed Feb 10, 2025

View reviewed changes

use index_nplike of tags instead of global numpy nplike

f3f675f

pfackeldey deployed to docs February 10, 2025 14:31 — with GitHub Actions View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: structure removal order for unions #3397

fix: structure removal order for unions #3397

pfackeldey commented Feb 7, 2025

ianna left a comment

ianna Feb 10, 2025

pfackeldey Feb 10, 2025

pfackeldey Feb 10, 2025

pfackeldey commented Feb 10, 2025

agoose77 commented Feb 10, 2025

pfackeldey commented Feb 10, 2025

agoose77 commented Feb 10, 2025 •

edited

Loading

fix: structure removal order for unions #3397

Are you sure you want to change the base?

fix: structure removal order for unions #3397

Conversation

pfackeldey commented Feb 7, 2025

ianna left a comment

Choose a reason for hiding this comment

ianna Feb 10, 2025

Choose a reason for hiding this comment

pfackeldey Feb 10, 2025

Choose a reason for hiding this comment

pfackeldey Feb 10, 2025

Choose a reason for hiding this comment

pfackeldey commented Feb 10, 2025

agoose77 commented Feb 10, 2025

pfackeldey commented Feb 10, 2025

agoose77 commented Feb 10, 2025 • edited Loading

agoose77 commented Feb 10, 2025 •

edited

Loading