Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: don't project categorical in ak._v2.packed #1689

Merged
merged 3 commits into from
Sep 8, 2022

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Sep 8, 2022

Fixes #1688 by only packing the Indexed[Option]Array's contents if it is a categorical type.

In the long run, it would be nice if we could mark a layout node as non-transient so that we can implement support for this without hard-coding a test for __array__ == "categorical"

FIXME: in future we should be able to specify this at the behaviour level
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! For the record (we talked about this on Slack), we decided to keep this simple and not remove unused elements from the content. Also, @ivirshup, who raised the issue, argued against it in #1688 (comment).

src/awkward/_v2/contents/indexedarray.py Show resolved Hide resolved
src/awkward/_v2/contents/indexedarray.py Show resolved Hide resolved
@jpivarski
Copy link
Member

Also for the sake of history, this was @agoose77's attempt to remove unused contents:

        if self.parameter("__array__") == "categorical":
            index_nplike = self._nplike.index_nplike
            index = index_nplike.asarray(self._index.data)
            mask = index_nplike.zeros(len(self._content), dtype=np.bool_)
            mask[index] = True
            tag_valid, = index_nplike.nonzero(mask)
            index_valid = index_nplike.searchsorted(tag_valid, index)
            return IndexedArray(
                ak._v2.index.Index(index_valid),
                self._content._carry(ak._v2.index.Index(tag_valid), False),
                self._identifier,
                self._parameters,
                self._nplike
            )

and I was thinking about it along these lines:

>>> index = np.array([4, 2, 6, 1, 5, 5, 2])
>>> content = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
>>> uniques, inverse = np.unique(index, return_inverse=True)
>>> uniques
array([1, 2, 4, 5, 6])
>>> inverse
array([2, 1, 4, 0, 3, 3, 1])
>>> uniques[inverse]
array([4, 2, 6, 1, 5, 5, 2])

Both of these are O(n log n), either in the np.searchsorted or in the np.unique (because it sorts). For what we want, it wouldn't have to sort, but we might have needed to write a custom kernel to actually get O(n).

@jpivarski jpivarski merged commit 3da5f32 into main Sep 8, 2022
@jpivarski jpivarski deleted the agoose77/fix-categorical-packed-2 branch September 8, 2022 16:36
@jpivarski
Copy link
Member

Since this merged PR is based on a branch named agoose77/fix-categorical-packed-2, would it be safe to say that agoose77/fix-categorical-packed can be deleted? If so, please do it. (I'm cleaning up dead branches.)

@agoose77
Copy link
Collaborator Author

Done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ak.packed casts categorical to value type
2 participants