feat(python): Implement bitmap unpacking #450

Merged
merged 4 commits into from
May 2, 2024

paleolimbot
Member

In prototyping a real-world use case, I remembered that unpacking bits is exceedingly difficult to get right if you need to support an arbitrary offset/length. The math for this is very fiddly and we spent a few rounds getting it right in the C function ArrowBitsUnpackInt(8|32). This PR makes that available so that we can do things like (1) convert bool arrays to numpy and (2) convert null masks to something that somebody else can work with (e.g., a numpy mask).

This seems to be relatively performant (thanks to @WillAyd's work optimizing this!)

import numpy as np
import nanoarrow as na
import pyarrow as pa

bool_np = np.random.random(int(1e6)) > 0.5
bool_na = na.Array(iter(bool_np), na.bool_())
bool_pa = pa.array(bool_np)

def to_numpy_na(x):
    x_view = na.c_array(x).view()
    out = np.empty(x_view.length, bool)
    x_view.buffer(1).unpack_bits_into(out, x_view.offset, x_view.length)
    return out

%timeit to_numpy_na(bool_na)
#> 162 µs ± 812 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit bool_pa.to_numpy(False)
#> 609 µs ± 833 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
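
For readers unfamiliar with the offset/length bookkeeping mentioned above, here is a minimal pure-Python sketch of what unpacking an Arrow-style (least-significant-bit-first) bitmap involves. The function name `unpack_bits_reference` is hypothetical and is not part of nanoarrow; the C implementation is far faster and this is only meant to illustrate the indexing.

```python
def unpack_bits_reference(packed, offset, length):
    """Unpack `length` bits starting at bit `offset` into one byte per bit.

    Hypothetical reference helper (not nanoarrow API). Assumes Arrow's
    bitmap layout: bit i lives in byte i // 8 at position i % 8 (LSB first).
    """
    data = bytes(packed)
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    if offset + length > len(data) * 8:
        raise ValueError("offset + length exceeds the number of bits available")

    out = bytearray(length)
    for i in range(length):
        bit = offset + i
        # Select the containing byte, shift the wanted bit down, mask it off
        out[i] = (data[bit // 8] >> (bit % 8)) & 1
    return out


# Bits (LSB first): byte 0 -> 1,0,1,1,0,0,1,0 ; byte 1 -> 1,1,0,0,0,0,0,0
bitmap = bytes([0b01001101, 0b00000011])
print(list(unpack_bits_reference(bitmap, 3, 6)))  # [1, 0, 0, 1, 0, 1]
```

Note that the requested range here starts mid-byte and crosses a byte boundary, which is the case that makes an optimized implementation tricky.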

@paleolimbot paleolimbot marked this pull request as ready for review May 1, 2024 19:52
Contributor

@WillAyd WillAyd left a comment

Really cool change - I can see there being a lot of uses for this

PyBuffer_Release(&buffer)
raise ValueError("Destination buffer has itemsize != 1")

if buffer.len < (dest_offset + length):
Contributor

Maybe worth also checking that dest_offset is non-negative? I think you'd get UB if a user decides to do something like unpack_bits_into(..., len(dest) + 1, -1) (not that they should...)

Member Author

Good catch! Thanks for the review!

@paleolimbot paleolimbot merged commit b2783d9 into apache:main May 2, 2024
6 checks passed
@paleolimbot paleolimbot deleted the python-unpack-bitmap branch May 2, 2024 17:01
paleolimbot added a commit that referenced this pull request May 10, 2024
This is the non-bitmap equivalent of #450, useful for the same purpose
(concatenating one big data buffer from chunks).
@paleolimbot paleolimbot added this to the nanoarrow 0.5.0 milestone May 22, 2024