Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Python schema, array, and array view skeleton #117

Merged
merged 52 commits into from
Jun 15, 2023

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Feb 23, 2023

This PR is an attempt to add minimum usable Python bindings to the nanoarrow C library. That minimum scope is essentially just the ability to extract field values from ArrowSchema/ArrowArray/ArrowArrayStream objects in a way that will not crash Python. This PR also includes bindings for nanoarrow's ArrowSchemaView (so that the parameters of parameterized types can be extracted) and ArrowArrayView (so that buffer types/sizes can be exported using the Python buffer protocol).

I've updated the README to showcase the extent of the bindings as implemented in this PR; several basic examples are also provided below.

Example schema usage:

import nanoarrow as na
import pyarrow as pa
schema = na.schema(pa.decimal128(10, 3))
print(schema.format)
#> d:10,3
print(schema.view().decimal_precision)
#> 10
print(schema.view().decimal_scale)
#> 3

Example Array usage:

array = na.array(pa.array(["one", "two", "three", None]))
print(array.length)
#> 4
print(array.null_count)
#> 1

import numpy as np
view = array.view()
[np.array(buffer) for buffer in view.buffers]
#> [array([7], dtype=uint8),
#>  array([ 0,  3,  6, 11, 11], dtype=int32),
#>  array([b'o', b'n', b'e', b't', b'w', b'o', b't', b'h', b'r', b'e', b'e'],
#>        dtype='|S1')]

Example ArrayStream usage:

pa_array_child = pa.array([1, 2, 3], pa.int32())
pa_array = pa.record_batch([pa_array_child], names=["some_column"])
reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])
array_stream = na.array_stream(reader)

print(array_stream.get_schema())
#> struct<some_column: int32>

for array in array_stream:
    print(array.length)
#> 3

print(array_stream.get_next() is None)
#> True

@codecov-commenter
Copy link

codecov-commenter commented Mar 9, 2023

Codecov Report

Merging #117 (6f530e8) into main (c738f90) will decrease coverage by 0.01%.
The diff coverage is 87.68%.

@@            Coverage Diff             @@
##             main     #117      +/-   ##
==========================================
- Coverage   87.65%   87.64%   -0.01%     
==========================================
  Files          60       63       +3     
  Lines        9445     9788     +343     
==========================================
+ Hits         8279     8579     +300     
- Misses       1166     1209      +43     
Impacted Files Coverage Δ
src/nanoarrow/nanoarrow_types.h 92.75% <ø> (ø)
python/nanoarrow/_lib.pyx 87.57% <87.57%> (ø)
python/nanoarrow/lib.py 88.00% <88.00%> (ø)
python/nanoarrow/__init__.py 100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@paleolimbot paleolimbot marked this pull request as ready for review March 9, 2023 16:08
@paleolimbot paleolimbot changed the title feat(python): Python schema wrapper feat(python): Python schema, array, and array view skeleton Mar 9, 2023
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Just a few quick comments / questions

@@ -18,7 +18,8 @@

src/nanoarrow/nanoarrow.c
src/nanoarrow/nanoarrow.h
src/nanoarrow/*.cpp
src/nanoarrow/nanoarrow_c.pxd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option could be to actually include / commit this file? Although that might be annoying if a non-python related PR updates the c files, this gets out of sync (if we only update it in PRs that touch python). And in the end maybe not much different than the .c and .h files we also ignore

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a pretty useful file to have on hand...nanoarrow.c/h are built for vendoring and there's no reason another Python package couldn't use the same .pxd (provided it's generated from the correct nanoarrow.h).

Maybe it could get checked into the dist/ folder next to the bundled nanoarrow/h?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's also a good idea, if we are OK with having a python specific file in that directory. Then the automated bot could generate it together with the .c and .h file?

For local development, doesn't have that the same potential drawback of not using the correct one? (if you edited the c sources) Although the same is true currently for the .c and .h files, so I suppose the idea is to still keep the cmake build step to ensure you use the latest for local development?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking it into dist/ would be fine because users of that would just copy the whole folder into their python project and the .h/.pxd would stay together?

For local development and CI it's essential to have cmake generate the bundle (or else you could make changes in the C library that break the Python package and CI wouldn't catch it). Also important to include copy from ../dist as a fallback so that installing by URL works via pip. I tested this on Mac and Windows and Linux but it probably needs some flushing out.

items = [item for item in self.re_struct_delim.split(body) if item]

cython_body = f'\n{indent} '.join([''] + items)
return f'{indent}cdef {type} {name}:{cython_body}'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking the cdef is unnecessary here (and the same for func below)

The cdef extern from .. ensures cython already assumes this

python/setup.py Outdated
# Run bootstrap.py to run cmake generating a fresh bundle based on this
# checkout or copy from ../dist if the caller doesn't have cmake available
this_dir = os.path.dirname(__file__)
subprocess.run([sys.executable, os.path.join(this_dir, 'bootstrap.py')])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would maybe move the full content of bootstrap.py into setup.py, and then there is no need to call it here with a subprocess.
In the end, bootstrap.py is ran every time we build/install the project, and that is essentially the logic that typically belongs in setup.py (I also don't think there is a need to be able to call the bootstrap separately?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually (later), we will also want to have some better logic to determine if this step needs to be run or not, for example when distributing sdists (see eg https://github.com/apache/arrow-adbc/blob/923e0408fe5a32cc6501b997fafa8316ace25fe0/python/adbc_driver_manager/setup.py#L33-L50)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bootstrap.R file too and I really like keep the logic separate since it's rather extensive and maybe should even be tested separately. I forgot about sdist...I made a few changes to make sure that (1) bootstrap.py wouldn't be included in sdist and (2) make sure not to error if it isn't there. Probably this will need some workshopping come 'we have to build 39348 wheels' time.

def _addr(self):
return <uintptr_t>&self.c_array_view

cdef class CSchema:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reason you call it CSchema (with C prefix) because this represents a "C" Data Interface schema?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No good reason! I forget where I got the CSomething pattern (maybe pyarrow?). After starting with it I thought it might make room for a small nanoarrow.Schema class that has better autocomplete/is safer (maybe whose @propertys are things like .type, .precision, .scale, .unit, etc. instead of the lower-level .format and friends). Maybe that is also planning for something that will never exist, too 🤷

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pyarrow we have the pattern of calling the exposed C(++) names with a "C" prefix, and then the corresponding cython classes without.
So the equivalent here would be to use CArrowSchema (instead of ArrowSchema) for nanoarrow.c one (and so for the declarations in the pxd file), and then ArrowSchema for the python class here.

I assume that what you describe (CSchema and Schema) can be done in a single class as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're totally right..it's all just Schema, Array, etc. now. There probably should never be anything higher level in nanoarrrow anyway!

def test_schema_basic():
# Blank invalid schema
schema = na.CSchema.Empty()
assert(schema.is_valid() is False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert(schema.is_valid() is False)
assert schema.is_valid() is False

Styling nitpick: those brackets are not needed

(we should add black as pre-commit hook like in adbc, and that should also take care of this, I think: https://github.com/apache/arrow-adbc/blob/main/.pre-commit-config.yaml#L65)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated all the asserts...I might punt on setting up pre-commit-config (although it definitely should be done for more than just that!). I quite like how ADBC did it.

schema = na.CSchema.Empty()
pa.binary(12)._export_to_c(schema._addr())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to add a from_pyarrow-like helper to avoid this constant pattern of creating an empty and calling export_to_c (this could also be just an internal helper for the tests if we don't want to expose that publicly (yet))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to punt on pyarrow integration just this second...for the tests it's only one extra line to _export_to_c().

def version():
return ArrowNanoarrowVersion().decode("UTF-8")

cdef class CSchemaHolder:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those "Holder" classes are only for when we "own" the pointer, and then to ensure t is properly released? (to serve as the "base" object in that case, while if you create the CSchema/CArray class from another object that owns the pointer, this other object is the _base)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes...it could be a capsule or this object but is also usefully the parent array (so that you can do array.children[0] and re-use the CSchema class.

if result != NANOARROW_OK:
raise ValueError(ArrowErrorMessage(&error))

return CArrayView(holder, holder._addr(), self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds a bit strange that "validate" returns a array view object.

(I would personally also add all properties of CArrayView like length, buffers, etc to the CArray class itself)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a bit of a rabbit hole but you're totally right. Now you can do array.view() and schema.view(), which is a bit more intuitive since you get a SchemaView and ArrayView back. I also completed the element accessors on the array and schema classes.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

I think I mentioned it before, but personally I am not sure what the benefit is of the splits between the base class and the View class (eg Array and ArrayView).
I understand the reason that this separate structure exists in C nanoarrow, but here we are already creating a class wrapping the ArrowArray/Schema/Stream struct anyway. And from a python user perspective, I wouldn't really be able to explain why this is needed. For example, being able to do arr.buffers[1] is just simpler than arr.view().buffers[1] (also interactively, doing the view() makes that tab completion doesn't work nicely)

run: |
# Needs editable install to run --doctest-cython
pip install pytest-cython
pip install -e python
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? (it was already installed in two steps above?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's installed without -e above (maybe I'll just add it there, and maybe also add pytest-cython to pyproject.toml while I'm at it).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the pytest-cython need the package to be installed in editable mode? (I would expect it to work with the installed package as well)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Locally, yes. I've found it to be rather finicky (but useful enough to play along with for now 🤷 ).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, we use this package in Arrow CI with a normal install of pyarrow (non-editable), so I think it should certainly work. What you might have noticed locally is that a test didn't pass after updating it if you didn't re-install it? (that's the reason for locally using editable, so you don't have to reinstall after every tiny change. But on CI that doesn't matter)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update this...there are many things that could have been wrong with my local setup!

python/nanoarrow/_lib.pyx Outdated Show resolved Hide resolved
python/nanoarrow/_lib.pyx Show resolved Hide resolved
return <uintptr_t>child


cdef class ArrayViewChildren:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need to have both ArrayChildren and ArrayViewChildren? If you want to look at the children, you could also do arr.children[0].view() instead of arr.view().children[0]? Or does those two ways give you different capabilities?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's slightly different: ArrowArrayViewSetArray() is recursive, so [x for x in array_view.children] involves fewer calculations than [x.view() for x in array.children]. I don't know if that matters (maybe faster if you had many many columns).

Comment on lines +459 to +460
def buffers(self):
return tuple(<uintptr_t>self._ptr.buffers[i] for i in range(self._ptr.n_buffers))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a useful attribute in practice?
(I would expect it to return essentially what self.view().buffers returns)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly just philosophically aligned with what the ArrowArray actually provides. If you're trying to debug something this would tell you if one of your pointers is NULL...to get the view().buffers the array would have to be validated against a Schema.

python/nanoarrow/_lib.pyx Outdated Show resolved Hide resolved
self._cached_schema = Schema.allocate()
self._get_schema(self._cached_schema)

# Return an independent copy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also call the stream's get_schema once, and return here the cached version? (I would expect that get_schema return value can never change over time?)

Or is this mostly to mimic the behaviour of the ArrowArrayStream.get_schema?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I wanted to make sure that calling .get_schema() on the Python object definitely called .get_schema() on the C object, but also that calling .get_next() didn't call .get_schema() every single time. I don't know that the first use-case is ever going to come up, or perhaps I'm overemphasizing the "test your C code from Python" use case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I found middle ground here...calling .get_next() always uses a cached schema but .get_schema() always gets you a fresh one.

@paleolimbot
Copy link
Member Author

I am not sure what the benefit is of the splits between the base class and the View class (eg Array and ArrayView).

I'm happy to workshop this more in subsequent PRs...it's a tiny bit unfortunate that there a number of overlapping names; the Array, Schema, and ArrayStream in nanoarrow are 1:1 aligned with the C data interface names (in the C library and in the R package, too). The Schema and Array in particular are poor analogues for the function of schemas and arrays elsewhere (notably the pyarrow or Arrow C++ concept of an Array). Here, it's "just" the thinnest possible wrapper around an ArrowArray*. The nanoarrow C library's ArrowArrayView functions more like an Array, as does the ArrayView here.

Notably, the ArrayView doesn't care where its buffer contents came from. The most logical way to populate its buffers is from an Array but you could also do it from serialized IPC (via tha IPC extension) or Python buffers or whatever. You can also modify the offset and length since it's not pointing to any particular ArrowArray that might be pointed to from elsewhere. Earlier versions of nanoarrow and of this PR didn't make that distinction evident and I've tried to do that here (e.g., by dropping every reference from the ArrayView class into the Array class).

There was also a note about using the src/ directory...I really did try to make it work with code coverage. The Cython coverage plugin is very finicky but given that most of this package is currently implemented in Cython, I think it's an important development tool to keep until I can figure out the proper pytest --cov invocation with the src/ directory version (or maybe somebody else can figure it out more quickly than I can).

There are definitely unfinished points here, but I do want to merge this because it (and its limited test suite) get a lot of the boilerplate out of the way so that the discussion can focus on more substantial topics when it does arise.

@paleolimbot paleolimbot merged commit 2f05e99 into apache:main Jun 15, 2023
@paleolimbot paleolimbot deleted the python-tidbits branch June 15, 2023 01:23
self,
<uintptr_t>self._ptr.dictionary,
self._schema.dictionary,
None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this None in this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The buffer shelter is None here because the child/dictionary's reference to self is sufficient to keep the buffers valid.

Comment on lines 735 to 766
self._length = array_view._array._ptr.n_buffers
self._length = 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity, the n_buffers member of ArrowArray is not always accurate (or not what we want here)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just to remove the reference to the _array, since that no longer exists. The ArrowArrayView in C doesn't have an explicit n_buffers member (but maybe should since the shenanigan on the following lines gets used quite a bit in C as well).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! And yes, seeing this code here, adding n_buffers to the view object might be useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants