-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(python): Python schema, array, and array view skeleton #117
Conversation
3d94cdd
to
8f888a9
Compare
Codecov Report
@@ Coverage Diff @@
## main #117 +/- ##
==========================================
- Coverage 87.65% 87.64% -0.01%
==========================================
Files 60 63 +3
Lines 9445 9788 +343
==========================================
+ Hits 8279 8579 +300
- Misses 1166 1209 +43
... and 2 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
cc8cc97
to
e313f28
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Just a few quick comments / questions
python/.gitignore
Outdated
@@ -18,7 +18,8 @@ | |||
|
|||
src/nanoarrow/nanoarrow.c | |||
src/nanoarrow/nanoarrow.h | |||
src/nanoarrow/*.cpp | |||
src/nanoarrow/nanoarrow_c.pxd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option could be to actually include / commit this file? Although that might be annoying if a non-python related PR updates the c files, this gets out of sync (if we only update it in PRs that touch python). And in the end maybe not much different than the .c and .h files we also ignore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a pretty useful file to have on hand...nanoarrow.c/h are built for vendoring and there's no reason another Python package couldn't use the same .pxd (provided it's generated from the correct nanoarrow.h).
Maybe it could get checked into the dist/ folder next to the bundled nanoarrow/h?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's also a good idea, if we are OK with having a python specific file in that directory. Then the automated bot could generate it together with the .c and .h file?
For local development, doesn't have that the same potential drawback of not using the correct one? (if you edited the c sources) Although the same is true currently for the .c and .h files, so I suppose the idea is to still keep the cmake build step to ensure you use the latest for local development?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking it into dist/ would be fine because users of that would just copy the whole folder into their python project and the .h/.pxd would stay together?
For local development and CI it's essential to have cmake generate the bundle (or else you could make changes in the C library that break the Python package and CI wouldn't catch it). Also important to include copy from ../dist as a fallback so that installing by URL works via pip. I tested this on Mac and Windows and Linux but it probably needs some flushing out.
python/bootstrap.py
Outdated
items = [item for item in self.re_struct_delim.split(body) if item] | ||
|
||
cython_body = f'\n{indent} '.join([''] + items) | ||
return f'{indent}cdef {type} {name}:{cython_body}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking the cdef
is unnecessary here (and the same for func below)
The cdef extern from ..
ensures cython already assumes this
python/setup.py
Outdated
# Run bootstrap.py to run cmake generating a fresh bundle based on this | ||
# checkout or copy from ../dist if the caller doesn't have cmake available | ||
this_dir = os.path.dirname(__file__) | ||
subprocess.run([sys.executable, os.path.join(this_dir, 'bootstrap.py')]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would maybe move the full content of bootstrap.py
into setup.py
, and then there is no need to call it here with a subprocess.
In the end, bootstrap.py is ran every time we build/install the project, and that is essentially the logic that typically belongs in setup.py (I also don't think there is a need to be able to call the bootstrap separately?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually (later), we will also want to have some better logic to determine if this step needs to be run or not, for example when distributing sdists (see eg https://github.com/apache/arrow-adbc/blob/923e0408fe5a32cc6501b997fafa8316ace25fe0/python/adbc_driver_manager/setup.py#L33-L50)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a bootstrap.R file too and I really like keep the logic separate since it's rather extensive and maybe should even be tested separately. I forgot about sdist...I made a few changes to make sure that (1) bootstrap.py wouldn't be included in sdist and (2) make sure not to error if it isn't there. Probably this will need some workshopping come 'we have to build 39348 wheels' time.
python/src/nanoarrow/_lib.pyx
Outdated
def _addr(self): | ||
return <uintptr_t>&self.c_array_view | ||
|
||
cdef class CSchema: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the reason you call it CSchema (with C prefix) because this represents a "C" Data Interface schema?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No good reason! I forget where I got the CSomething pattern (maybe pyarrow?). After starting with it I thought it might make room for a small nanoarrow.Schema class that has better autocomplete/is safer (maybe whose @property
s are things like .type
, .precision
, .scale
, .unit
, etc. instead of the lower-level .format
and friends). Maybe that is also planning for something that will never exist, too 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In pyarrow we have the pattern of calling the exposed C(++) names with a "C" prefix, and then the corresponding cython classes without.
So the equivalent here would be to use CArrowSchema (instead of ArrowSchema) for nanoarrow.c one (and so for the declarations in the pxd file), and then ArrowSchema for the python class here.
I assume that what you describe (CSchema and Schema) can be done in a single class as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're totally right..it's all just Schema
, Array
, etc. now. There probably should never be anything higher level in nanoarrrow anyway!
python/tests/test_nanoarrow.py
Outdated
def test_schema_basic(): | ||
# Blank invalid schema | ||
schema = na.CSchema.Empty() | ||
assert(schema.is_valid() is False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert(schema.is_valid() is False) | |
assert schema.is_valid() is False |
Styling nitpick: those brackets are not needed
(we should add black as pre-commit hook like in adbc, and that should also take care of this, I think: https://github.com/apache/arrow-adbc/blob/main/.pre-commit-config.yaml#L65)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated all the asserts...I might punt on setting up pre-commit-config (although it definitely should be done for more than just that!). I quite like how ADBC did it.
python/tests/test_nanoarrow.py
Outdated
schema = na.CSchema.Empty() | ||
pa.binary(12)._export_to_c(schema._addr()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nice to add a from_pyarrow
-like helper to avoid this constant pattern of creating an empty and calling export_to_c (this could also be just an internal helper for the tests if we don't want to expose that publicly (yet))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to punt on pyarrow integration just this second...for the tests it's only one extra line to _export_to_c()
.
python/src/nanoarrow/_lib.pyx
Outdated
def version(): | ||
return ArrowNanoarrowVersion().decode("UTF-8") | ||
|
||
cdef class CSchemaHolder: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those "Holder" classes are only for when we "own" the pointer, and then to ensure t is properly released? (to serve as the "base" object in that case, while if you create the CSchema/CArray class from another object that owns the pointer, this other object is the _base
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes...it could be a capsule or this object but is also usefully the parent array (so that you can do array.children[0]
and re-use the CSchema
class.
python/src/nanoarrow/_lib.pyx
Outdated
if result != NANOARROW_OK: | ||
raise ValueError(ArrowErrorMessage(&error)) | ||
|
||
return CArrayView(holder, holder._addr(), self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds a bit strange that "validate" returns a array view object.
(I would personally also add all properties of CArrayView like length, buffers, etc to the CArray class itself)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a bit of a rabbit hole but you're totally right. Now you can do array.view()
and schema.view()
, which is a bit more intuitive since you get a SchemaView
and ArrayView
back. I also completed the element accessors on the array and schema classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
I think I mentioned it before, but personally I am not sure what the benefit is of the splits between the base class and the View class (eg Array and ArrayView).
I understand the reason that this separate structure exists in C nanoarrow, but here we are already creating a class wrapping the ArrowArray/Schema/Stream struct anyway. And from a python user perspective, I wouldn't really be able to explain why this is needed. For example, being able to do arr.buffers[1]
is just simpler than arr.view().buffers[1]
(also interactively, doing the view()
makes that tab completion doesn't work nicely)
run: | | ||
# Needs editable install to run --doctest-cython | ||
pip install pytest-cython | ||
pip install -e python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed? (it was already installed in two steps above?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's installed without -e
above (maybe I'll just add it there, and maybe also add pytest-cython
to pyproject.toml while I'm at it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the pytest-cython need the package to be installed in editable mode? (I would expect it to work with the installed package as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Locally, yes. I've found it to be rather finicky (but useful enough to play along with for now 🤷 ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, we use this package in Arrow CI with a normal install of pyarrow (non-editable), so I think it should certainly work. What you might have noticed locally is that a test didn't pass after updating it if you didn't re-install it? (that's the reason for locally using editable, so you don't have to reinstall after every tiny change. But on CI that doesn't matter)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update this...there are many things that could have been wrong with my local setup!
return <uintptr_t>child | ||
|
||
|
||
cdef class ArrayViewChildren: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a need to have both ArrayChildren and ArrayViewChildren? If you want to look at the children, you could also do arr.children[0].view()
instead of arr.view().children[0]
? Or does those two ways give you different capabilities?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's slightly different: ArrowArrayViewSetArray()
is recursive, so [x for x in array_view.children]
involves fewer calculations than [x.view() for x in array.children]
. I don't know if that matters (maybe faster if you had many many columns).
def buffers(self): | ||
return tuple(<uintptr_t>self._ptr.buffers[i] for i in range(self._ptr.n_buffers)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a useful attribute in practice?
(I would expect it to return essentially what self.view().buffers
returns)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly just philosophically aligned with what the ArrowArray
actually provides. If you're trying to debug something this would tell you if one of your pointers is NULL...to get the view().buffers
the array would have to be validated against a Schema
.
python/nanoarrow/_lib.pyx
Outdated
self._cached_schema = Schema.allocate() | ||
self._get_schema(self._cached_schema) | ||
|
||
# Return an independent copy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could also call the stream's get_schema once, and return here the cached version? (I would expect that get_schema return value can never change over time?)
Or is this mostly to mimic the behaviour of the ArrowArrayStream.get_schema?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I wanted to make sure that calling .get_schema()
on the Python object definitely called .get_schema()
on the C object, but also that calling .get_next()
didn't call .get_schema()
every single time. I don't know that the first use-case is ever going to come up, or perhaps I'm overemphasizing the "test your C code from Python" use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I found middle ground here...calling .get_next()
always uses a cached schema but .get_schema()
always gets you a fresh one.
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
I'm happy to workshop this more in subsequent PRs...it's a tiny bit unfortunate that there a number of overlapping names; the Notably, the There was also a note about using the There are definitely unfinished points here, but I do want to merge this because it (and its limited test suite) get a lot of the boilerplate out of the way so that the discussion can focus on more substantial topics when it does arise. |
self, | ||
<uintptr_t>self._ptr.dictionary, | ||
self._schema.dictionary, | ||
None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this None in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The buffer shelter is None here because the child/dictionary's reference to self
is sufficient to keep the buffers valid.
python/nanoarrow/_lib.pyx
Outdated
self._length = array_view._array._ptr.n_buffers | ||
self._length = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just out of curiosity, the n_buffers
member of ArrowArray is not always accurate (or not what we want here)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just to remove the reference to the _array
, since that no longer exists. The ArrowArrayView
in C doesn't have an explicit n_buffers
member (but maybe should since the shenanigan on the following lines gets used quite a bit in C as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation! And yes, seeing this code here, adding n_buffers
to the view object might be useful.
Follow-up on #117, moving back to a src layout (https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/)
This PR is an attempt to add minimum usable Python bindings to the nanoarrow C library. That minimum scope is essentially just the ability to extract field values from
ArrowSchema
/ArrowArray
/ArrowArrayStream
objects in a way that will not crash Python. This PR also includes bindings for nanoarrow'sArrowSchemaView
(so that the parameters of parameterized types can be extracted) andArrowArrayView
(so that buffer types/sizes can be exported using the Python buffer protocol).I've updated the README to showcase the extent of the bindings as implemented in this PR; several basic examples are also provided below.
Example schema usage:
Example Array usage:
Example ArrayStream usage: