feat(python): Python schema, array, and array view skeleton #117

paleolimbot · 2023-02-23T19:07:35Z

This PR is an attempt to add minimum usable Python bindings to the nanoarrow C library. That minimum scope is essentially just the ability to extract field values from ArrowSchema/ArrowArray/ArrowArrayStream objects in a way that will not crash Python. This PR also includes bindings for nanoarrow's ArrowSchemaView (so that the parameters of parameterized types can be extracted) and ArrowArrayView (so that buffer types/sizes can be exported using the Python buffer protocol).

I've updated the README to showcase the extent of the bindings as implemented in this PR; several basic examples are also provided below.

Example schema usage:

import nanoarrow as na
import pyarrow as pa
schema = na.schema(pa.decimal128(10, 3))
print(schema.format)
#> d:10,3
print(schema.view().decimal_precision)
#> 10
print(schema.view().decimal_scale)
#> 3

Example Array usage:

array = na.array(pa.array(["one", "two", "three", None]))
print(array.length)
#> 4
print(array.null_count)
#> 1

import numpy as np
view = array.view()
[np.array(buffer) for buffer in view.buffers]
#> [array([7], dtype=uint8),
#>  array([ 0,  3,  6, 11, 11], dtype=int32),
#>  array([b'o', b'n', b'e', b't', b'w', b'o', b't', b'h', b'r', b'e', b'e'],
#>        dtype='|S1')]

Example ArrayStream usage:

pa_array_child = pa.array([1, 2, 3], pa.int32())
pa_array = pa.record_batch([pa_array_child], names=["some_column"])
reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])
array_stream = na.array_stream(reader)

print(array_stream.get_schema())
#> struct<some_column: int32>

for array in array_stream:
    print(array.length)
#> 3

print(array_stream.get_next() is None)
#> True

codecov-commenter · 2023-03-09T02:45:18Z

Codecov Report

Merging #117 (6f530e8) into main (c738f90) will decrease coverage by 0.01%.
The diff coverage is 87.68%.

@@            Coverage Diff             @@
##             main     #117      +/-   ##
==========================================
- Coverage   87.65%   87.64%   -0.01%     
==========================================
  Files          60       63       +3     
  Lines        9445     9788     +343     
==========================================
+ Hits         8279     8579     +300     
- Misses       1166     1209      +43

Impacted Files	Coverage Δ
src/nanoarrow/nanoarrow_types.h	`92.75% <ø> (ø)`
python/nanoarrow/_lib.pyx	`87.57% <87.57%> (ø)`
python/nanoarrow/lib.py	`88.00% <88.00%> (ø)`
python/nanoarrow/__init__.py	`100.00% <100.00%> (ø)`

... and 2 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

jorisvandenbossche

Nice! Just a few quick comments / questions

jorisvandenbossche · 2023-03-09T19:51:56Z

python/.gitignore

@@ -18,7 +18,8 @@

 src/nanoarrow/nanoarrow.c
 src/nanoarrow/nanoarrow.h
-src/nanoarrow/*.cpp
+src/nanoarrow/nanoarrow_c.pxd


Another option could be to actually include / commit this file? Although that might be annoying if a non-python related PR updates the c files, this gets out of sync (if we only update it in PRs that touch python). And in the end maybe not much different than the .c and .h files we also ignore

It is a pretty useful file to have on hand...nanoarrow.c/h are built for vendoring and there's no reason another Python package couldn't use the same .pxd (provided it's generated from the correct nanoarrow.h).

Maybe it could get checked into the dist/ folder next to the bundled nanoarrow/h?

Ah, that's also a good idea, if we are OK with having a python specific file in that directory. Then the automated bot could generate it together with the .c and .h file?

For local development, doesn't have that the same potential drawback of not using the correct one? (if you edited the c sources) Although the same is true currently for the .c and .h files, so I suppose the idea is to still keep the cmake build step to ensure you use the latest for local development?

Checking it into dist/ would be fine because users of that would just copy the whole folder into their python project and the .h/.pxd would stay together?

For local development and CI it's essential to have cmake generate the bundle (or else you could make changes in the C library that break the Python package and CI wouldn't catch it). Also important to include copy from ../dist as a fallback so that installing by URL works via pip. I tested this on Mac and Windows and Linux but it probably needs some flushing out.

jorisvandenbossche · 2023-03-09T19:53:09Z

python/bootstrap.py

+            items = [item for item in self.re_struct_delim.split(body) if item]
+
+        cython_body = f'\n{indent}    '.join([''] + items)
+        return f'{indent}cdef {type} {name}:{cython_body}'


Strictly speaking the cdef is unnecessary here (and the same for func below)

The cdef extern from .. ensures cython already assumes this

jorisvandenbossche · 2023-03-09T19:58:07Z

python/setup.py

+# Run bootstrap.py to run cmake generating a fresh bundle based on this
+# checkout or copy from ../dist if the caller doesn't have cmake available
+this_dir = os.path.dirname(__file__)
+subprocess.run([sys.executable, os.path.join(this_dir, 'bootstrap.py')])


I would maybe move the full content of bootstrap.py into setup.py, and then there is no need to call it here with a subprocess.
In the end, bootstrap.py is ran every time we build/install the project, and that is essentially the logic that typically belongs in setup.py (I also don't think there is a need to be able to call the bootstrap separately?)

Eventually (later), we will also want to have some better logic to determine if this step needs to be run or not, for example when distributing sdists (see eg https://github.com/apache/arrow-adbc/blob/923e0408fe5a32cc6501b997fafa8316ace25fe0/python/adbc_driver_manager/setup.py#L33-L50)

There's a bootstrap.R file too and I really like keep the logic separate since it's rather extensive and maybe should even be tested separately. I forgot about sdist...I made a few changes to make sure that (1) bootstrap.py wouldn't be included in sdist and (2) make sure not to error if it isn't there. Probably this will need some workshopping come 'we have to build 39348 wheels' time.

jorisvandenbossche · 2023-03-09T20:24:16Z

python/src/nanoarrow/_lib.pyx

+    def _addr(self):
+        return <uintptr_t>&self.c_array_view
+
+cdef class CSchema:


Is the reason you call it CSchema (with C prefix) because this represents a "C" Data Interface schema?

No good reason! I forget where I got the CSomething pattern (maybe pyarrow?). After starting with it I thought it might make room for a small nanoarrow.Schema class that has better autocomplete/is safer (maybe whose @propertys are things like .type, .precision, .scale, .unit, etc. instead of the lower-level .format and friends). Maybe that is also planning for something that will never exist, too 🤷

In pyarrow we have the pattern of calling the exposed C(++) names with a "C" prefix, and then the corresponding cython classes without.
So the equivalent here would be to use CArrowSchema (instead of ArrowSchema) for nanoarrow.c one (and so for the declarations in the pxd file), and then ArrowSchema for the python class here.

I assume that what you describe (CSchema and Schema) can be done in a single class as well?

You're totally right..it's all just Schema, Array, etc. now. There probably should never be anything higher level in nanoarrrow anyway!

jorisvandenbossche · 2023-03-09T20:38:00Z

python/tests/test_nanoarrow.py

+def test_schema_basic():
+    # Blank invalid schema
+    schema = na.CSchema.Empty()
+    assert(schema.is_valid() is False)


Suggested change

assert(schema.is_valid() is False)

assert schema.is_valid() is False

Styling nitpick: those brackets are not needed

(we should add black as pre-commit hook like in adbc, and that should also take care of this, I think: https://github.com/apache/arrow-adbc/blob/main/.pre-commit-config.yaml#L65)

I updated all the asserts...I might punt on setting up pre-commit-config (although it definitely should be done for more than just that!). I quite like how ADBC did it.

jorisvandenbossche · 2023-03-09T20:42:17Z

python/tests/test_nanoarrow.py

+    schema = na.CSchema.Empty()
+    pa.binary(12)._export_to_c(schema._addr())


It might be nice to add a from_pyarrow-like helper to avoid this constant pattern of creating an empty and calling export_to_c (this could also be just an internal helper for the tests if we don't want to expose that publicly (yet))

I'm going to punt on pyarrow integration just this second...for the tests it's only one extra line to _export_to_c().

jorisvandenbossche · 2023-03-09T20:44:43Z

python/src/nanoarrow/_lib.pyx

+def version():
+    return ArrowNanoarrowVersion().decode("UTF-8")
+
+cdef class CSchemaHolder:


Those "Holder" classes are only for when we "own" the pointer, and then to ensure t is properly released? (to serve as the "base" object in that case, while if you create the CSchema/CArray class from another object that owns the pointer, this other object is the _base)

Yes...it could be a capsule or this object but is also usefully the parent array (so that you can do array.children[0] and re-use the CSchema class.

jorisvandenbossche · 2023-03-09T20:46:58Z

python/src/nanoarrow/_lib.pyx

+        if result != NANOARROW_OK:
+            raise ValueError(ArrowErrorMessage(&error))
+
+        return CArrayView(holder, holder._addr(), self)


It sounds a bit strange that "validate" returns a array view object.

(I would personally also add all properties of CArrayView like length, buffers, etc to the CArray class itself)

This was a bit of a rabbit hole but you're totally right. Now you can do array.view() and schema.view(), which is a bit more intuitive since you get a SchemaView and ArrayView back. I also completed the element accessors on the array and schema classes.

jorisvandenbossche

Nice!

I think I mentioned it before, but personally I am not sure what the benefit is of the splits between the base class and the View class (eg Array and ArrayView).
I understand the reason that this separate structure exists in C nanoarrow, but here we are already creating a class wrapping the ArrowArray/Schema/Stream struct anyway. And from a python user perspective, I wouldn't really be able to explain why this is needed. For example, being able to do arr.buffers[1] is just simpler than arr.view().buffers[1] (also interactively, doing the view() makes that tab completion doesn't work nicely)

jorisvandenbossche · 2023-06-14T10:03:00Z

.github/workflows/python.yaml

+        run: |
+          # Needs editable install to run --doctest-cython
+          pip install pytest-cython
+          pip install -e python


Is this needed? (it was already installed in two steps above?)

I think it's installed without -e above (maybe I'll just add it there, and maybe also add pytest-cython to pyproject.toml while I'm at it).

Does the pytest-cython need the package to be installed in editable mode? (I would expect it to work with the installed package as well)

Locally, yes. I've found it to be rather finicky (but useful enough to play along with for now 🤷 ).

FWIW, we use this package in Arrow CI with a normal install of pyarrow (non-editable), so I think it should certainly work. What you might have noticed locally is that a test didn't pass after updating it if you didn't re-install it? (that's the reason for locally using editable, so you don't have to reinstall after every tiny change. But on CI that doesn't matter)

I'll update this...there are many things that could have been wrong with my local setup!

python/nanoarrow/_lib.pyx

jorisvandenbossche · 2023-06-14T13:00:22Z

python/nanoarrow/_lib.pyx

+        return <uintptr_t>child
+
+
+cdef class ArrayViewChildren:


Is there a need to have both ArrayChildren and ArrayViewChildren? If you want to look at the children, you could also do arr.children[0].view() instead of arr.view().children[0]? Or does those two ways give you different capabilities?

It's slightly different: ArrowArrayViewSetArray() is recursive, so [x for x in array_view.children] involves fewer calculations than [x.view() for x in array.children]. I don't know if that matters (maybe faster if you had many many columns).

jorisvandenbossche · 2023-06-14T13:05:40Z

python/nanoarrow/_lib.pyx

+    def buffers(self):
+        return tuple(<uintptr_t>self._ptr.buffers[i] for i in range(self._ptr.n_buffers))


Is this a useful attribute in practice?
(I would expect it to return essentially what self.view().buffers returns)

Possibly just philosophically aligned with what the ArrowArray actually provides. If you're trying to debug something this would tell you if one of your pointers is NULL...to get the view().buffers the array would have to be validated against a Schema.

python/nanoarrow/_lib.pyx

jorisvandenbossche · 2023-06-14T13:09:50Z

python/nanoarrow/_lib.pyx

+        self._cached_schema = Schema.allocate()
+        self._get_schema(self._cached_schema)
+
+        # Return an independent copy


This could also call the stream's get_schema once, and return here the cached version? (I would expect that get_schema return value can never change over time?)

Or is this mostly to mimic the behaviour of the ArrowArrayStream.get_schema?

I think I wanted to make sure that calling .get_schema() on the Python object definitely called .get_schema() on the C object, but also that calling .get_next() didn't call .get_schema() every single time. I don't know that the first use-case is ever going to come up, or perhaps I'm overemphasizing the "test your C code from Python" use case.

I think I found middle ground here...calling .get_next() always uses a cached schema but .get_schema() always gets you a fresh one.

python/nanoarrow/_lib.pyx

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

paleolimbot · 2023-06-15T01:20:54Z

I am not sure what the benefit is of the splits between the base class and the View class (eg Array and ArrayView).

I'm happy to workshop this more in subsequent PRs...it's a tiny bit unfortunate that there a number of overlapping names; the Array, Schema, and ArrayStream in nanoarrow are 1:1 aligned with the C data interface names (in the C library and in the R package, too). The Schema and Array in particular are poor analogues for the function of schemas and arrays elsewhere (notably the pyarrow or Arrow C++ concept of an Array). Here, it's "just" the thinnest possible wrapper around an ArrowArray*. The nanoarrow C library's ArrowArrayView functions more like an Array, as does the ArrayView here.

Notably, the ArrayView doesn't care where its buffer contents came from. The most logical way to populate its buffers is from an Array but you could also do it from serialized IPC (via tha IPC extension) or Python buffers or whatever. You can also modify the offset and length since it's not pointing to any particular ArrowArray that might be pointed to from elsewhere. Earlier versions of nanoarrow and of this PR didn't make that distinction evident and I've tried to do that here (e.g., by dropping every reference from the ArrayView class into the Array class).

There was also a note about using the src/ directory...I really did try to make it work with code coverage. The Cython coverage plugin is very finicky but given that most of this package is currently implemented in Cython, I think it's an important development tool to keep until I can figure out the proper pytest --cov invocation with the src/ directory version (or maybe somebody else can figure it out more quickly than I can).

There are definitely unfinished points here, but I do want to merge this because it (and its limited test suite) get a lot of the boilerplate out of the way so that the discussion can focus on more substantial topics when it does arise.

jorisvandenbossche · 2023-06-15T19:07:32Z

python/nanoarrow/_lib.pyx

+                self,
+                <uintptr_t>self._ptr.dictionary,
+                self._schema.dictionary,
+                None


Why is this None in this case?

The buffer shelter is None here because the child/dictionary's reference to self is sufficient to keep the buffers valid.

jorisvandenbossche · 2023-06-15T19:18:07Z

python/nanoarrow/_lib.pyx

-        self._length = array_view._array._ptr.n_buffers
+        self._length = 3


Just out of curiosity, the n_buffers member of ArrowArray is not always accurate (or not what we want here)?

It's just to remove the reference to the _array, since that no longer exists. The ArrowArrayView in C doesn't have an explicit n_buffers member (but maybe should since the shenanigan on the following lines gets used quite a bit in C as well).

Thanks for the explanation! And yes, seeing this code here, adding n_buffers to the view object might be useful.

Follow-up on #117, moving back to a src layout (https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/)

paleolimbot mentioned this pull request Mar 8, 2023

PDEP-9: Allow third-party projects to register pandas connectors with a standard API pandas-dev/pandas#51799

Merged

paleolimbot force-pushed the python-tidbits branch from 3d94cdd to 8f888a9 Compare March 8, 2023 19:41

paleolimbot force-pushed the python-tidbits branch from cc8cc97 to e313f28 Compare March 9, 2023 15:51

paleolimbot marked this pull request as ready for review March 9, 2023 16:08

paleolimbot changed the title ~~feat(python): Python schema wrapper~~ feat(python): Python schema, array, and array view skeleton Mar 9, 2023

jorisvandenbossche reviewed Mar 9, 2023

View reviewed changes

paleolimbot and others added 23 commits June 9, 2023 16:27

maybe schema class

879d75e

remove pxds

cff939d

generate the nanoarrow pxd file

73eb934

completely invalid but working towards ideal setup.py

6153916

still invalid but better setup.py

1637afe

actually working bootstrap setup

1b38e9e

fix indentation

ca191e7

add some typedefs

891afb1

no void()

26581a5

try without namespace

8531234

better functionr regex

1bb020f

move some bits from the other repo

368ecbe

with recursiveness

5d0a500

passing buffer protocol test

a83e2e9

maybe fix errors

18f1a20

don't run bootstrap when installing from sdist

0cc14b8

make sure we can install from sdist

9e72b96

remove redundant cdefs

2cd5599

fix asserts

0144afe

maybe safer version check

b768724

drop C class prefix, drop numpy dependency, pythonize quotes

846c837

more complete schema field wrapping

ac304af

test a few more array view things

05d4a81

paleolimbot added 14 commits June 10, 2023 23:22

some helpers

97df28e

rename version

d7fd6e1

start on stream

5c02647

some array stream madness

ed5e545

some tidying

1321545

more documentation

c1c2df2

empty -> allocate

ee4dbb6

in theory use nanoarrowexcpetion

4bb8f86

attempt starting coverage

1d3d8c1

attempt starting coverage again

3d03cc6

remove coverage file

3413af1

add coverage to python job

4f8cef2

fix + test doctests from Cython

f6be55c

basic readme

ce3463f

jorisvandenbossche reviewed Jun 14, 2023

View reviewed changes

python/nanoarrow/_lib.pyx Outdated Show resolved Hide resolved

paleolimbot and others added 7 commits June 14, 2023 11:12

Update python/nanoarrow/_lib.pyx

78cd797

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

use the namespace

dc7d91c

use cinit instead of init

a1150dd

clean up Cython

b73804c

use StopIteration

a5d4479

more clear distinction from the array--array view interaction

6a82504

fix doctest

6f530e8

paleolimbot merged commit 2f05e99 into apache:main Jun 15, 2023

paleolimbot deleted the python-tidbits branch June 15, 2023 01:23

jorisvandenbossche reviewed Jun 15, 2023

View reviewed changes

jorisvandenbossche mentioned this pull request Jun 15, 2023

build(python): use src/ layout for python package #234

Merged

paleolimbot pushed a commit that referenced this pull request Jun 16, 2023

chore(python): Use src/ layout for python package (#234)

f92eb6a

Follow-up on #117, moving back to a src layout (https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/)

	assert(schema.is_valid() is False)
	assert schema.is_valid() is False

		schema = na.CSchema.Empty()
		pa.binary(12)._export_to_c(schema._addr())

		def buffers(self):
		return tuple(<uintptr_t>self._ptr.buffers[i] for i in range(self._ptr.n_buffers))

		self._length = array_view._array._ptr.n_buffers
		self._length = 3

feat(python): Python schema, array, and array view skeleton #117

feat(python): Python schema, array, and array view skeleton #117

Conversation

paleolimbot commented Feb 23, 2023 • edited Loading

codecov-commenter commented Mar 9, 2023 • edited Loading

Codecov Report

jorisvandenbossche left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

paleolimbot commented Jun 15, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

paleolimbot commented Feb 23, 2023 •

edited

Loading

codecov-commenter commented Mar 9, 2023 •

edited

Loading

jorisvandenbossche left a comment •

edited

Loading