feat(python): Add visitor pattern + builders for column sequences #454

paleolimbot · 2024-05-04T03:14:14Z

Assembling columns from chunked Arrow data is rather difficult to do by hand, yet it is a perfectly valid thing for somebody to want. This PR adds a "visitor" pattern that can be extended to build "columns", which are currently just list()s. Before trimming this PR down to a manageable set of changes, I also implemented a visitor that concatenates data buffers for types with a single data buffer ( https://gist.github.com/paleolimbot/17263e38b5d97c770e44d33b11181eaf ), which will be needed before to_columns() can be used in any serious way.

To support the "visitor" pattern, I moved some of the PyIterator-specific pieces into the PyIterator so that the visitor can re-use the relevant pieces of ArrayViewBaseIterator. This pattern also solves one of the problems I had when attempting a "repr" iterator, which is that I was trying to build something rather than iterate over it.

import nanoarrow as na
import pandas as pd
from nanoarrow import visitor

url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
array = na.ArrayStream.from_url(url).read_all()

# to_columns() doesn't (and won't) produce anything numpy or pandas-related
names, columns = visitor.to_columns(array)

# ..but lets data frames be built rather compactly
pd.DataFrame({k: v for k, v in zip(names, columns)})
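
For readers unfamiliar with the pattern, the idea of "building something rather than iterating over it" can be sketched without nanoarrow at all. The names `ChunkVisitor`, `ListBuilderSketch`, and `visit_all` below are hypothetical stand-ins, not the classes added in this PR, and plain Python lists stand in for Arrow chunks.

```python
from typing import Any, Iterable, List, Optional


class ChunkVisitor:
    """Hypothetical base class (an assumption, not nanoarrow's API).

    The driver calls begin() once, visit_chunk() once per chunk,
    and finish() once at the end to obtain the result.
    """

    def begin(self, total_elements: Optional[int] = None) -> None:
        pass

    def visit_chunk(self, chunk: List[Any]) -> None:
        pass

    def finish(self) -> Any:
        return None


class ListBuilderSketch(ChunkVisitor):
    """Accumulates every element of every chunk into one list."""

    def begin(self, total_elements: Optional[int] = None) -> None:
        self._out: List[Any] = []

    def visit_chunk(self, chunk: List[Any]) -> None:
        # extend() appends the whole chunk in one call
        self._out.extend(chunk)

    def finish(self) -> List[Any]:
        return self._out


def visit_all(visitor: ChunkVisitor, chunks: Iterable[List[Any]]) -> Any:
    """Hypothetical driver: push each chunk through the visitor."""
    visitor.begin()
    for chunk in chunks:
        visitor.visit_chunk(chunk)
    return visitor.finish()


result = visit_all(ListBuilderSketch(), [[1, 2], [3], [4, 5, 6]])
print(result)
```

Because the visitor owns the output, a different subclass could build something other than a list (for example, a concatenated buffer) without changing the driver.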

danepitkin

Nice work!

danepitkin · 2024-05-10T19:53:54Z

python/src/nanoarrow/iterator.py

+                iterator._set_array(array)
+                yield from iterator
+
+    def __init__(self, schema, *, _array_view=None):


nit: _array_view -> array_view as a param name (also applies to the base class)

danepitkin · 2024-05-10T20:04:21Z

python/src/nanoarrow/visitor.py

+from nanoarrow.schema import Type
+
+
+def to_pylist(obj, schema=None) -> List:


I think my personal preference would be to have to_pylist and to_columns be class APIs instead of helper functions. e.g.

>>> import nanoarrow as na
>>> array = na.c_array([1, 2, 3], na.int32())
>>> array.to_pylist()

You could put these two APIs in an abstract base class and add the base class to other classes. WDYT?

paleolimbot · 2024-05-13

Probably all the visitors and iterators would be nice in an abstract base class shared between the Array and ArrayStream! I added the user-facing bit here to ensure that na.Array(...).to_pylist() is what users type. If there comes a time when we have to start adding documentation in multiple places (very possible with to_columns()), we can revisit the base class.
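
A rough sketch of the shared-base-class idea discussed above, under the assumption that each class can expose its chunks through a single method. `ToPyListMixin`, `FakeArray`, `FakeStream`, and `iter_chunks()` are hypothetical names used only for illustration; they are not part of nanoarrow.

```python
from typing import Any, Iterable, Iterator, List


class ToPyListMixin:
    """Hypothetical mixin: subclasses only provide iter_chunks()."""

    def iter_chunks(self) -> Iterator[List[Any]]:
        raise NotImplementedError()

    def to_pylist(self) -> List[Any]:
        # Documented once here, inherited by every class using the mixin
        out: List[Any] = []
        for chunk in self.iter_chunks():
            out.extend(chunk)
        return out


class FakeArray(ToPyListMixin):
    """Stand-in for an in-memory, possibly chunked array."""

    def __init__(self, chunks: Iterable[Iterable[Any]]):
        self._chunks = [list(c) for c in chunks]

    def iter_chunks(self) -> Iterator[List[Any]]:
        return iter(self._chunks)


class FakeStream(ToPyListMixin):
    """Stand-in for a one-shot stream of chunks."""

    def __init__(self, chunk_iterable: Iterable[Iterable[Any]]):
        self._chunk_iterable = chunk_iterable

    def iter_chunks(self) -> Iterator[List[Any]]:
        for chunk in self._chunk_iterable:
            yield list(chunk)


array_values = FakeArray([[1, 2], [3]]).to_pylist()
stream_values = FakeStream(iter([(1,), (2, 3)])).to_pylist()
print(array_values, stream_values)
```

The trade-off raised in the thread still applies: a mixin keeps the documentation in one place, while plain methods on each class keep the user-facing API explicit.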

danepitkin · 2024-05-10T20:13:25Z

python/src/nanoarrow/visitor.py

+        chunks have been visited. If the total number of elements
+        (i.e., the sum of all chunk lengths) is known, it is provided here.
+        """
+        pass


Maybe raise NotImplementedError() here? Alternatively, you could also look into https://docs.python.org/3/library/abc.html and abstract methods.

paleolimbot · 2024-05-13

It is rare, but valid, for a visitor implementation to need no action in one or more of begin, visit, and finish() (e.g., a visitor that prints chunks or times the consumption process). If/when this is public, it's definitely worth revisiting how best to communicate this (e.g., maybe force implementations to explicitly do nothing, for clarity).
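
The two options in this thread can be compared side by side in a small, library-free sketch. All class names and the `visit_chunk` method name here are hypothetical and chosen for illustration; only the `abc` module usage is standard Python.

```python
from abc import ABC, abstractmethod


class NoOpVisitor:
    """Option 1 (assumed shape): every hook defaults to doing nothing."""

    def begin(self, total_elements=None):
        pass

    def visit_chunk(self, chunk):
        pass

    def finish(self):
        return None


class ChunkCounter(NoOpVisitor):
    """Only overrides the hooks it needs; begin() stays a no-op."""

    def __init__(self):
        self.n_chunks = 0

    def visit_chunk(self, chunk):
        self.n_chunks += 1

    def finish(self):
        return self.n_chunks


class StrictVisitor(ABC):
    """Option 2: abstract methods force every hook to be spelled out."""

    @abstractmethod
    def begin(self, total_elements=None): ...

    @abstractmethod
    def visit_chunk(self, chunk): ...

    @abstractmethod
    def finish(self): ...


class IncompleteVisitor(StrictVisitor):
    # begin() and finish() are missing, so instantiation fails
    def visit_chunk(self, chunk):
        pass


def drive(visitor, chunks):
    visitor.begin()
    for chunk in chunks:
        visitor.visit_chunk(chunk)
    return visitor.finish()


n_chunks = drive(ChunkCounter(), [[1, 2], [3], []])

try:
    IncompleteVisitor()
except TypeError as exc:
    strict_error = exc
else:
    strict_error = None

print(n_chunks, type(strict_error).__name__)
```

With no-op defaults, a partial visitor such as `ChunkCounter` stays short; with abstract methods, a forgotten hook is caught at instantiation time at the cost of writing explicit empty methods.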

danepitkin · 2024-05-10T20:14:39Z

python/src/nanoarrow/visitor.py

+
+
+class ListBuilder(ArrayStreamVisitor):
+    def __init__(self, schema, *, iterator_cls=PyIterator, _array_view=None):


nit: _array_view -> array_view in ListBuilder/ColumnsBuilder

paleolimbot · 2024-05-13

Good call! I'd used _array_view because it's sort of internal; however, the whole API is currently internal. If/when it's made public, there should perhaps be a better system for instantiating and efficiently reusing child array views that does not stick out like this does.

danepitkin · 2024-05-10T20:21:58Z

python/tests/test_visitor.py

+
+
+def test_to_pylist():
+    assert visitor.to_pylist([1, 2, 3], na.int32()) == [1, 2, 3]


This feels like an odd test because the input and output are technically both pylists. Would it be better to build an array and use that?

danepitkin

LGTM! Nice updates!

jorisvandenbossche · 2024-05-14T12:48:05Z

python/src/nanoarrow/array.py

+        """Convert this Array to a ``list()` of sequences
+
+        Converts a stream of struct arrays into its column-wise representation
+        such that each column is either a contiguous buffer or a ``list()``.


I don't fully understand the "or" here. When would a column be a contiguous buffer and when a list? (My interpretation is that it is always a Python list.)

And would it make more sense not to convert to a list yet? The user can always call to_pylist on the resulting column values if they want a list. As it stands, you cannot use this method if you want a list of columns but don't want the actual values converted to Python lists.

paleolimbot · 2024-05-14

See #464! (I think I copied the help text from a previous PR where I'd implemented both before splitting it up 😬)

jorisvandenbossche · 2024-05-14T13:03:37Z

python/src/nanoarrow/visitor.py

+    Computes an identical value to ``list(iterator.iter_py())`` but is several
+    times faster.


What is actually the reason that it is multiple times faster? As I would expect it uses the same iteration code under the hood?

paleolimbot · 2024-05-14

I have no idea! "Several times" is probably too strong here. The iteration code is slightly different, though: out_list.extend(array_iter) in a loop vs. list() over a generator that does yield from array_iter.

import nanoarrow as na
import numpy as np

big = na.Array(np.random.random(int(1e6)))

%timeit big.to_pylist()
#> 19.8 ms ± 586 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit list(big.iter_py())
#> 50.6 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

jorisvandenbossche · 2024-05-14

Looking at the code, the main difference seems to be that the faster one reuses the array_view argument to the iterator constructor? (Although that shouldn't make any difference for a case with one chunk like the above.)

paleolimbot · 2024-05-14

I believe that they both reuse the array_view... The difference seems to come from yield from, but I'm not an expert.

import numpy as np

list_of_lists = []
for i in range(1000):
    list_of_lists.append(list(np.random.random(1000)))

def iter_all():
    for item in list_of_lists:
        yield from item

def extend_all():
    out = []
    for item in list_of_lists:
        out.extend(item)
    return out

assert list(iter_all()) == extend_all()

%timeit list(iter_all())
#> 34.5 ms ± 522 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit extend_all()
#> 4.35 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

paleolimbot marked this pull request as ready for review May 6, 2024 15:46

paleolimbot force-pushed the python-array-round2 branch 2 times, most recently from d7ffb77 to 6e5d45d Compare May 8, 2024 20:15

danepitkin reviewed May 10, 2024

paleolimbot and others added 22 commits May 13, 2024 10:39

262d695 add visitor
d895c53 move the visitor
12568c4 more visitor things
3c5750b better
c80a10c move recursiveness to PyIterator
b76ec61 progress
dc40e32 better visitor
a1f5b21 test buffer concatenator
207f35f test unpacked bitmap
828cb5a more testing
221eaee with tests
12978f4 fix columns
14a4cc1 format
b27eb26 undo lib change
a99c3bb simplify
741556e simplify even more
6c0a92b document
5c83081 fix test
eb3b733 fix for merge
d0284e1 _array_view -> array_view
0ae3e9a add methods
ac9027f fix test

paleolimbot force-pushed the python-array-round2 branch from cd4c4cb to ac9027f Compare May 13, 2024 13:40

fccbde5 test array stream methods

danepitkin approved these changes May 13, 2024

paleolimbot merged commit 490b980 into apache:main May 13, 2024
6 checks passed

paleolimbot deleted the python-array-round2 branch May 13, 2024 14:59

jorisvandenbossche reviewed May 14, 2024

paleolimbot added this to the nanoarrow 0.5.0 milestone May 22, 2024
