-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(python): Add visitor pattern + builders for column sequences #454
Conversation
d7ffb77
to
6e5d45d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
python/src/nanoarrow/iterator.py
Outdated
iterator._set_array(array) | ||
yield from iterator | ||
|
||
def __init__(self, schema, *, _array_view=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: _array_view
-> array_view
as a param name (also applies to the base class)
from nanoarrow.schema import Type | ||
|
||
|
||
def to_pylist(obj, schema=None) -> List: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my personal preference would be to have to_pylist
and to_columns
be class APIs instead of helper functions. e.g.
>>> import nanoarrow as na
>>> array = na.c_array([1, 2, 3], na.int32())
>>> array.to_pylist()
You could put these two apis in an abstract base class and add the base class to other classes. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably all the visitors and iterators would be nice in an abstract base class shared between the Array
and ArrayStream
! I added the user-facing bit here to ensure that na.Array(...).to_pylist()
is what users type and if there comes a time we have to start adding documentation in multiple places (very possible with to_columns()
) we can revisit the base class.
chunks have been visited. If the total number of elements | ||
(i.e., the sum of all chunk lengths) is known, it is provided here. | ||
""" | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe raise NotImplementedError()
here? Alternatively, you could also look into https://docs.python.org/3/library/abc.html and abstract methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is rare, but valid, that an implementation of a visitor does not need to do anything at one or more of begin
, visit
, and finish()
(e.g., print chunks, time the consumption process). If/when this is public it's definitely worth revisiting how best to communicate this (e.g., maybe force implementations to explicitly do nothing for clarity).
python/src/nanoarrow/visitor.py
Outdated
|
||
|
||
class ListBuilder(ArrayStreamVisitor): | ||
def __init__(self, schema, *, iterator_cls=PyIterator, _array_view=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: _array_view
-> array_view
in ListBuilder/ColumnsBuilder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call! I'd used _array_view
because it's sort of internal; however, the whole API is currently internal. If/when it's made public there should perhaps be a better system for instantiating/efficiently reusing child array views that does not stick out like this does.
python/tests/test_visitor.py
Outdated
|
||
|
||
def test_to_pylist(): | ||
assert visitor.to_pylist([1, 2, 3], na.int32()) == [1, 2, 3] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like an odd test because the input and output are technically both pylists. Would it be better to build an array and use that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call!
cd4c4cb
to
ac9027f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Nice updates!
"""Convert this Array to a ``list()` of sequences | ||
|
||
Converts a stream of struct arrays into its column-wise representation | ||
such that each column is either a contiguous buffer or a ``list()``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand the "or" here. When would it be a contiguous buffer and when a list? (my interpretation is that it is always a python list?)
And would it make more sense to not yet convert to a list? (the user can always call to_pylist
on the resulting column values if they want a list) Because right now you cannot use this method if you don't want the python list for the actual values, but still want a list of columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #464 ! (I think I copied the help text from a previous PR where I'd implemented both before splitting it up 😬 )
Computes an identical value to ``list(iterator.iter_py())`` but is several | ||
times faster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is actually the reason that it is multiple times faster? As I would expect it uses the same iteration code under the hood?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea! "Several times" is probably too strong here. The iteration code is slightly different, though (out_list.extend(array_iter) in a loop
vs list(yield from array_iter)
.
import nanoarrow as na
import numpy as np
big = na.Array(np.random.random(int(1e6)))
%timeit big.to_pylist()
#> 19.8 ms ± 586 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(big.iter_py())
#> 50.6 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code, the main difference seems to be that the faster one reuses the array_view
argument to the iterator constructor? (although that shouldn't make any difference for a case with one chunk like the above?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that they both reuse the array_view...it seems to be yield from
but I'm not an expert.
import numpy as np
list_of_lists = []
for i in range(1000):
list_of_lists.append(list(np.random.random(1000)))
def iter_all():
for item in list_of_lists:
yield from item
def extend_all():
out = []
for item in list_of_lists:
out.extend(item)
return out
assert list(iter_all()) == extend_all()
%timeit list(iter_all())
#> 34.5 ms ± 522 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit extend_all()
#> 4.35 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Assembling columns from chunked things is rather difficult to do and is a valid thing that somebody might want to assemble from Arrow data. This PR adds a "visitor" pattern that can be extended to build "column"s, which are currently just
list()
s. Before trimming down this PR to a managable set of changes, I also implemented the visitor that concatenates data buffers for single data buffer types ( https://gist.github.com/paleolimbot/17263e38b5d97c770e44d33b11181eaf ), which will be needed forto_columns()
to be used in any kind of serious way.To support the "visitor" pattern, I moved some of the
PyIterator
-specific pieces into thePyIterator
so that the visitor can re-use the relevant pieces ofArrayViewBaseIterator
. This pattern also solves one of the problems I had when attempting a "repr" iterator, which is that I was trying to build something rather than iterate over it.