Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Add visitor pattern + builders for column sequences #454

Merged
merged 23 commits into from
May 13, 2024

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented May 4, 2024

Assembling columns from chunked things is rather difficult to do and is a valid thing that somebody might want to assemble from Arrow data. This PR adds a "visitor" pattern that can be extended to build "column"s, which are currently just list()s. Before trimming down this PR to a managable set of changes, I also implemented the visitor that concatenates data buffers for single data buffer types ( https://gist.github.com/paleolimbot/17263e38b5d97c770e44d33b11181eaf ), which will be needed for to_columns() to be used in any kind of serious way.

To support the "visitor" pattern, I moved some of the PyIterator-specific pieces into the PyIterator so that the visitor can re-use the relevant pieces of ArrayViewBaseIterator. This pattern also solves one of the problems I had when attempting a "repr" iterator, which is that I was trying to build something rather than iterate over it.

import nanoarrow as na
import pandas as pd
from nanoarrow import visitor

url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
array = na.ArrayStream.from_url(url).read_all()

# to_columns() doesn't (and won't) produce anything numpy or pandas-related
names, columns = visitor.to_columns(array)

# ..but lets data frames be built rather compactly
pd.DataFrame({k: v for k, v in zip(names, columns)})

@paleolimbot paleolimbot marked this pull request as ready for review May 6, 2024 15:46
@paleolimbot paleolimbot force-pushed the python-array-round2 branch 2 times, most recently from d7ffb77 to 6e5d45d Compare May 8, 2024 20:15
Copy link
Member

@danepitkin danepitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

iterator._set_array(array)
yield from iterator

def __init__(self, schema, *, _array_view=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: _array_view -> array_view as a param name (also applies to the base class)

from nanoarrow.schema import Type


def to_pylist(obj, schema=None) -> List:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my personal preference would be to have to_pylist and to_columns be class APIs instead of helper functions. e.g.

    >>> import nanoarrow as na
    >>> array = na.c_array([1, 2, 3], na.int32())
    >>> array.to_pylist()

You could put these two apis in an abstract base class and add the base class to other classes. WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably all the visitors and iterators would be nice in an abstract base class shared between the Array and ArrayStream! I added the user-facing bit here to ensure that na.Array(...).to_pylist() is what users type and if there comes a time we have to start adding documentation in multiple places (very possible with to_columns()) we can revisit the base class.

chunks have been visited. If the total number of elements
(i.e., the sum of all chunk lengths) is known, it is provided here.
"""
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe raise NotImplementedError() here? Alternatively, you could also look into https://docs.python.org/3/library/abc.html and abstract methods.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is rare, but valid, that an implementation of a visitor does not need to do anything at one or more of begin, visit, and finish() (e.g., print chunks, time the consumption process). If/when this is public it's definitely worth revisiting how best to communicate this (e.g., maybe force implementations to explicitly do nothing for clarity).



class ListBuilder(ArrayStreamVisitor):
def __init__(self, schema, *, iterator_cls=PyIterator, _array_view=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: _array_view -> array_view in ListBuilder/ColumnsBuilder

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! I'd used _array_view because it's sort of internal; however, the whole API is currently internal. If/when it's made public there should perhaps be a better system for instantiating/efficiently reusing child array views that does not stick out like this does.



def test_to_pylist():
assert visitor.to_pylist([1, 2, 3], na.int32()) == [1, 2, 3]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like an odd test because the input and output are technically both pylists. Would it be better to build an array and use that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call!

Copy link
Member

@danepitkin danepitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice updates!

@paleolimbot paleolimbot merged commit 490b980 into apache:main May 13, 2024
6 checks passed
@paleolimbot paleolimbot deleted the python-array-round2 branch May 13, 2024 14:59
"""Convert this Array to a ``list()` of sequences

Converts a stream of struct arrays into its column-wise representation
such that each column is either a contiguous buffer or a ``list()``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the "or" here. When would it be a contiguous buffer and when a list? (my interpretation is that it is always a python list?)

And would it make more sense to not yet convert to a list? (the user can always call to_pylist on the resulting column values if they want a list) Because right now you cannot use this method if you don't want the python list for the actual values, but still want a list of columns?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #464 ! (I think I copied the help text from a previous PR where I'd implemented both before splitting it up 😬 )

Comment on lines +29 to +30
Computes an identical value to ``list(iterator.iter_py())`` but is several
times faster.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is actually the reason that it is multiple times faster? As I would expect it uses the same iteration code under the hood?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea! "Several times" is probably too strong here. The iteration code is slightly different, though (out_list.extend(array_iter) in a loop vs list(yield from array_iter).

import nanoarrow as na
import numpy as np

big = na.Array(np.random.random(int(1e6)))
%timeit big.to_pylist()
#> 19.8 ms ± 586 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(big.iter_py())
#> 50.6 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code, the main difference seems to be that the faster one reuses the array_view argument to the iterator constructor? (although that shouldn't make any difference for a case with one chunk like the above?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that they both reuse the array_view...it seems to be yield from but I'm not an expert.

import numpy as np

list_of_lists = []
for i in range(1000):
    list_of_lists.append(list(np.random.random(1000)))

def iter_all():
    for item in list_of_lists:
        yield from item
    
def extend_all():
    out = []
    for item in list_of_lists:
        out.extend(item)
    return out
        
assert list(iter_all()) == extend_all()

%timeit list(iter_all())
#> 34.5 ms ± 522 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit extend_all()
#> 4.35 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@paleolimbot paleolimbot added this to the nanoarrow 0.5.0 milestone May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants