feat(python): Add column-wise buffer builder #464
Conversation
A few quick comments!
python/src/nanoarrow/visitor.py
Outdated
```diff
@@ -49,7 +50,7 @@ def to_pylist(obj, schema=None) -> List:
     return ListBuilder.visit(obj, schema)


-def to_columns(obj, schema=None) -> Tuple[List[str], List[Sequence]]:
+def to_columns(obj, schema=None, handle_nulls=None) -> Tuple[List[str], List[Sequence]]:
     """Convert ``obj`` to a ``list()`` of sequences

     Converts a stream of struct arrays into its column-wise representation
```
For the line below: can you clarify when you get a buffer and when a list?
python/src/nanoarrow/visitor.py
Outdated
```python
    return handle


def nulls_as_masked_array():
```
I wonder if it would be more useful to actually return each array as a tuple of data and mask (the numpy masked array isn't used that much)
Good to know! I'll remove that then 🙂
I wonder if it would be more useful to actually return each array as a tuple of data and mask
Currently that would be `nulls_debug()`, mostly because I couldn't think of a good name (in a previous version it was `nulls_as_mask_and_data()`). Returning a tuple here sort of breaks the "every column is a sequence" guarantee. There could also be a `MaskedSequence` class that wraps the mask and data, but that seems like opening a can of worms.
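For what it's worth, a minimal sketch of what such a hypothetical `MaskedSequence` wrapper could look like (the name and semantics are illustrative only, not part of this PR):

```python
from collections.abc import Sequence


class MaskedSequence(Sequence):
    """Illustrative wrapper that keeps the 'every column is a sequence'
    guarantee while still exposing the raw mask and data separately."""

    def __init__(self, data, valid):
        self._data = data
        self._valid = valid  # True where the element is non-null

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        # Sketch: integer indexing only; a real implementation would
        # also need to handle slices.
        return self._data[i] if self._valid[i] else None

    @property
    def data(self):
        return self._data

    @property
    def valid(self):
        return self._valid
```

With this, `list(MaskedSequence([1, 2, 3], [True, False, True]))` gives `[1, None, 3]`, while `.data`/`.valid` still expose the raw pieces.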
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Very cool! The column builder code is quite nuanced; I need to do another pass over it.
```diff
 from nanoarrow.iterator import ArrayViewBaseIterator, PyIterator
 from nanoarrow.schema import Type


-def to_pylist(obj, schema=None) -> List:
-    """Convert ``obj`` to a ``list()`` of Python objects
+class ArrayViewVisitable:
```
Awesome, love this!
python/src/nanoarrow/visitor.py
Outdated
```python
>>> import nanoarrow as na
>>> import pyarrow as pa
>>> batch = pa.record_batch([pa.array([1, 2, 3])], names=["col1"])
```
nit: can we use two columns in the docstrings? I think it would better show the functionality.
python/src/nanoarrow/visitor.py
Outdated
```python
    return handle


def nulls_debug() -> Callable[[CBuffer, Sequence], Tuple[CBuffer, Sequence]]:
```
Should we mark this as an internal API? e.g. `_nulls_debug()`
I changed the name to `nulls_separate()` and exported it...from @jorisvandenbossche's comment it sounds like it would be useful (for the situation where the caller wants to handle nulls completely on their own).
`nulls_separate` sounds in any case like a better name! (and yes, I think it would be useful)
python/src/nanoarrow/visitor.py
Outdated
```python
    def begin(self, total_elements: Union[int, None] = None) -> None:
        for child_visitor in self._child_visitors:
            child_visitor.begin(total_elements)

    def visit_chunk_view(self, array_view: CArrayView) -> Any:
        if array_view.null_count > 0:
            raise ValueError("null_count > 0 encountered in ColumnsBuilder")
```
Why is this an error?
If there are any nulls here the column results will be bogus! (We have no way to propagate top-level nulls into the child columns.) At the "record batch" level nulls don't exist: typically record batches are exported as an explicitly non-nullable struct; however, at least one implementation doesn't do this, so we have to check for actual nulls while iterating instead of erroring when we see the schema.
Can you add a brief comment about this in the code?
```python
# Infer null count == 0 because of null data buffer
array = na.c_array_from_buffers(
    na.int32(), 3, (None, na.c_buffer([1, 2, 3], na.int32())), null_count=-1
)
```
Why do we need to add `null_count=-1` in these tests?
I added comments to each of these... `-1` is the sentinel for "not computed yet". Technically it's the default, but here it's exactly what we're testing.
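As a pure-Python illustration of the sentinel semantics (a sketch, not nanoarrow's actual implementation): `-1` means the null count hasn't been computed yet, a missing validity buffer implies zero nulls, and otherwise the count can be derived from the validity bitmap:

```python
UNKNOWN_NULL_COUNT = -1  # sentinel: "not computed yet"


def resolve_null_count(null_count, validity_bitmap, length):
    """Sketch of lazily resolving a null count from a validity bitmap."""
    if null_count != UNKNOWN_NULL_COUNT:
        # Already computed (or explicitly provided)
        return null_count
    # No validity buffer allocated -> no nulls by definition
    if validity_bitmap is None:
        return 0
    # Count valid (set) bits among the first `length` bits, LSB-first
    valid = sum((validity_bitmap[i // 8] >> (i % 8)) & 1 for i in range(length))
    return length - valid
```

For example, `resolve_null_count(-1, None, 3)` infers `0`, while `resolve_null_count(-1, b"\x05", 3)` (bits 0 and 2 set) infers `1`.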
First set of comments. And thanks for the nice docstrings!
python/src/nanoarrow/visitor.py
Outdated
```
handle_nulls : callable
    A function returning a sequence based on a validity bytemap and a
    contiguous buffer of values (e.g., the callable returned by
    :meth:`nulls_as_sentinel`).
```
Mention that the default is `nulls_forbid()`?
And maybe also list the built-in options
```
Converts a stream of arrays into a columnar representation
such that each column is either a contiguous buffer or a ``list()``.
Integer, float, and interval arrays are currently converted to their
```
What is the reason interval arrays are returned as buffer? (just because there is no obvious python object?)
And why are other primitive fixed-width types like timestamp not returned as buffers? (because they are not a c primitive type? And have an obvious python object to use?)
I can imagine that for certain purposes, you would actually want the integers behind a timestamp column, because that is much cheaper to work with than a list of python datetime.datetime objects.
Should we make this user configurable?
What is the reason interval arrays are returned as buffer? (just because there is no obvious python object?)
The technical answer is that `schema_view.buffer_format` returns `"iiq"` (i.e., not `None`)... I never really looked at what pyarrow does here, but I see now that it returns a named tuple when converted to a list. The last time I tried `to_numpy()` on a pyarrow interval I got a crash ( apache/arrow#41326 ). I'm pretty happy to do anything here (or make a breaking change later to handle it properly).
And why are other primitive fixed-width types like timestamp not returned as buffers?
If just the storage were returned it would be lossy (i.e., `pd.Series()` would do the wrong thing). There's no way to communicate what to do here without invoking numpy- or pandas-specific logic, so a list of Python objects is maybe a safer default.
Should we make this user configurable?
Totally (and also make Python object conversion configurable), but I'm not sure exactly how to do that yet. If somebody wanted to do this today, I'd suggest subclassing the visitor:

```python
import nanoarrow as na
import pyarrow as pa
from datetime import datetime
from nanoarrow.visitor import ColumnsBuilder, NullableColumnBuilder


class CustomColumnsBuilder(ColumnsBuilder):
    def _resolve_child_visitor(self, child_schema, child_array_view, handle_nulls):
        if na.Schema(child_schema).type == na.Type.TIMESTAMP:
            return NullableColumnBuilder(na.int64(), handle_nulls=handle_nulls)
        else:
            return super()._resolve_child_visitor(
                child_schema, child_array_view, handle_nulls
            )


batch = pa.record_batch({"ts": [datetime.now()]})
CustomColumnsBuilder.visit(batch)
#> (['ts'], [nanoarrow.c_lib.CBuffer(int64[8 b] 1715862463542238)])
```
python/src/nanoarrow/visitor.py
Outdated
```
schema : schema-like, optional
    An optional schema, passed to :func:`c_array_stream`.
```
This is not actually present in the signature?
python/src/nanoarrow/visitor.py
Outdated
```python
if len(is_valid) > 0:
    raise ValueError("Null present with null_handler=nulls_forbid()")
```
The presence of a validity buffer doesn't necessarily mean it includes any nulls? Or did you ensure that before? (e.g. by not converting the bitmap to a bytemap if the bitmap is all set?)
Yes...the bool buffer indicating validity is never allocated unless a null is actually present (and is `None` otherwise). I added this to the `handle_nulls` documentation!
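A pure-Python mock of that contract (assuming, as described above, that the validity bytemap is only non-empty when at least one null is actually present):

```python
def nulls_forbid():
    """Mock null handler mirroring the check quoted above: because the
    validity bytemap is only non-empty when a null is present, its mere
    presence is enough to raise."""

    def handle(is_valid, data):
        if len(is_valid) > 0:
            raise ValueError("Null present with null_handler=nulls_forbid()")
        return data

    return handle
```

Here `nulls_forbid()(b"", [1, 2, 3])` returns the data unchanged, while any non-empty bytemap raises.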
python/src/nanoarrow/visitor.py
Outdated
```
The default sentinel value will result in ``nan`` assigned to null
values in numeric and boolean outputs.
```
I am not entirely sure this is the case for boolean? (setting None into a boolean array sets False)
Ah, but you ensure to convert int/bool to float in case there are nulls, so that setting as NaN indeed works correctly. Maybe clarify that here
Yes, it's a bit of a side effect of how numpy handles `None` with `result_type()`. I added a comment clarifying this.

```python
import numpy as np

sentinel = None
arr = np.array([True, False, True, True], np.result_type(sentinel, np.bool_))
arr[2] = sentinel
arr
#> array([ 1.,  0., nan,  1.])
```
python/src/nanoarrow/visitor.py
Outdated
```python
>>> import nanoarrow as na
>>> na.Array([1, 2, 3], na.int32()).to_column(na.nulls_separate())
(nanoarrow.c_lib.CBuffer(uint8[0 b] ), nanoarrow.c_lib.CBuffer(int32[12 b] 1 2 3))
```
Should we return `None` here instead of an empty buffer? Because if you convert both outputs to numpy arrays, you can end up with numpy arrays of different lengths (of course, checking for None isn't much harder than checking for an empty/length-0 buffer).
Yes! I updated this to return `None`, which is less ambiguous. One could also update this such that `handle_nulls()` is never called unless there are actually nulls?
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
python/src/nanoarrow/visitor.py
Outdated
```diff
-    def finish(self) -> List:
-        return self._lst
+    def finish(self) -> Any:
+        return self._visitor.finish()


 class ColumnsBuilder(ArrayStreamVisitor):
```
FWIW, and I don't directly have a better idea, but I found the usage of "columns" somewhat confusing here (and also "Builder", because that sounds too similar to the builders used to build arrays/buffers in the python->arrow direction). Naming is hard .. ;)
It's been through a few iterations of naming 🙂 . Maybe `sequence` is better than `column`? The gist of it is "something that either numpy or pandas can understand" and/or "we dealt with all the hard parts of a non-contiguous or non-zero-offset array so you didn't have to". I could stick with `Visitor` instead of `Builder`, too (but these are internal for now, so at least that part can get renamed).
I'll add that I also had a bit of trouble comprehending the naming. I think it might be worthwhile to rename "column" to something else. Technically, it's just a contiguous array, but `array` is overloaded. Are these technically all concatenation helpers?
At Dane's suggestion these are now `to_pysequence()` and `to_columns_pysequence()`. I think these can/should get renamed when we see exactly how/if they are used (it may be that everybody will want custom logic and the high-level convert method will go unused).
```
    :func:`c_array_stream`.
schema : schema-like, optional
    An optional schema, passed to :func:`c_array_stream`.
handle_nulls : callable
```
This keyword is only used when the output is a buffer and not when it is a list?
Yes, and only for explicitly nullable fields. Maybe there should be a `handle_valid` and a `handle_nulls`? Or maybe just a `handle_result()`?
I kept this as is for now but I think there will eventually need to be some way to inject more customization into the process (including exactly when the null handler gets called).
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
python/src/nanoarrow/visitor.py
Outdated
```python
    """
    return ListBuilder.visit(self)


def to_column_list(self, handle_nulls=None) -> Tuple[List[str], List[Sequence]]:
```
Would `to_table` potentially be more descriptive?
Maybe not, because it truly is a pylist of na.arrays. But it is tabular in nature.
I went with `to_columns_pysequence()` for now...it's not quite a table, and I'm not sure inventing one here is the solution either. Unfortunately it's also the most useful function! Hopefully its usage will give some guidance as to what it should be named or whether it needs to be removed/live somewhere else.
python/src/nanoarrow/visitor.py
Outdated
```python
    """
    return ColumnsBuilder.visit(self, handle_nulls=handle_nulls)


def to_column(self, handle_nulls=None) -> Sequence:
```
Am I correct in thinking this is a concatenation function? e.g. chunked array -> array
Maybe `merge_chunks` or `consolidate_chunks` or `unify_chunks` could be a more descriptive name?
But it does concatenation + conversion to python lists at the same time, so just naming it about the first is also not descriptive enough
Yes, concatenation is a side effect of the fact that the non-Arrow pydata universe can't handle chunked data. This function is specifically trying to get data out of Arrow land. Your suggestion of `to_pysequence()` is much better than `to_column()`, so I went with that!
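Conceptually, the chunk-merging part of that conversion is just concatenation (a stdlib-only sketch; the real visitor additionally deals with buffers, offsets, and null handling):

```python
from itertools import chain


def concat_chunks(chunks):
    """Flatten already-converted chunks into one contiguous sequence,
    mirroring how a chunked stream collapses to a single column."""
    return list(chain.from_iterable(chunks))
```

For example, three chunks `[[1, 2], [3], [4, 5]]` collapse to one sequence `[1, 2, 3, 4, 5]`.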
python/src/nanoarrow/visitor.py
Outdated
```diff
-    def finish(self) -> List:
-        return self._lst
+    def finish(self) -> Any:
+        return self._visitor.finish()


 class ColumnsBuilder(ArrayStreamVisitor):
```
I'll add that I also had a bit of trouble comprehending the naming. I think it might be worthwhile to rename "column" to something else. Technically, it's just a contiguous array, but `array` is overloaded. Are these technically all concatenation helpers?
I hear everybody on the naming thing! "Column" is not great because it doesn't have a precedent here (the closest thing would be ...). There is precedent for the term "convert" in the R bindings ( https://arrow.apache.org/nanoarrow/latest/r/reference/convert_array_stream.html ) and Arrow C++ ( https://github.com/apache/arrow/blob/2dbc5e26dcbc6826b4eb7a330fa8090836f6b727/cpp/src/arrow/util/converter.h#L40 ), and so I gave that terminology a try in the last few commits. The crux of what these helpers are trying to do is to get a stream of arrays (possibly of indeterminate length) out of Arrow land to be represented by something else. The default something else has to be limited to the Python standard library because of the zero-dependencyness, which is "pybuffer or list". In the R bindings you can do things like:

```r
convert_array_stream(stream)  # default conversion
convert_array_stream(stream, tibble::tibble())  # explicit output prototype
```

Here,

```python
array.convert()  # default conversion
array.convert(np.int32)  # ...would get you an np.array with dtype int32
```

I will also mark these as "experimental" such that it's clear we're still settling on the terminology/behaviour/scope here.
I like the improvements! I did end up adding a more comments on naming. Let me know what you think.
python/src/nanoarrow/visitor.py
Outdated
```python
    """
    return ColumnListConverter.visit(self, handle_nulls=handle_nulls)


def convert(self, *, handle_nulls=None) -> Sequence:
```
I do like `convert`! My last recommendation would be `to_pysequence` instead. `to_pybuffer` would also work if pybuffer is split from pylist.
ehh, maybe not use `pysequence`, because that would mean the object implements the Python sequence protocol.
although the type hint does say a sequence is returned... do pybuffer and pylist both implement the sequence protocol?
They do!
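A quick check of that claim (using a `memoryview` as a stand-in for the pybuffer case): both support length, indexing, and iteration, which is the sequence-protocol behavior the type hint relies on.

```python
buf = memoryview(b"\x01\x02\x03")  # stand-in for a zero-copy buffer result
lst = [1, 2, 3]                    # stand-in for a converted pylist result

# Both expose length, integer indexing, and element-wise iteration
assert len(buf) == len(lst) == 3
assert buf[0] == lst[0] == 1
assert list(buf) == lst
```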
python/src/nanoarrow/visitor.py
Outdated
```python
    v.finish() for v in self._child_visitors
]


class ListConverter(ArrayStreamVisitor):
```
What if we named these all `ArrayStreamTo<Type>Converter`?
python/src/nanoarrow/visitor.py
Outdated
```diff
@@ -124,56 +248,209 @@ def finish(self) -> Any:
     return None


-class ListBuilder(ArrayStreamVisitor):
-    def __init__(self, schema, *, iterator_cls=PyIterator, array_view=None):
+class DispatchingConverter(ArrayStreamVisitor):
```
I think I would prefer `ArrayStreamToPySequenceConverter` here. Sorry to continue nitpicking the names; this is one of the last ones I would change, if that helps.
I went with `ToPySequenceConverter` (for brevity), but it's a great point that it's unclear whether an `XXXConverter` is converting to or from an `XXX`!
python/src/nanoarrow/visitor.py
Outdated
```diff
-class ColumnsBuilder(ArrayStreamVisitor):
-    def __init__(self, schema, *, array_view=None):
+class ColumnListConverter(ArrayStreamVisitor):
```
I still think naming this table or tabular would be more helpful to a user. Maybe `ArrayStreamToTabularConverter`, or perhaps we define a `Table` typedef wrapper and call it `ArrayStreamToTableConverter`?

```python
Table = namedtuple('table', ['column_names', 'columns'])
```

My main feedback about using "column" is that it's a subcomponent of a table or matrix, but there isn't any tabular object in nanoarrow right now. For example, pandas calls these "Series" and uses "column" terminology when the object is a dataframe.
I did keep the "column" terminology here...the intention with the return value of this function is that the list of sequences will be subcomponents of a table...we're just agnostic here to exactly which "table" they are going to end up in.
Overall LGTM! Thanks for being receptive to all of our feedback.
python/src/nanoarrow/visitor.py
Outdated
>>> columns | ||
[nanoarrow.c_lib.CBuffer(int64[24 b] 1 2 3), ['a', 'b', 'c']] | ||
""" | ||
return ToColumnListConverter.visit(self, handle_nulls=handle_nulls) |
optional nit: do we want to update this to be `ToColumnsPySequenceConverter` now?
This PR implements building columns buffer-wise for the types where this makes sense. It also implements a few other changes:

- `item_size` was renamed to `itemsize` to match the memoryview property name
- The conversion methods were moved to an `ArrayViewVisitable` mixin such that they are available in both the `Array` and `ArrayView` without duplicating documentation.

Functionally this means that the `Array` and `ArrayStream` now have `to_column()` and `to_column_list()` methods that do something that more closely matches what somebody would expect.

A quick demo:
This will basically get you data frame conversion:
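The demo output isn't reproduced here, but the shape of that data frame step could look roughly like this (a hedged sketch: `names`/`columns` stand in for the two parallel lists returned by `to_column_list()`, and the pandas call is an illustration, not part of this PR):

```python
# Hypothetical output of to_column_list(): parallel lists of names and columns
names = ["col1", "col2"]
columns = [[1, 2, 3], ["a", "b", "c"]]

# Pair names with their columns; a data frame is then one constructor
# call away, e.g.:
#   import pandas as pd
#   df = pd.DataFrame(records)
records = dict(zip(names, columns))
```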