
feat(python): Add column-wise buffer builder #464

Merged
merged 42 commits on May 17, 2024

Conversation


@paleolimbot paleolimbot commented May 13, 2024

This PR implements building columns buffer-wise for the types where this makes sense. It also implements a few other changes:

  • item_size was renamed to itemsize to match the memoryview property name
  • The visitor methods now live in the ArrayViewVisitable mixin so that they are available on both the Array and ArrayView without duplicating documentation.

Functionally this means that the Array and ArrayStream now have to_column() and to_column_list() methods that do something that more closely matches what somebody would expect.

A quick demo:

import nanoarrow as na
import pyarrow as pa

batch = pa.record_batch({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
batch_with_nulls = pa.record_batch({"col1": [1, None, 3], "col2": ["a", "b", None]})

# Either builds a buffer or a list depending on column types
na.Array(batch).to_columns_pysequence()
#> (['col1', 'col2'],
#>  [nanoarrow.c_lib.CBuffer(int64[24 b] 1 2 3), ['a', 'b', 'c']])

# One can inject a null handler (a few experimental ones provided)
na.Array(batch_with_nulls).to_columns_pysequence(handle_nulls=na.nulls_as_sentinel())
#> (['col1', 'col2'], [array([ 1., nan,  3.]), ['a', 'b', None]])

# ...by default you have to choose how to do this or we error
na.Array(batch_with_nulls).to_columns_pysequence()
#> ValueError: Null present with null_handler=nulls_forbid()
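The "buffer-wise" idea behind this PR can be illustrated with the standard library alone (this is a sketch of the concept, not nanoarrow's implementation): fixed-width chunks are appended to one contiguous buffer instead of being materialized element by element as Python objects.

```python
from array import array

# Two int64 ("q") chunks, standing in for chunks of an Arrow array stream
chunks = [array("q", [1, 2, 3]), array("q", [4, 5, 6])]

# Accumulate into a single contiguous int64 buffer
out = array("q")
for chunk in chunks:
    out.extend(chunk)

print(out.tolist())  # [1, 2, 3, 4, 5, 6]
```

The result supports the Python buffer protocol, so a consumer like numpy or pandas can wrap it without copying.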

This will basically get you data frame conversion:

import nanoarrow as na
import pandas as pd

url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
names, data = na.ArrayStream.from_url(url).to_columns_pysequence(handle_nulls=na.nulls_as_sentinel())
pd.DataFrame({k: v for k, v in zip(names, data)})
#>                                          commit                      time  \
#> 0      49cdb0fe4e98fda19031c864a18e6156c6edbf3c 2024-03-07 02:00:52+00:00   
#> 1      1d966e98e41ce817d1f8c5159c0b9caa4de75816 2024-03-06 21:51:34+00:00   
#> 2      96f26a89bd73997f7532643cdb27d04b70971530 2024-03-06 20:29:15+00:00   
#> 3      ee1a8c39a55f3543a82fed900dadca791f6e9f88 2024-03-06 07:46:45+00:00   
#> 4      3d467ac7bfae03cf2db09807054c5672e1959aec 2024-03-05 16:13:32+00:00   
#> ...                                         ...                       ...   
#> 15482  23c4b08d154f8079806a1f0258d7e4af17bdf5fd 2016-02-17 12:39:03+00:00   
#> 15483  16e44e3d456219c48595142d0a6814c9c950d30c 2016-02-17 12:38:39+00:00   
#> 15484  fa5f0299f046c46e1b2f671e5e3b4f1956522711 2016-02-17 12:38:39+00:00   
#> 15485  cbc56bf8ac423c585c782d5eda5c517ea8df8e3c 2016-02-17 12:38:39+00:00   
#> 15486  d5aa7c46692474376a3c31704cfc4783c86338f2 2016-02-05 20:08:35+00:00   
#> 
#>        files  merge                                            message  
#> 0          2  False  GH-40370: [C++] Define ARROW_FORCE_INLINE for ...  
#> 1          1  False     GH-40386: [Python] Fix except clauses (#40387)  
#> 2          1  False  GH-40227: [R] ensure executable files in `crea...  
#> 3          1  False  GH-40366: [C++] Remove const qualifier from Bu...  
#> 4          1  False  GH-20127: [Python][CI] Remove legacy hdfs test...  
#> ...      ...    ...                                                ...  
#> 15482     73  False  ARROW-4: This provides an partial C++11 implem...  
#> 15483      8  False  ARROW-3: This patch includes a WIP draft speci...  
#> 15484    124  False                 ARROW-1: Initial Arrow Code Commit  
#> 15485      2  False             Update readme and add license in root.  
#> 15486      1  False                                     Initial Commit  
#> 
#> [15487 rows x 5 columns]

@jorisvandenbossche (Member) left a comment:

A few quick comments!

@@ -49,7 +50,7 @@ def to_pylist(obj, schema=None) -> List:
     return ListBuilder.visit(obj, schema)


-def to_columns(obj, schema=None) -> Tuple[List[str], List[Sequence]]:
+def to_columns(obj, schema=None, handle_nulls=None) -> Tuple[List[str], List[Sequence]]:
     """Convert ``obj`` to a ``list()`` of sequences

     Converts a stream of struct arrays into its column-wise representation
Member:

For the line below, but can you then clarify when you get a buffer or when a list?

return handle


def nulls_as_masked_array():
Member:

I wonder if it would be more useful to actually return each array as a tuple of data and mask (the numpy masked array isn't used that much)

paleolimbot (Member Author):

Good to know! I'll remove that then 🙂

I wonder if it would be more useful to actually return each array as a tuple of data and mask

Currently that would be nulls_debug(), mostly because I couldn't think of a good name (in a previous version it was nulls_as_mask_and_data()). Returning a tuple here sort of breaks the "every column is a sequence" guarantee. There could also be a MaskedSequence class that wraps the mask and data but that seems like opening a can of worms.

@paleolimbot marked this pull request as ready for review May 14, 2024 20:40
@danepitkin (Member) left a comment:

Very cool! The Column builder code is quite nuanced, I need to do another pass over it.

from nanoarrow.iterator import ArrayViewBaseIterator, PyIterator
from nanoarrow.schema import Type


def to_pylist(obj, schema=None) -> List:
"""Convert ``obj`` to a ``list()`` of Python objects
class ArrayViewVisitable:
Member:

Awesome, love this!


>>> import nanoarrow as na
>>> import pyarrow as pa
>>> batch = pa.record_batch([pa.array([1, 2, 3])], names=["col1"])
Member:

nit: can we use two columns in the docstrings? I think it would better show the functionality.

return handle


def nulls_debug() -> Callable[[CBuffer, Sequence], Tuple[CBuffer, Sequence]]:
Member:

Should we mark this as an internal API? e.g. _nulls_debug()

paleolimbot (Member Author):

I changed the name to nulls_separate() and exported it...from @jorisvandenbossche's comment it sounds like it would be useful (in the situation where caller wants to handle nulls completely on their own).

Member:

nulls_separate sounds in any case as a better name! (and yes I think it would be useful)


def begin(self, total_elements: Union[int, None] = None) -> None:
for child_visitor in self._child_visitors:
child_visitor.begin(total_elements)

def visit_chunk_view(self, array_view: CArrayView) -> Any:
if array_view.null_count > 0:
raise ValueError("null_count > 0 encountered in ColumnsBuilder")
Member:

Why is this an error?

paleolimbot (Member Author):

If there are any nulls here the column results will be bogus! (We have no way to propagate top-level nulls through into the child columns.) At the "record batch" level nulls don't exist (typically a record batch is exported as an explicitly non-nullable struct; however, at least one implementation doesn't do this, so we have to check for actual nulls whilst iterating instead of erroring when we see the schema).

Member:

Can you add a brief comment about this in the code?


# Infer null count == 0 because of null data buffer
array = na.c_array_from_buffers(
na.int32(), 3, (None, na.c_buffer([1, 2, 3], na.int32())), null_count=-1
Member:

why do we need to add null_count=-1 in these tests?

paleolimbot (Member Author):

I added comments to each of these...-1 is the sentinel for "not computed yet". Technically it's the default but here it's exactly what we're testing.
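The sentinel convention being tested can be sketched in a few lines of Python (names here are hypothetical; the real logic lives in nanoarrow's C layer):

```python
UNKNOWN_NULL_COUNT = -1  # sentinel meaning "not computed yet"

def resolve_null_count(cached, validity):
    """Return the cached null count, computing it from the validity map if unknown."""
    if cached != UNKNOWN_NULL_COUNT:
        return cached
    if validity is None:
        # No validity buffer at all: infer null count == 0
        return 0
    return sum(1 for ok in validity if not ok)

print(resolve_null_count(-1, None))                # 0
print(resolve_null_count(-1, [True, False, False]))  # 2
```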

@jorisvandenbossche (Member) left a comment:

First set of comments. And thanks for the nice docstrings!

handle_nulls : callable
A function returning a sequence based on a validity bytemap and a
contiguous buffer of values (e.g., the callable returned by
:meth:`nulls_as_sentinel`).
Member:

Mention that the default is nulls_forbid()?

Member:

And maybe also list the built-in options


Converts a stream of arrays into a columnar representation
such that each column is either a contiguous buffer or a ``list()``.
Integer, float, and interval arrays are currently converted to their
Member:

What is the reason interval arrays are returned as buffer? (just because there is no obvious python object?)

And why are other primitive fixed-width types like timestamp not returned as buffers? (because they are not a c primitive type? And have an obvious python object to use?)

I can imagine that for certain purposes, you would actually want the integers behind a timestamp column, because that is much cheaper to work with than a list of python datetime.datetime objects.

Should we make this user configurable?

paleolimbot (Member Author):

What is the reason interval arrays are returned as buffer? (just because there is no obvious python object?)

The technical answer is that schema_view.buffer_format returns "iiq" (i.e., not None)...I never really looked at what pyarrow does here but I see now that it returns a named tuple if converted to a list. The last time I tried to_numpy() on a pyarrow interval I got a crash ( apache/arrow#41326 ). I'm pretty happy to do anything here (or make a breaking change later to handle it properly).

And why are other primitive fixed-width types like timestamp not returned as buffers?

If just the storage were returned it would be lossy (i.e., pd.Series() would do the wrong thing). There's no way to communicate what to do here without invoking numpy or pandas-specific logic, so a list of Python objects is maybe a safer default.

Should we make this user configurable?

Totally (and also make Python object conversion configurable), but I'm not sure exactly how to do that yet. If somebody wanted to do this today, I'd suggest subclassing the visitor:

import nanoarrow as na
import pyarrow as pa
from datetime import datetime
from nanoarrow.visitor import ColumnsBuilder, NullableColumnBuilder


class CustomColumnsBuilder(ColumnsBuilder):

    def _resolve_child_visitor(self, child_schema, child_array_view, handle_nulls):
        if na.Schema(child_schema).type == na.Type.TIMESTAMP:
            return NullableColumnBuilder(na.int64(), handle_nulls=handle_nulls)
        else:
            return super()._resolve_child_visitor(
                child_schema, child_array_view, handle_nulls
            )

batch = pa.record_batch({"ts": [datetime.now()]})
CustomColumnsBuilder.visit(batch)
#> (['ts'], [nanoarrow.c_lib.CBuffer(int64[8 b] 1715862463542238)])

Comment on lines 93 to 94
schema : schema-like, optional
An optional schema, passed to :func:`c_array_stream`.
Member:

This is not actually present in the signature?

Comment on lines 127 to 128
if len(is_valid) > 0:
raise ValueError("Null present with null_handler=nulls_forbid()")
Member:

It's not because there is a validity buffer that it includes any nulls? Or did you ensure that before? (eg not convert the bitmap to bytemap if the bitmap is all set?)

paleolimbot (Member Author):

Yes...the bool buffer indicating validity is never allocated unless a null is actually present (and is None otherwise). I added this to the handle_nulls documentation!
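The null-handler contract described here can be sketched as follows (hypothetical names; the real signatures live in nanoarrow.visitor): a handler receives a validity bytemap, or None when no nulls are present, plus the contiguous values.

```python
def nulls_as_sentinel_sketch(sentinel=None):
    """Return a handler that replaces null slots with `sentinel` (illustrative only)."""
    def handle(is_valid, data):
        if is_valid is None:
            # Validity buffer never allocated, so no nulls are present
            return data
        return [v if ok else sentinel for ok, v in zip(is_valid, data)]
    return handle

handle = nulls_as_sentinel_sketch()
print(handle(None, [1, 2, 3]))                 # [1, 2, 3]
print(handle([True, False, True], [1, 2, 3]))  # [1, None, 3]
```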

Comment on lines 140 to 141
The default sentinel value will result in ``nan`` assigned to null
values in numeric and boolean outputs.
Member:

I am not entirely sure this is the case for boolean? (setting None into a boolean array sets False)

Member:

Ah, but you ensure to convert int/bool to float in case there are nulls, so that setting as NaN indeed works correctly. Maybe clarify that here

paleolimbot (Member Author):

Yes, it's a bit of a side-effect of how numpy handles None with result_type(). I added a comment clarifying this.

import numpy as np

sentinel = None
arr = np.array([True, False, True, True], np.result_type(sentinel, np.bool_))
arr[2] = sentinel
arr
#> array([ 1.,  0., nan,  1.])


>>> import nanoarrow as na
>>> na.Array([1, 2, 3], na.int32()).to_column(na.nulls_separate())
(nanoarrow.c_lib.CBuffer(uint8[0 b] ), nanoarrow.c_lib.CBuffer(int32[12 b] 1 2 3))
Member:

Should we return None here instead of an empty buffer? Because if you convert both outputs to numpy arrays, you can have numpy arrays of different length (of course, checking for None is no harder than checking for an empty / length-0 buffer).

paleolimbot (Member Author):

Yes! I updated this to return None, which is less ambiguous. One could also update this such that handle_nulls() is never called unless there are actually nulls?

def finish(self) -> List:
return self._lst
def finish(self) -> Any:
return self._visitor.finish()


class ColumnsBuilder(ArrayStreamVisitor):
Member:

FWIW, and I don't directly have a better idea, but I found the usage of "columns" somewhat confusing here (and also "Builder", because that sounds too similar to the builders used to build arrays/buffers in the python->arrow direction). Naming is hard .. ;)

paleolimbot (Member Author):

It's been through a few iterations of naming 🙂 . Maybe sequence is better than column? The gist of it is "something that either numpy or Pandas can understand" and/or "we dealt with all the hard parts of a non-contiguous or non-zero offset array so you didn't have to". I could stick with Visitor instead of Builder, too (but these are internal for now so at least that part can get renamed).

Member:

I'll add that I also had a bit of trouble comprehending the naming. I think it might be worthwhile to rename "column" to something else. Technically, it's just a contiguous array, but "array" is overloaded. Are these technically all concatenation helpers?

paleolimbot (Member Author):

At Dane's suggestion these are now to_pysequence() and to_columns_pysequence(). I think these can/should get renamed when we see exactly how/if they are used (it may be that everybody will want custom logic and the high-level convert method will get unused).

:func:`c_array_stream`.
schema : schema-like, optional
An optional schema, passed to :func:`c_array_stream`.
handle_nulls : callable
Member:

This keyword is only used when the output is a buffer and not when it is a list?

paleolimbot (Member Author) commented May 16, 2024:

Yes, and only for explicitly nullable fields. Maybe there should be a handle_valid and handle_nulls? Or maybe just a handle_result()?

paleolimbot (Member Author):

I kept this as is for now but I think there will eventually need to be some way to inject more customization into the process (including exactly when the null handler gets called).



paleolimbot and others added 4 commits May 16, 2024 12:34
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
"""
return ListBuilder.visit(self)

def to_column_list(self, handle_nulls=None) -> Tuple[List[str], List[Sequence]]:
Member:

would to_table potentially be more descriptive?

Member:

Maybe not, cause it truly is a pylist of na.arrays. But it is tabular in nature.

paleolimbot (Member Author):

I went with to_columns_pysequence() for now...it's not quite a table and I am not sure inventing one here is the solution either. Unfortunately it's also the most useful function! Hopefully its usage will give some guidance as to what it should be named or if it needs to be removed/live somewhere else.

"""
return ColumnsBuilder.visit(self, handle_nulls=handle_nulls)

def to_column(self, handle_nulls=None) -> Sequence:
Member:

Am I correct in thinking this is a concatenation function? e.g. chunked array -> array

Member:

Maybe merge_chunks or consolidate_chunks or unify_chunks could be a more descriptive name?

Member:

But it does concatenation + conversion to python lists at the same time, so just naming it about the first is also not descriptive enough

paleolimbot (Member Author):

Yes, concatenation is a side effect of the fact that the non-Arrow pydata universe can't handle chunked data. This function is specifically trying to get data out of Arrow land. Your suggestion of to_pysequence() is much better than to_column() so I went with that!


@paleolimbot (Member Author):

I hear everybody on the naming thing! "Column" is not great because it doesn't have a precedent here (the closest thing would be pyarrow.Table.column(), which is still giving arrow arrays), and "Builder" is already used to describe the conversion of to an array. "Concatenate" might imply that we're returning an Array, which we could do or build using this machinery, but is not really the desired endpoint here.

There is precedent for the term "convert" in the R bindings ( https://arrow.apache.org/nanoarrow/latest/r/reference/convert_array_stream.html ) and Arrow C++ ( https://github.com/apache/arrow/blob/2dbc5e26dcbc6826b4eb7a330fa8090836f6b727/cpp/src/arrow/util/converter.h#L40 ), and so I gave that terminology a try in the last few commits.

The crux of what these helpers are trying to do is to get a stream of arrays (possibly of indeterminate length) out of Arrow land to be represented by something else. The default something else has to be limited to the Python standard library because of the zero dependencyness, which is "pybuffer or list". In the R bindings you can do things like:

convert_array_stream(stream)  # default conversion
convert_array_stream(stream, tibble::tibble())  # explicit output prototype

Here, visitable.convert() could do the same thing (although won't in this PR because it's a can of worms, and maybe not ever if nobody ends up using the high-level interface).

array.convert() # default conversion
array.convert(np.int32) # ...would get you an np.array with dtype int32

I will also mark these as "experimental" such that it's clear we're settling on the terminology/behaviour/scope here.

@danepitkin (Member) left a comment:

I like the improvements! I did end up adding a more comments on naming. Let me know what you think.

"""
return ColumnListConverter.visit(self, handle_nulls=handle_nulls)

def convert(self, *, handle_nulls=None) -> Sequence:
Member:

I do like convert! My last recommendation would be to_pysequence instead. to_pybuffer would also work if pybuffer is split from pylist.

Member:

ehh maybe not use pysequence because that would mean the object implements the python sequence protocol.

Member:

although the type hint does say a sequence is returned.. do pybuffer and pylist both implement the sequence protocol?

paleolimbot (Member Author):

They do!
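A quick stdlib check of the duck-typed sequence behaviour being discussed, with memoryview standing in for a buffer-protocol object and a plain list for the pylist case:

```python
from collections.abc import Sequence

mv = memoryview(b"\x01\x02\x03")  # buffer-protocol object
lst = [1, 2, 3]

# Both support the core sequence operations...
for seq in (mv, lst):
    assert len(seq) == 3
    assert seq[0] == 1
    assert list(seq) == [1, 2, 3]

# ...and both are registered with the Sequence ABC
print(isinstance(mv, Sequence), isinstance(lst, Sequence))  # True True
```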

v.finish() for v in self._child_visitors
]


class ListConverter(ArrayStreamVisitor):
Member:

What if we named these all ArrayStreamTo<Type>Converter?

@@ -124,56 +248,209 @@ def finish(self) -> Any:
return None


class ListBuilder(ArrayStreamVisitor):
def __init__(self, schema, *, iterator_cls=PyIterator, array_view=None):
class DispatchingConverter(ArrayStreamVisitor):
Member:

I think I would prefer ArrayStreamToPySequenceConverter here. Sorry to continue nitpicking the names, this is one of the last ones that I would change if that helps.

paleolimbot (Member Author):

I went with ToPySequenceConverter (for brevity), but it's a great point that it's unclear whether an XXXConverter converts to or from an XXX!



class ColumnsBuilder(ArrayStreamVisitor):
def __init__(self, schema, *, array_view=None):
class ColumnListConverter(ArrayStreamVisitor):
Member:

I still think naming this table or tabular would be more helpful to a user. Maybe ArrayStreamToTabularConverter or perhaps we define a Table typedef wrapper and call it ArrayStreamToTableConverter?

Table = namedtuple('table', ['column_names', 'columns'])

My main feedback about using "column" is that it's a subcomponent of a table or matrix, but there isn't any tabular object in nanoarrow right now. For example, pandas calls these "Series" and uses "column" terminology when the object is a DataFrame.
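In use, the namedtuple proposed above would look like this (a sketch of the reviewer's suggestion, not anything nanoarrow ships):

```python
from collections import namedtuple

# Hypothetical wrapper pairing column names with their sequences
Table = namedtuple("Table", ["column_names", "columns"])

t = Table(column_names=["col1", "col2"], columns=[[1, 2, 3], ["a", "b", "c"]])
print(t.column_names)  # ['col1', 'col2']
print(t.columns[0])    # [1, 2, 3]
```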

paleolimbot (Member Author):

I did keep the "column" terminology here...the intention with the return value of this function is that the list of sequences will be subcomponents of a table...we're just agnostic here to exactly which "table" they are going to end up in.

@danepitkin (Member) left a comment:

Overall LGTM! Thanks for being receptive to all of our feedback.

>>> columns
[nanoarrow.c_lib.CBuffer(int64[24 b] 1 2 3), ['a', 'b', 'c']]
"""
return ToColumnListConverter.visit(self, handle_nulls=handle_nulls)
Member:

optional nit: do we want to update this to be ToColumnsPySequenceConverter now?

@paleolimbot merged commit d1b9924 into apache:main May 17, 2024
6 checks passed
@paleolimbot deleted the python-column-buffers branch May 17, 2024 20:14
@paleolimbot added this to the nanoarrow 0.5.0 milestone May 22, 2024