Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Implement bindings to IPC writer #586

Merged
merged 22 commits into from
Sep 14, 2024

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Aug 14, 2024

This PR implements bindings to the IPC writer in the nanoarrow C library. This adds:

  • An ipc.StreamWriter() class roughly mirroring pyarrow's ipc.Stream()
  • Schema.serialize() and Array.serialize() to match pyarrow's serialize() methods.
import io
import nanoarrow as na 
from nanoarrow.ipc import StreamWriter, InputStream

out = io.BytesIO()
writer = StreamWriter.from_writable(out)
writer.write_stream(InputStream.example())

na.Array(InputStream.from_readable(out.getvalue()))
#> nanoarrow.Array<non-nullable struct<some_col: int32>>[3]
#> {'some_col': 1}
#> {'some_col': 2}
#> {'some_col': 3}

@paleolimbot paleolimbot marked this pull request as ready for review September 9, 2024 20:57
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!
Added some minor comments/questions, didn't go through everything, but already posting what I have for today

@@ -542,6 +542,29 @@ def __iter__(self):
"to iterate over elements of this Array"
)

def serialize(self, dst=None) -> Union[bytes, None]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose using bytes | None is only allowed if not supporting older versions of python?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Optional is OK but | is 3.10 or higher (we should be using Optional everywhere but I didn't know about it when I started writing type hints. At some point we can go back and change them all).

python/src/nanoarrow/array.py Outdated Show resolved Hide resolved
@@ -236,9 +243,198 @@ def __repr__(self) -> str:
return f"<{class_label} <invalid>>"


class Writer:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a class docstring for Writer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, to understand a bit the different classes here. The already existing Stream object is essentially the "reader" equivalent?

And for this writer, does that only support the stream format or also the file format?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stream is a bit of a compromise...we can't just "subclass a record batch reader" like is done in Arrow C++/Pyarrow, but we need a place to put the code that deals with file IO...so the stream is sort of an internal stepping stone to eliminate anything dealing with files in nanoarrow.array_stream.ArrayStream. It should have a better name now that its purpose is more clear.

Writer is definitely a user-facing concept and should probably be a StreamWriter (since it doesn't do files)

self._writer.write_end_of_stream()
self.release()

def write_stream(self, obj, schema=None, *, write_schema=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could in theory also be a write_array (or maybe rather write_batch) method that takes a single array to write to the stream?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could be! An array can be implicitly converted to a stream and will work here, but I'll add one for symmetry (and some day it might be faster!)

@paleolimbot paleolimbot merged commit 3aa0ec1 into apache:main Sep 14, 2024
12 checks passed
@paleolimbot paleolimbot deleted the python-ipc-write branch September 15, 2024 02:03
@paleolimbot paleolimbot added this to the nanoarrow 0.6.0 milestone Sep 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants