Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): basic export through PyCapsules #320

Merged

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Nov 22, 2023

Follow-up on #318

Exports the different objects using different mechanism:

  • ArrowSchema -> deep copy
  • ArrowArray -> shallow copy of the struct, update private_data/release to ref count the original's base
  • ArrowArrayStream -> move the struct (turning the source nanoarrow.ArrayStream object as released, but that's fine)

@jorisvandenbossche jorisvandenbossche changed the title feat(python): basic export through PyCapsules (when storing one) feat(python): basic export through PyCapsules Nov 22, 2023
@codecov-commenter
Copy link

codecov-commenter commented Nov 23, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (fda87c5) 86.99% compared to head (6b83d03) 89.10%.
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #320      +/-   ##
==========================================
+ Coverage   86.99%   89.10%   +2.11%     
==========================================
  Files          71        4      -67     
  Lines       10318      101   -10217     
==========================================
- Hits         8976       90    -8886     
+ Misses       1342       11    -1331     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines 262 to 266
cdef ArrowSchema* c_schema_out = <ArrowSchema*> malloc(sizeof(ArrowSchema))
c_schema_out.release = NULL
ArrowSchemaDeepCopy(self._ptr, c_schema_out)

return PyCapsule_New(c_schema_out, 'arrow_schema', &pycapsule_schema_deleter)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might want to reorder slightly for the (unlikely) case where ArrowDeepCopy() != NANOARROW_OK and/or a future contributor adds code that returns or raises an exception.

  • malloc(sizeof ArrowSchema)
  • .release = NULL
  • PyCapsule_New() + set deleter
  • result = ArrowSchemaDeepCopy() + Error.raise()if notNANOARROW_OK`

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I haven't really been handling errors.

"""
if requested_schema is not None:
raise NotImplementedError("requested_schema")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably self._assert_valid() here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and we should do that for all three, and will add tests for it as well

# counted and can be released separately

# shallow copy
cdef ArrowArray* c_array_out = <ArrowArray*> malloc(sizeof(ArrowArray))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be clearest to create the capsule directly after allocating the array and setting release = NULL (for those of us that aren't as good at knowing what Cython code can and cannot throw exceptions)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(because if somebody added code that threw an exception before the capsule was created, the malloc'ed memory here will leak)

@@ -54,6 +54,13 @@ def test_schema_import():
assert schema.format == "+s"
assert schema._to_string(recursive=True) == "struct<some_name: int32>"

# roundtrip
pa_schema2 = pa.schema(schema)
pa_schema2.equals(pa_schema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pa_schema2.equals(pa_schema)
assert pa_schema2.equals(pa_schema)

cdef ArrowArrayStream* c_stream_out = <ArrowArrayStream*> malloc(
sizeof(ArrowArrayStream)
)
memcpy(c_stream_out, self._ptr, sizeof(ArrowArrayStream))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it seems trivial, but I think that the allocate -> set release to NULL -> create capsule with deleter is safest (i.e., minimize the number of lines where added code that throws an exception or returns will result in a leak).

@jorisvandenbossche
Copy link
Member Author

Thanks for the feedback! I should have paid more attention to the implementation in pyarrow, because we did something similar there to as early as possible create the capsule, for exactly the reasons that you commented

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thank you!

@jorisvandenbossche jorisvandenbossche merged commit 97bb65e into apache:main Nov 24, 2023
2 checks passed
@jorisvandenbossche jorisvandenbossche deleted the capsules-export-simple branch November 24, 2023 12:27
@paleolimbot paleolimbot added this to the nanoarrow 0.4.0 milestone Jan 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants