Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Implement Arrow PyCapsule Interface for Series/DataFrame export #17676

Merged
merged 14 commits into from
Jul 25, 2024

Conversation

kylebarron
Copy link
Contributor

@kylebarron kylebarron commented Jul 16, 2024

Progress towards #12530.

I added one minimal test for the Series export and it appears to work:

a = pl.Series("a", [1, 2, 3, None])
pyarrow_chunked = pa.chunked_array(a)
assert pyarrow_chunked.combine_chunks() == pa.array([1, 2, 3, None])

I added a test for DataFrame stream export and it works as well. You can pass pa.table(polars.DataFrame) and it'll just work.

@@ -19,6 +19,8 @@ impl Drop for ArrowArrayStream {
}
}

unsafe impl Send for ArrowArrayStream {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eitsupi
Copy link
Contributor

eitsupi commented Jul 16, 2024

I'm hitting some lifetime issues with the DataFrame export, but I figured I'd create the PR and we can discuss.

Have you seen #14208?

Copy link

codecov bot commented Jul 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.50%. Comparing base (66f0026) to head (d40f696).
Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17676      +/-   ##
==========================================
+ Coverage   80.47%   80.50%   +0.03%     
==========================================
  Files        1503     1503              
  Lines      197115   197100      -15     
  Branches     2794     2804      +10     
==========================================
+ Hits       158628   158684      +56     
+ Misses      37973    37896      -77     
- Partials      514      520       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@kylebarron
Copy link
Contributor Author

I'm hitting some lifetime issues with the DataFrame export, but I figured I'd create the PR and we can discuss.

Have you seen #14208?

I ended up vendoring that code as part of this PR.

Just checking, when you call DataFrame.clone() is that a full memory copy of the input or are arrays reference counted somewhere?

I added a test for DataFrame export as well, so this should be good to review.

@eitsupi
Copy link
Contributor

eitsupi commented Jul 18, 2024

I ended up vendoring that code as part of this PR.

I am wondering if that should be added to polars-core or somewhere else instead of py-polars. (i.e. the code must be copied downstream each time like py-polars or r-polars unless it is included in the polars crate)
Of course this can be done later with follow up PRs.

Just checking, when you call DataFrame.clone() is that a full memory copy of the input or are arrays reference counted somewhere?

I'm not familiar with the polars internals, but I'm pretty sure that DataFrame.clone() isn't actually copying data (Python Polars does clone everywhere, but that's not slowing it down, is it?)

df = pl.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
out = pa.table(PyCapsuleStreamHolder(df.__arrow_c_stream__(None)))
assert df.shape == out.shape
assert df.schema.names() == out.schema.names
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could drop df just now and make sure that the recreated df2 below still gets the expected contents (instead of crashing or whatever else).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the test to not hold a bare capsule, but rather call the underlying object's __arrow_c_stream__ method. I'm not sure what you're suggesting this test, since I need to check below that df and df2 are equal. Are you suggesting after that I should drop df again? That isn't possible when this utility class doesn't hold bare capsules


a = pl.Series("a", [1, 2, 3, None])
out = pa.chunked_array(PyCapsuleSeriesHolder(a.__arrow_c_stream__(None)))
out_arr = out.combine_chunks()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same idea here (drop a before doing things with out)

@kylebarron kylebarron changed the title Implement Arrow PyCapsule Interface for Series/DataFrame export feat(python): Implement Arrow PyCapsule Interface for Series/DataFrame export Jul 22, 2024
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature and removed title needs formatting labels Jul 22, 2024
@github-actions github-actions bot added the python Related to Python Polars label Jul 22, 2024
from typing import Any


class PyCapsuleStreamHolder:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is put in a helper file because it's used by tests both in this PR and in https://github.com/pola-rs/polars/pull/17693/files. Let me know if there's a better place to put this test helper.

Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Kyle, I've left some comments.

series: &'py Series,
py: Python<'py>,
) -> PyResult<Bound<'py, PyCapsule>> {
let field = series.field().to_arrow(CompatLevel::oldest());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think this should be newest, otherwise we trigger a copy whereas the consumer should decide if they want to cast to a datatype they can support.

Copy link
Contributor

@ruihe774 ruihe774 Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why requested_schema is not used? I think it instead of CompatLevel should decides what schema should be used (e.g. LargeString or Utf8View). In the future, imo it can replace CompatLevel.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why requested_schema is not used?

Does the protocol allow for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why requested_schema is not used?

Does the protocol allow for this?

https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html#schema-requests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, then I agree request_schema should be respected and if none given we can default to newest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's been discussion about this in apache/arrow#39689. To be able to pass in a requested_schema argument, the consumer needs to know the schema of the producer's existing Arrow data. Only then can it know whether it needs to ask the producer to cast to a different type.

I believe I summarized the consensus in apache/arrow#39689 (comment), but while waiting for confirmation, I think it would be best for us to leave requested_schema and schema negotiation to a follow up PR, if that's ok.

py-polars/src/interop/arrow/to_py.rs Outdated Show resolved Hide resolved
py-polars/src/interop/arrow/to_py.rs Outdated Show resolved Hide resolved
py-polars/src/interop/arrow/to_py.rs Outdated Show resolved Hide resolved
py-polars/src/interop/arrow/to_py.rs Show resolved Hide resolved
py-polars/src/interop/arrow/to_py.rs Outdated Show resolved Hide resolved
py-polars/src/interop/arrow/to_py.rs Outdated Show resolved Hide resolved
py-polars/src/interop/arrow/to_py.rs Outdated Show resolved Hide resolved
py-polars/src/interop/arrow/to_py.rs Outdated Show resolved Hide resolved
py-polars/tests/unit/interop/test_interop.py Outdated Show resolved Hide resolved
@ruihe774
Copy link
Contributor

FWIW, I'm curious about whether it's possible to implement Series/DataFrame importing from PyCapsule. And if it is possible, can we migrate current FFI interfaces (Series._import_arrow_from_c and Array._export_to_c) to PyCapsule?

@ritchie46
Copy link
Member

What would be the benefit of that? (I am on the camp, if it aint broke, don't fix it. ;) )

@ruihe774
Copy link
Contributor

What would be the benefit of that? (I am on the camp, if it aint broke, don't fix it. ;) )

  • We can drop the dependency on pyarrow's Array._export_to_c when exporting DataFrame/Series.
  • pyo3-polars can work with any python objects that support Arrow PyCapsule interface, not limited to polars dataframes. Imagine that users can directly pass pandas dataframes to pyo3 extensions.

@kylebarron
Copy link
Contributor Author

FWIW, I'm curious about whether it's possible to implement Series/DataFrame importing from PyCapsule

@ruihe774 have you seen #17693?

Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. Thanks a lot @kylebarron. Once all is in, can you follow up with an update on the user guide. We have a section on Arrow C interop, which should expose the capsule method as well.

@ritchie46 ritchie46 merged commit 9978d88 into pola-rs:main Jul 25, 2024
27 checks passed
@kylebarron kylebarron deleted the kyle/export-pycapsule-interface branch July 25, 2024 15:45
@kylebarron
Copy link
Contributor Author

Once all is in, can you follow up with an update on the user guide. We have a section on Arrow C interop, which should expose the capsule method as well.

Can you point me to where this is? Do you mean this paragraph? https://docs.pola.rs/user-guide/ecosystem/#apache-arrow

@ritchie46
Copy link
Member

It isn't released yet, but it is this page: https://github.com/pola-rs/polars/blob/main/docs/user-guide/misc/arrow.md

@kylebarron
Copy link
Contributor Author

kylebarron commented Jul 25, 2024

I see. I see those APIs from #17696 were just added, but I'd personally argue to deprecate them. The PyCapsule Interface should be a strict improvement over those APIs:

  • No need for the caller to know anything about polars and to know Polars' semi-private APIs.
  • No need for the caller to specifically rechunk a polars Series or iterate over a Python list of chunks. The Arrow C Stream will have the same number of chunks as the Series has.
  • No memory leaks: with _export_arrow_to_c if the caller doesn't import the exported pointers, memory leaks. With PyCapsules, Drop is called when the capsule goes out of Python scope if it hasn't been imported, so memory can't leak.
  • Works on both a Series and a DataFrame

Regardless, I'll make a docs PR to add to that page

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants