GH-33346: [Python] DataFrame Interchange Protocol for pyarrow Table #14804
Conversation
One more thing I will try to look at ASAP is the copies being made in the implementation, and I will add warnings in case copies are made.
(partial review)
```python
if sys.version_info >= (3, 8):
    from typing import TypedDict
else:
    from typing_extensions import TypedDict
```
Is this a third party package that needs to be installed?
I think it is part of the Python standard library: https://docs.python.org/3/library/typing.html. Looking at it, the typing module is supported for Python 3.5 and up, though TypedDict itself was only added in 3.8. Will change the check.
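As a sketch, the same fallback can also be written without an explicit version check: try the stdlib import first and fall back to the `typing_extensions` backport (a third-party package that would need to be installed on older interpreters). The `ColumnBuffers` class below is a hypothetical name for illustration, not code from the PR.

```python
# Alternative to an explicit version check: attempt the stdlib import and
# fall back to the typing_extensions backport. TypedDict was added to the
# stdlib typing module in Python 3.8.
try:
    from typing import TypedDict
except ImportError:
    from typing_extensions import TypedDict


# Illustrative use (hypothetical name, not from the PR):
class ColumnBuffers(TypedDict):
    data: str
    validity: str
```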
python/pyarrow/interchange/column.py (outdated)

```python
)
i += 1

elif isinstance(self._col, pa.ChunkedArray):
```
Above you combined the chunks when passing a ChunkedArray to the constructor, so that means `self._col` can currently never be a ChunkedArray? But maybe we should try to support that and not always upfront call `combine_chunks`?
Above you combined the chunks when passing a ChunkedArray to the constructor, so that means self._col can currently never be a ChunkedArray?
Oh, I missed this part here when moving the combining of chunks to the constructor in 71ca596. Will correct.
But maybe we should try to support that and not always upfront call combine_chunks?
Just to summarize what we talked about (will add it to the code as a comment): when the dataframe is being consumed with the `from_dataframe` method, it iterates through the chunks of the dataframe:
```python
for chunk in df.get_chunks():
```
arrow/python/pyarrow/interchange/dataframe.py
Lines 161 to 190 in 71ca596
```python
def get_chunks(
    self, n_chunks: Optional[int] = None
) -> Iterable[_PyArrowDataFrame]:
    """
    Return an iterator yielding the chunks.

    By default (None), yields the chunks that the data is stored as by the
    producer. If given, ``n_chunks`` must be a multiple of
    ``self.num_chunks()``, meaning the producer must subdivide each chunk
    before yielding it.

    Note that the producer must ensure that all columns are chunked the
    same way.
    """
    if n_chunks and n_chunks > 1:
        chunk_size = self.num_rows() // n_chunks
        if self.num_rows() % n_chunks != 0:
            chunk_size += 1
        batches = self._df.to_batches(max_chunksize=chunk_size)
        # In case the chunk size is such that the resulting list has one
        # chunk fewer than n_chunks -> append an empty chunk
        if len(batches) == n_chunks - 1:
            batches.append(pa.record_batch([[]], schema=self._df.schema))
    else:
        batches = self._df.to_batches()
    iterator_tables = [
        _PyArrowDataFrame(
            pa.Table.from_batches([batch]), self._nan_as_null, self._allow_copy
        )
        for batch in batches
    ]
    return iterator_tables
```
and so the consumer always converts a part of the dataframe that is not chunked. For that reason there is no need to support chunking at the column level: the only way to get a chunked array is by calling the Column class directly, not through the DataFrame class.
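The chunk-size arithmetic in `get_chunks` above (ceil division, plus an empty trailing chunk when `to_batches` yields one batch fewer than requested) can be sketched in isolation. `split_row_counts` is a hypothetical helper for illustration, not part of the PR:

```python
def split_row_counts(num_rows: int, n_chunks: int) -> list:
    # Mirror the logic in get_chunks: ceil-divide the rows into n_chunks
    # pieces, as to_batches(max_chunksize=...) would.
    chunk_size = num_rows // n_chunks
    if num_rows % n_chunks != 0:
        chunk_size += 1
    sizes = []
    remaining = num_rows
    while remaining > 0:
        sizes.append(min(chunk_size, remaining))
        remaining -= chunk_size
    # When the rounded-up chunk size leaves one chunk fewer than requested,
    # pad with an empty chunk, like the record_batch([[]]) append above.
    if len(sizes) == n_chunks - 1:
        sizes.append(0)
    return sizes
```

For example, 10 rows split into 3 chunks gives sizes of 4, 4 and 2, while 6 rows into 4 chunks rounds up to size 2 and needs the empty padding chunk.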
@jorisvandenbossche I think I addressed all the comments and topics we have talked about. Some information about the recent changes:

So I think this PR is ready for another round of review 🙏
Great work! Continuing my review (now mostly looked at `from_dataframe`).
@jorisvandenbossche I have gone through all of the suggestions; I think this PR is ready for another round of review. Thank you!
- String dtype is removed from the pandas roundtrip tests, as pandas defines `.size()` as a method in column.py but calls it as a property in from_dataframe.py, so the roundtrip with pandas errors for string dtypes.
It should be possible to add this back now (since it is fixed in pandas main), but skip the string dtype depending on the pandas version (only the 2.0.0.dev version runs the test)
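Such a version gate could look like the following sketch; `string_roundtrip_supported` and the parsing helper are hypothetical names for illustration, not the actual test code (real code would likely use `packaging.version` instead of hand parsing):

```python
def _parse_version(version: str) -> tuple:
    # Crude parse: keep only the digits of each dot-separated piece, so
    # "2.0.0.dev0" -> (2, 0, 0, 0) and "1.5.3" -> (1, 5, 3).
    parts = []
    for piece in version.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def string_roundtrip_supported(pandas_version: str) -> bool:
    # The .size fix landed in pandas main, so only the 2.0.0.dev nightly
    # (and later releases) would run the string-dtype roundtrip test.
    return _parse_version(pandas_version) >= (2, 0, 0)
```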
@github-actions crossbow submit test-conda-python-3.9-pandas-upstream_devel

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

Revision: 04d6f3b Submitted crossbow builds: ursacomputing/crossbow @ actions-693c58f0a7

@jorisvandenbossche the code review suggestions are all addressed. If I am not mistaken, you still want to review the tests?

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

Revision: 1fe490c Submitted crossbow builds: ursacomputing/crossbow @ actions-86fb33df92
…m_dataframe.py Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Force-pushed from 1fe490c to e937b4c
@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

1 similar comment

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

Revision: e937b4c Submitted crossbow builds: ursacomputing/crossbow @ actions-2377096f27

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

Revision: 1b5f248 Submitted crossbow builds: ursacomputing/crossbow @ actions-0169982f7e

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

Revision: 9139444 Submitted crossbow builds: ursacomputing/crossbow @ actions-17d90bfc40
```python
)
pandas_df = pandas_from_dataframe(table)
result = pi.from_dataframe(pandas_df)
```
Is there an `assert table.equals(result)` missing here (like there is in the test above)?
Because pandas defines an int64 offset for what is in our case a normal string (not a large string), the dtype at the end of the roundtrip becomes `large_string`. Due to that, the assertion is done with a pylist for the values, and separately for the dtype (first normal string, then large string).
@jorisvandenbossche do you have any blocking issues I can correct today to get this PR merged before the release freeze? Hope the answer to the question about assertions on
Great to see this merged!
Benchmark runs are scheduled for baseline = b55dd0e and contender = a83cc85. a83cc85 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…atch (#34294)

### Rationale for this change

Add the implementation of the Dataframe Interchange Protocol for `pyarrow.RecordBatch`. The protocol is already implemented for pyarrow.Table, see #14804.

### Are these changes tested?

Yes, tests are added to:
- python/pyarrow/tests/interchange/test_interchange_spec.py
- python/pyarrow/tests/interchange/test_conversion.py

* Closes: #33926

Authored-by: Alenka Frim <frim.alenka@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
This PR implements the Dataframe Interchange Protocol for `pyarrow.Table`. See: https://data-apis.org/dataframe-protocol/latest/index.html