Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-33346: [Python] DataFrame Interchange Protocol for pyarrow Table #14804

Merged
merged 49 commits into from
Jan 13, 2023

Conversation

AlenkaF
Copy link
Member

@AlenkaF AlenkaF commented Dec 1, 2022

This PR implements the Dataframe Interchange Protocol for pyarrow.Table.
See: https://data-apis.org/dataframe-protocol/latest/index.html

@github-actions
Copy link

github-actions bot commented Dec 1, 2022

@github-actions
Copy link

github-actions bot commented Dec 1, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@AlenkaF
Copy link
Member Author

AlenkaF commented Dec 13, 2022

One more thing I will try to look at asap is the copies being made in the implementation and will add warnings in case allow_copy=False.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(partial review)

if sys.version_info >= (3, 8):
from typing import TypedDict
else:
from typing_extensions import TypedDict
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a third party package that needs to be installed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is part of the Python standard library: https://docs.python.org/3/library/typing.html.
Looking at it typing should be supported for versions 3.5 and up. Will change the check.

python/pyarrow/interchange/column.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/column.py Show resolved Hide resolved
python/pyarrow/interchange/column.py Outdated Show resolved Hide resolved
)
i += 1

elif isinstance(self._col, pa.ChunkedArray):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above you combined the chunks when passing a ChunkedArray to the constructor, so that means self._col can currently never be a ChunkedArray?

But maybe we should try to support that and not always upfront call combine_chunks?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above you combined the chunks when passing a ChunkedArray to the constructor, so that means self._col can currently never be a ChunkedArray?

Oh, I missed this part here when moving the combining of chunks to the constructor in 71ca596
Will correct.

But maybe we should try to support that and not always upfront call combine_chunks?

Just to summarize what we talked about (will add it to the code as a comment): when the dataframe is being consumed with from_dataframe method it iterates through the chunks of the dataframe:

for chunk in df.get_chunks():
def get_chunks(
self, n_chunks: Optional[int] = None
) -> Iterable[_PyArrowDataFrame]:
"""
Return an iterator yielding the chunks.
By default (None), yields the chunks that the data is stored as by the
producer. If given, ``n_chunks`` must be a multiple of
``self.num_chunks()``, meaning the producer must subdivide each chunk
before yielding it.
Note that the producer must ensure that all columns are chunked the
same way.
"""
if n_chunks and n_chunks > 1:
chunk_size = self.num_rows() // n_chunks
if self.num_rows() % n_chunks != 0:
chunk_size += 1
batches = self._df.to_batches(max_chunksize=chunk_size)
# In case when the size of the chunk is such that the resulting
# list is one less chunk then n_chunks -> append an empty chunk
if len(batches) == n_chunks - 1:
batches.append(pa.record_batch([[]], schema=self._df.schema))
else:
batches = self._df.to_batches()
iterator_tables = [_PyArrowDataFrame(
pa.Table.from_batches([batch]), self._nan_as_null, self._allow_copy
)
for batch in batches
]
return iterator_tables

and so is always converting part of the dataframe that is not chunked. For that reason there is no need to support chunking on the column level as it is only possible to get a chunked array if calling Column class directly, not through the DataFrame class.

python/pyarrow/interchange/column.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/column.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/column.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/dataframe.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/dataframe.py Outdated Show resolved Hide resolved
@AlenkaF
Copy link
Member Author

AlenkaF commented Dec 20, 2022

@jorisvandenbossche I think I addressed all the comments and topics we have talked about. Some information about the recent changes:

  • Large string dtype is working correctly. Pandas doesn't support this type of data so I changed the adaptation and added a test to check for error due to not supported dtype. PyArrow roundtrip works correctly - I had to construct an array with pyarrow.LargeStringArray.from_buffers and not pyarrow.Array.from_buffers in case of large string dtype.
  • String dtype is removed from pandas roundtrip tests as pandas defines .size() as a method in column.py but calls it as a property in from_dataframe.py and so the roundtrip with pandas errors for string dtypes.
  • I have added better test coverage with parametrizing some of the tests and also using strategies for one of the test in test_interchange_spec.py.
  • Now an error is raised if nan_as_null=True, I also added a test for it.
  • I have exposed from_dataframe in interchange/__init__.py, hope I have done it correctly.

So I think this PR is ready for another round of review 🙏

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Continuing my review (now mostly looked at from_dataframe)

python/pyarrow/interchange/from_dataframe.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/dataframe.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/from_dataframe.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/from_dataframe.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/from_dataframe.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/from_dataframe.py Outdated Show resolved Hide resolved
python/pyarrow/tests/interchange/test_extra.py Outdated Show resolved Hide resolved
python/pyarrow/tests/interchange/test_extra.py Outdated Show resolved Hide resolved
python/pyarrow/tests/interchange/test_extra.py Outdated Show resolved Hide resolved
python/pyarrow/tests/interchange/test_extra.py Outdated Show resolved Hide resolved
@AlenkaF
Copy link
Member Author

AlenkaF commented Jan 5, 2023

@jorisvandenbossche I have gone through all of the suggestions, I think this PR is ready for another round of review. Thank you!

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • String dtype is removed from pandas roundtrip tests as pandas defines .size() as a method in column.py but calls it as a property in from_dataframe.py and so the roundtrip with pandas errors for string dtypes.

It should be possible to add this back now (since it is fixed in pandas main), but skip the string dtype depending on the pandas version (only the 2.0.0.dev version runs the test)

python/pyarrow/interchange/column.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/dataframe.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/from_dataframe.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/from_dataframe.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/from_dataframe.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/from_dataframe.py Show resolved Hide resolved
python/pyarrow/interchange/from_dataframe.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/from_dataframe.py Outdated Show resolved Hide resolved
python/pyarrow/interchange/from_dataframe.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

@github-actions crossbow submit test-conda-python-3.9-pandas-upstream_devel

@github-actions
Copy link

Unable to match any tasks for `test-conda-python-3.9-pandas-upstream_devel`
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/3881649186

@jorisvandenbossche
Copy link
Member

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

@github-actions
Copy link

Revision: 04d6f3b

Submitted crossbow builds: ursacomputing/crossbow @ actions-693c58f0a7

Task Status
test-conda-python-3.8-pandas-nightly Github Actions

@AlenkaF
Copy link
Member Author

AlenkaF commented Jan 10, 2023

@jorisvandenbossche the code review suggestions are all addressed. If I am not mistaken you still want to review the tests?

@jorisvandenbossche
Copy link
Member

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

@github-actions
Copy link

Revision: 1fe490c

Submitted crossbow builds: ursacomputing/crossbow @ actions-86fb33df92

Task Status
test-conda-python-3.8-pandas-nightly Github Actions

@AlenkaF AlenkaF changed the title ARROW-18152: [Python] DataFrame Interchange Protocol for pyarrow Table GH-33346: [Python] DataFrame Interchange Protocol for pyarrow Table Jan 11, 2023
@github-actions
Copy link

@AlenkaF AlenkaF force-pushed the ARROW-18152-second branch from 1fe490c to e937b4c Compare January 11, 2023 09:28
@jorisvandenbossche
Copy link
Member

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

1 similar comment
@jorisvandenbossche
Copy link
Member

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

@github-actions
Copy link

Revision: e937b4c

Submitted crossbow builds: ursacomputing/crossbow @ actions-2377096f27

Task Status
test-conda-python-3.8-pandas-nightly Github Actions

@AlenkaF
Copy link
Member Author

AlenkaF commented Jan 12, 2023

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

@github-actions
Copy link

Revision: 1b5f248

Submitted crossbow builds: ursacomputing/crossbow @ actions-0169982f7e

Task Status
test-conda-python-3.8-pandas-nightly Github Actions

@AlenkaF
Copy link
Member Author

AlenkaF commented Jan 12, 2023

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

@github-actions
Copy link

Revision: 9139444

Submitted crossbow builds: ursacomputing/crossbow @ actions-17d90bfc40

Task Status
test-conda-python-3.8-pandas-nightly Github Actions

)
pandas_df = pandas_from_dataframe(table)
result = pi.from_dataframe(pandas_df)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a assert table.equals(result) missing here (like there is in the test above)?

Copy link
Member Author

@AlenkaF AlenkaF Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to pandas defining int64 offset for what is in our case normal string, not large, the dtype that is at the end of the roundtrip becomes large_string. Due to that, the assertion is done with pylist for the values and separate for dtype (first inormal string, then large string).

@AlenkaF
Copy link
Member Author

AlenkaF commented Jan 13, 2023

@jorisvandenbossche do you have any blocking issues I can correct today to make this PR merged before the release freeze? Hope the answer to the question about assertions on test_roundtrip_pandas_string was clear enough.

@jorisvandenbossche jorisvandenbossche merged commit a83cc85 into apache:master Jan 13, 2023
@AlenkaF AlenkaF deleted the ARROW-18152-second branch January 13, 2023 10:33
@rok
Copy link
Member

rok commented Jan 13, 2023

Great to see this merged!

@ursabot
Copy link

ursabot commented Jan 14, 2023

Benchmark runs are scheduled for baseline = b55dd0e and contender = a83cc85. a83cc85 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.54% ⬆️0.03%] test-mac-arm
[Finished ⬇️2.04% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.16% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] a83cc852 ec2-t3-xlarge-us-east-2
[Finished] a83cc852 test-mac-arm
[Finished] a83cc852 ursa-i9-9960x
[Finished] a83cc852 ursa-thinkcentre-m75q
[Finished] b55dd0e6 ec2-t3-xlarge-us-east-2
[Finished] b55dd0e6 test-mac-arm
[Finished] b55dd0e6 ursa-i9-9960x
[Finished] b55dd0e6 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

jorisvandenbossche pushed a commit that referenced this pull request Feb 28, 2023
…atch (#34294)

### Rationale for this change
Add the implementation of the Dataframe Interchange Protocol for `pyarrow.RecordBatch`. The protocol is already implemented for pyarrow.Table, see #14804.

### Are these changes tested?
Yes, tests are added to:

- python/pyarrow/tests/interchange/test_interchange_spec.py
- python/pyarrow/tests/interchange/test_conversion.py
* Closes: #33926

Authored-by: Alenka Frim <frim.alenka@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants