GH-33346: [Python] DataFrame Interchange Protocol for pyarrow Table #14804
Conversation
One more thing I will try to look at ASAP is the copies being made in the implementation, and I will add warnings in case copies are made.
(partial review)
```python
if sys.version_info >= (3, 8):
    from typing import TypedDict
else:
    from typing_extensions import TypedDict
```
Is this a third party package that needs to be installed?
I think it is part of the Python standard library: https://docs.python.org/3/library/typing.html. Looking at it, the typing module is supported for Python 3.5 and up, though TypedDict itself was only added in 3.8. Will change the check.
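As a sketch, the same fallback can also be written without an explicit version check: try the stdlib import first and fall back to the `typing_extensions` backport (a third-party package that would need to be installed on older interpreters). The `ColumnBuffers` class below is a hypothetical name for illustration, not code from the PR.

```python
# Alternative to an explicit version check: attempt the stdlib import and
# fall back to the typing_extensions backport. TypedDict was added to the
# stdlib typing module in Python 3.8.
try:
    from typing import TypedDict
except ImportError:
    from typing_extensions import TypedDict


# Illustrative use (hypothetical name, not from the PR):
class ColumnBuffers(TypedDict):
    data: str
    validity: str
```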
python/pyarrow/interchange/column.py (outdated)

```python
)
i += 1

elif isinstance(self._col, pa.ChunkedArray):
```
Above you combined the chunks when passing a ChunkedArray to the constructor, so that means `self._col` can currently never be a ChunkedArray? But maybe we should try to support that and not always upfront call `combine_chunks`?
Above you combined the chunks when passing a ChunkedArray to the constructor, so that means self._col can currently never be a ChunkedArray?
Oh, I missed this part here when moving the combining of chunks to the constructor in 71ca596. Will correct.
But maybe we should try to support that and not always upfront call combine_chunks?
Just to summarize what we talked about (will add it to the code as a comment): when the dataframe is being consumed with the `from_dataframe` method, it iterates through the chunks of the dataframe:
```python
for chunk in df.get_chunks():
```
arrow/python/pyarrow/interchange/dataframe.py
Lines 161 to 190 in 71ca596
```python
def get_chunks(
    self, n_chunks: Optional[int] = None
) -> Iterable[_PyArrowDataFrame]:
    """
    Return an iterator yielding the chunks.

    By default (None), yields the chunks that the data is stored as by the
    producer. If given, ``n_chunks`` must be a multiple of
    ``self.num_chunks()``, meaning the producer must subdivide each chunk
    before yielding it.

    Note that the producer must ensure that all columns are chunked the
    same way.
    """
    if n_chunks and n_chunks > 1:
        chunk_size = self.num_rows() // n_chunks
        if self.num_rows() % n_chunks != 0:
            chunk_size += 1
        batches = self._df.to_batches(max_chunksize=chunk_size)
        # In case the chunk size is such that the resulting list has one
        # chunk fewer than n_chunks -> append an empty chunk
        if len(batches) == n_chunks - 1:
            batches.append(pa.record_batch([[]], schema=self._df.schema))
    else:
        batches = self._df.to_batches()
    iterator_tables = [
        _PyArrowDataFrame(
            pa.Table.from_batches([batch]), self._nan_as_null, self._allow_copy
        )
        for batch in batches
    ]
    return iterator_tables
```
and so the consumer always converts a part of the dataframe that is not chunked. For that reason there is no need to support chunking at the column level: the only way to get a chunked array is by calling the Column class directly, not through the DataFrame class.
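The chunk-size arithmetic in `get_chunks` above (ceil division, plus an empty trailing chunk when `to_batches` yields one batch fewer than requested) can be sketched in isolation. `split_row_counts` is a hypothetical helper for illustration, not part of the PR:

```python
def split_row_counts(num_rows: int, n_chunks: int) -> list:
    # Mirror the logic in get_chunks: ceil-divide the rows into n_chunks
    # pieces, as to_batches(max_chunksize=...) would.
    chunk_size = num_rows // n_chunks
    if num_rows % n_chunks != 0:
        chunk_size += 1
    sizes = []
    remaining = num_rows
    while remaining > 0:
        sizes.append(min(chunk_size, remaining))
        remaining -= chunk_size
    # When the rounded-up chunk size leaves one chunk fewer than requested,
    # pad with an empty chunk, like the record_batch([[]]) append above.
    if len(sizes) == n_chunks - 1:
        sizes.append(0)
    return sizes
```

For example, 10 rows split into 3 chunks gives sizes of 4, 4 and 2, while 6 rows into 4 chunks rounds up to size 2 and needs the empty padding chunk.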
@jorisvandenbossche I think I addressed all the comments and topics we have talked about. Some information about the recent changes:

So I think this PR is ready for another round of review 🙏
Great work! Continuing my review (now mostly looked at `from_dataframe`).
@jorisvandenbossche I have gone through all of the suggestions; I think this PR is ready for another round of review. Thank you!
- String dtype is removed from the pandas roundtrip tests, as pandas defines `.size()` as a method in column.py but calls it as a property in from_dataframe.py, so the roundtrip with pandas errors for string dtypes.
It should be possible to add this back now (since it is fixed in pandas main), but skip the string dtype depending on the pandas version (only the 2.0.0.dev version runs the test)
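Such a version gate could look like the following sketch; `string_roundtrip_supported` and the parsing helper are hypothetical names for illustration, not the actual test code (real code would likely use `packaging.version` instead of hand parsing):

```python
def _parse_version(version: str) -> tuple:
    # Crude parse: keep only the digits of each dot-separated piece, so
    # "2.0.0.dev0" -> (2, 0, 0, 0) and "1.5.3" -> (1, 5, 3).
    parts = []
    for piece in version.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def string_roundtrip_supported(pandas_version: str) -> bool:
    # The .size fix landed in pandas main, so only the 2.0.0.dev nightly
    # (and later releases) would run the string-dtype roundtrip test.
    return _parse_version(pandas_version) >= (2, 0, 0)
```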
@github-actions crossbow submit test-conda-python-3.9-pandas-upstream_devel

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

Revision: 04d6f3b Submitted crossbow builds: ursacomputing/crossbow @ actions-693c58f0a7

@jorisvandenbossche the code review suggestions are all addressed. If I am not mistaken, you still want to review the tests?

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

Revision: 1fe490c Submitted crossbow builds: ursacomputing/crossbow @ actions-86fb33df92
…m_dataframe.py Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Force-pushed from 1fe490c to e937b4c
@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

1 similar comment

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

Revision: e937b4c Submitted crossbow builds: ursacomputing/crossbow @ actions-2377096f27

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

Revision: 1b5f248 Submitted crossbow builds: ursacomputing/crossbow @ actions-0169982f7e

@github-actions crossbow submit test-conda-python-3.8-pandas-nightly

Revision: 9139444 Submitted crossbow builds: ursacomputing/crossbow @ actions-17d90bfc40
```python
)
pandas_df = pandas_from_dataframe(table)
result = pi.from_dataframe(pandas_df)
```
Is there an `assert table.equals(result)` missing here (like there is in the test above)?
Because pandas defines an int64 offset for what is in our case a normal string (not a large string), the dtype at the end of the roundtrip becomes `large_string`. Due to that, the assertion is done with a pylist for the values, and separately for the dtype (first normal string, then large string).
@jorisvandenbossche do you have any blocking issues I can correct today to get this PR merged before the release freeze? Hope the answer to the question about assertions on
Great to see this merged!
Benchmark runs are scheduled for baseline = b55dd0e and contender = a83cc85. a83cc85 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…atch (#34294)

### Rationale for this change

Add the implementation of the Dataframe Interchange Protocol for `pyarrow.RecordBatch`. The protocol is already implemented for pyarrow.Table, see #14804.

### Are these changes tested?

Yes, tests are added to:
- python/pyarrow/tests/interchange/test_interchange_spec.py
- python/pyarrow/tests/interchange/test_conversion.py

* Closes: #33926

Authored-by: Alenka Frim <frim.alenka@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
This PR implements the Dataframe Interchange Protocol for `pyarrow.Table`. See: https://data-apis.org/dataframe-protocol/latest/index.html