ARROW-18152: [Python] DataFrame Interchange Protocol for pyarrow Table #14613
python/pyarrow/interchange/column.py
Outdated
# In case when the size of the chunk is such that the resulting
# list is one less chunk than n_chunks -> append an empty chunk
if i == n_chunks - 1:
    yield PyArrowColumn(pa.array([]), self._allow_copy)
Example: selecting 5 chunks for an array of length 12. If chunk_size=2, we get 6 chunks; if chunk_size=3, we get 4 chunks =) So we end up producing 4 chunks with chunk_size=3 plus an empty chunk.
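A minimal sketch of the arithmetic in the example above, using plain Python lists instead of Arrow arrays. The helper name `split_into_chunks` and the ceiling-division rule for `chunk_size` are assumptions for illustration, not the code in this PR.

```python
import math


def split_into_chunks(values, n_chunks):
    """Split `values` into exactly `n_chunks` lists (hypothetical helper)."""
    # Assumed rule: round the chunk size up so we never exceed n_chunks.
    chunk_size = max(1, math.ceil(len(values) / n_chunks))
    chunks = [
        values[start:start + chunk_size]
        for start in range(0, len(values), chunk_size)
    ]
    # Rounding up can leave us short of n_chunks, so pad with empty chunks.
    chunks.extend([] for _ in range(n_chunks - len(chunks)))
    return chunks


chunks = split_into_chunks(list(range(12)), 5)
lengths = [len(chunk) for chunk in chunks]
print(lengths)  # [3, 3, 3, 3, 0]
```

With length 12 and 5 requested chunks, the chunk size rounds up to 3, which yields 4 full chunks and one empty chunk to reach the requested count.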
FWIW we have a compliance suite for interchange protocol adopters over at data-apis/dataframe-interchange-tests. It's a bit awkward to use as you'd have to write a compatibility layer in
Thanks for the info @honno, the compliance suite is definitely something we have to use!
…le.__dataframe__ and do some minor corrections
…t, float with missing values
…d necessary definitions to separate implementation files
…andas timestamp in the tests
Will close this PR as I moved the work into another branch: #14804
Produce a __dataframe__ object
DataFrame, Column and Buffers class
pa.Table -> pd.DataFrame
What should be looked into after the initial test:
Update: Columns without missing values are defined as non-nullable for now.
Update: casting boolean column/array to uint8 solves this issue (boolean arrays are bit-packed, which is not supported by the protocol).
Update: Bit-width for the offset buffer dtype must be set to 32 instead of 64.
Update: Pandas implementation seems to expect the column of categories to be an instance of PandasColumn instead of a general __dataframe__ column object.
Update: Pandas implementation doesn't yet support bitmasks:
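To illustrate the boolean and bitmask updates above: Arrow stores booleans and validity masks bit-packed, least-significant bit first, while a consumer that expects one byte per value needs the expanded form. The sketch below shows both layouts using only the standard library. The helper names `pack_bits` and `unpack_to_bytes` are hypothetical and are not part of pyarrow or of this PR.

```python
def pack_bits(flags):
    """Pack booleans into bytes, least-significant bit first (Arrow's layout)."""
    packed = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_to_bytes(packed, length):
    """Expand a bit-packed buffer into one uint8 value (0 or 1) per element."""
    return bytes((packed[i // 8] >> (i % 8)) & 1 for i in range(length))


flags = [True, False, True, True, False, False, False, False, True, False]
packed = pack_bits(flags)                        # 10 values fit in 2 bytes
expanded = unpack_to_bytes(packed, len(flags))   # 10 values take 10 bytes

print(packed.hex())    # 0d01
print(list(expanded))  # [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]
```

Casting a boolean array to uint8 produces data in the second, byte-per-value layout, at the cost of a copy and eight times the memory for the data buffer.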
This code in the PR tested with pandas implementation as a consumer currently works with integers, floats, booleans, strings and timestamps without missing values:
Consume a __dataframe__ object
from_dataframe method