ARROW-18152: [Python] DataFrame Interchange Protocol for pyarrow Table #14613

Closed
wants to merge 29 commits into from

@AlenkaF AlenkaF commented Nov 9, 2022

Produce a __dataframe__ object

  • Implement the DataFrame, Column and Buffers classes
  • Test pa.Table -> pd.DataFrame

What should be looked into after the initial test:

  • Data without missing values (produce a validity buffer in case of no missing values)
    Update: Columns without missing values are defined as non-nullable for now.
  • Boolean values do not transfer correctly (only the first element is produced)
    Update: Casting the boolean column/array to uint8 solves this issue (boolean arrays are bit-packed, which is not supported by the protocol).
  • Variable-length strings
    Update: Bit-width for the offset buffer dtype must be set to 32 instead of 64.
  • DictionaryArray
    Update: The pandas implementation seems to expect the categories column to be an instance of PandasColumn rather than a general __dataframe__ column object.
File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 185, in categorical_column_to_series
    assert isinstance(cat_column, PandasColumn), "categories must be a PandasColumn"
AssertionError: categories must be a PandasColumn
  • Bitmasks
    Update: The pandas implementation doesn't yet support bitmasks:
File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 395, in buffer_to_ndarray
    raise NotImplementedError(f"Conversion for {dtype} is not yet supported.")
NotImplementedError: Conversion for (<DtypeKind.BOOL: 20>, 1, 'b', '=') is not yet supported.

Tested with the pandas implementation as the consumer, the code in this PR currently works for integers, floats, booleans, strings and timestamps without missing values:

import pyarrow as pa
import pandas as pd
from datetime import datetime as dt

table = pa.table(
    {
        "a": [1, 2, 3, 4],  # dtype kind INT = 0
        "b": [3, 4, 5, 6],  # dtype kind INT = 0
        "c": [1.5, 2.5, 3.5, 4.5],  # dtype kind FLOAT = 2
        "d": [9, 10, 11, 12],  # dtype kind INT = 0
        "e": [True, True, False, False],  # dtype kind BOOL = 20
        "f": ["a", "", "c", "d"],  # dtype kind STRING = 21
        "g": [dt(2007, 7, 13), dt(2007, 7, 14),
              dt(2007, 7, 15), dt(2007, 7, 16)]  # dtype kind DATETIME = 22
    }
)

exchange_df = table.__dataframe__()
exchange_df._df

# pyarrow.Table
# a: int64
# b: int64
# c: double
# d: int64
# e: bool
# f: string
# g: timestamp[us]
# ----
# a: [[1,2,3,4]]
# b: [[3,4,5,6]]
# c: [[1.5,2.5,3.5,4.5]]
# d: [[9,10,11,12]]
# e: [[true,true,false,false]]
# f: [["a","","c","d"]]
# g: [[2007-07-13 00:00:00.000000,2007-07-14 00:00:00.000000,2007-07-15 00:00:00.000000,2007-07-16 00:00:00.000000]]

from pandas.core.interchange.from_dataframe import from_dataframe
from_dataframe(exchange_df)
   a  b    c   d  e  f          g
0  1  3  1.5   9  1  a 2007-07-13
1  2  4  2.5  10  1    2007-07-14
2  3  5  3.5  11  0  c 2007-07-15
3  4  6  4.5  12  0  d 2007-07-16

Consume a __dataframe__ object

  • Implement from_dataframe method
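
A consumer typically starts by walking the interchange object's metadata before touching any buffers. The sketch below uses pandas (1.5 or later, assumed) as the producer and only calls methods defined by the protocol: `column_names()`, `num_rows()`, `get_column_by_name()`, and the `dtype` and `null_count` column properties. The `summarize` function is a hypothetical illustration, not the `from_dataframe` planned for pyarrow.

```python
import pandas as pd


def summarize(df_obj):
    """Hypothetical helper: collect per-column metadata via the interchange protocol."""
    xchg = df_obj.__dataframe__()
    summary = {}
    for name in xchg.column_names():
        col = xchg.get_column_by_name(name)
        # dtype is a tuple: (kind, bit width, format string, endianness).
        kind, bit_width, _fmt, _endianness = col.dtype
        summary[str(name)] = (int(kind), bit_width, int(col.null_count))
    return xchg.num_rows(), summary


producer = pd.DataFrame({"a": [1, 2, 3], "c": [1.5, 2.5, 3.5]})
n_rows, summary = summarize(producer)
print(n_rows, summary)
```

The integer kinds match the comments in the table example above (INT = 0, FLOAT = 2), so a real consumer could dispatch on them to pick a conversion routine per column.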

github-actions bot commented Nov 9, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Comment on lines 255 to 258
# In case the size of the chunk is such that the resulting
# list is one chunk shorter than n_chunks -> append an empty chunk
if i == n_chunks - 1:
    yield PyArrowColumn(pa.array([]), self._allow_copy)

Example: requesting 5 chunks for an array of length 12. With chunk_size=2 we get 6 chunks, with chunk_size=3 we get 4 chunks =) So we end up producing 4 chunks with chunk_size=3 plus an empty chunk.
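
The arithmetic in this example can be sketched in a few lines. This is a simplified stand-in that assumes the chunk size is the length divided by `n_chunks`, rounded up; plain Python lists take the place of Arrow arrays, and `split_into_n_chunks` is a hypothetical name rather than the method from the PR.

```python
import math


def split_into_n_chunks(values, n_chunks):
    """Hypothetical helper: split `values` into exactly `n_chunks` lists."""
    # Round the chunk size up so that no more than n_chunks are produced.
    chunk_size = max(1, math.ceil(len(values) / n_chunks))
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    # Rounding up can leave fewer chunks than requested; pad with empty ones.
    chunks.extend([] for _ in range(n_chunks - len(chunks)))
    return chunks


chunks = split_into_n_chunks(list(range(12)), 5)
print([len(c) for c in chunks])  # [3, 3, 3, 3, 0]
```

Unlike the reviewed lines, which append a single empty chunk, this sketch pads with as many empty chunks as needed to reach `n_chunks`.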

honno commented Nov 11, 2022

FWIW we have a compliance suite for interchange protocol adopters over at data-apis/dataframe-interchange-tests. It's a bit awkward to use as you'd have to write a compatibility layer in wrappers.py, but might be interesting. (Cool to see you're working on this!)

AlenkaF commented Nov 14, 2022

Thanks for the info @honno, the compliance suite is definitely something we have to use!
(Cool to see you continue working on the data API standards! 😉 )

AlenkaF commented Dec 1, 2022

Will close this PR as I moved the work into another branch: #14804
