ARROW-18152: [Python] DataFrame Interchange Protocol for pyarrow Table #14613

AlenkaF · 2022-11-09T08:19:28Z

Produce a `__dataframe__` object

Implement the DataFrame, Column and Buffer classes
Test pa.Table -> pd.DataFrame

What should be looked into after the initial test:

Data without missing values (produce a validity buffer in case of no missing values)
Update: Columns without missing values are defined as non-nullable for now.
Boolean values do not transfer correctly (only the first element is produced)
Update: Casting the boolean column/array to uint8 solves this issue (boolean arrays are bit-packed, which is not supported by the protocol).
Variable-length strings
Update: Bit-width for the offset buffer dtype must be set to 32 instead of 64.
DictionaryArray
Update: The pandas implementation seems to expect the column of categories to be an instance of PandasColumn instead of a general __dataframe__ column object.

File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 185, in categorical_column_to_series
    assert isinstance(cat_column, PandasColumn), "categories must be a PandasColumn"
AssertionError: categories must be a PandasColumn

Bitmasks
Update: The pandas implementation doesn't yet support bitmasks:

File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 395, in buffer_to_ndarray
    raise NotImplementedError(f"Conversion for {dtype} is not yet supported.")
NotImplementedError: Conversion for (<DtypeKind.BOOL: 20>, 1, 'b', '=') is not yet supported.
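
To illustrate the boolean item in the list above: Arrow stores booleans bit-packed (eight values per byte, least-significant bit first), whereas a consumer that reads one byte per value needs the layout that a cast to uint8 produces. The following is a minimal pure-Python sketch of the two layouts, not the code from this PR; `pack_bits`, `unpack_bits` and `as_uint8` are hypothetical helpers written for this illustration and are not part of pyarrow or pandas.

```python
def pack_bits(values):
    """Pack booleans eight per byte, least-significant bit first."""
    out = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_bits(data, length):
    """Read `length` booleans back from a bit-packed buffer."""
    return [bool(data[i // 8] >> (i % 8) & 1) for i in range(length)]


def as_uint8(values):
    """Store one byte per boolean, as a cast to uint8 would."""
    return bytes(int(v) for v in values)


column_e = [True, True, False, False]

packed = pack_bits(column_e)           # a single byte holds all four values
unpacked = unpack_bits(packed, len(column_e))
byte_per_value = as_uint8(column_e)    # four bytes, one per value

print(packed, byte_per_value)
```

A consumer that assumes one byte per element only sees a single byte in the packed buffer, so the bit-packed form cannot be read element by element without an explicit unpacking step.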

The code in this PR, tested with the pandas implementation as the consumer, currently works with integers, floats, booleans, strings and timestamps without missing values:

import pyarrow as pa
import pandas as pd
from datetime import datetime as dt

table = pa.table(
    {
        "a": [1, 2, 3, 4],  # dtype kind INT = 0
        "b": [3, 4, 5, 6],  # dtype kind INT = 0
        "c": [1.5, 2.5, 3.5, 4.5],  # dtype kind FLOAT = 2
        "d": [9, 10, 11, 12],  # dtype kind INT = 0
        "e": [True, True, False, False],  # dtype kind BOOL = 20
        "f": ["a", "", "c", "d"],  # dtype kind STRING = 21
        "g": [dt(2007, 7, 13), dt(2007, 7, 14),
              dt(2007, 7, 15), dt(2007, 7, 16)] # dtype kind DATETIME = 22
    }
)

exchange_df = table.__dataframe__()
exchange_df._df

# pyarrow.Table
# a: int64
# b: int64
# c: double
# d: int64
# e: bool
# f: string
# g: timestamp[us]
# ----
# a: [[1,2,3,4]]
# b: [[3,4,5,6]]
# c: [[1.5,2.5,3.5,4.5]]
# d: [[9,10,11,12]]
# e: [[true,true,false,false]]
# f: [["a","","c","d"]]
# g: [[2007-07-13 00:00:00.000000,2007-07-14 00:00:00.000000,2007-07-15 00:00:00.000000,2007-07-16 00:00:00.000000]]

from pandas.core.interchange.from_dataframe import from_dataframe
from_dataframe(exchange_df)
   a  b    c   d  e  f          g
0  1  3  1.5   9  1  a 2007-07-13
1  2  4  2.5  10  1    2007-07-14
2  3  5  3.5  11  0  c 2007-07-15
3  4  6  4.5  12  0  d 2007-07-16
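
On the variable-length string note earlier in this description: an Arrow `string` column is laid out as a data buffer of UTF-8 bytes plus an offsets buffer, and the plain `string` type uses 32-bit offsets (64-bit offsets belong to `large_string`). Below is a rough sketch of that layout for column `f`, using only the standard library. `encode_strings` and `decode_strings` are hypothetical helper names, and little-endian byte order is an assumption of this sketch.

```python
import struct


def encode_strings(values):
    """Return (offsets_buffer, data_buffer) with int32 offsets."""
    data = bytearray()
    offsets = [0]
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))
    # "<i" = little-endian signed 32-bit integer (assumed byte order)
    offsets_buffer = struct.pack(f"<{len(offsets)}i", *offsets)
    return offsets_buffer, bytes(data)


def decode_strings(offsets_buffer, data_buffer, bit_width=32):
    """Decode strings, interpreting the offsets with the given bit width."""
    code = {32: "i", 64: "q"}[bit_width]
    item_size = bit_width // 8
    count = len(offsets_buffer) // item_size
    offsets = struct.unpack(f"<{count}{code}", offsets_buffer[: count * item_size])
    return [
        data_buffer[start:end].decode("utf-8")
        for start, end in zip(offsets, offsets[1:])
    ]


offsets_buffer, data_buffer = encode_strings(["a", "", "c", "d"])

decoded = decode_strings(offsets_buffer, data_buffer, bit_width=32)
misread = decode_strings(offsets_buffer, data_buffer, bit_width=64)

print(decoded)
```

Reading the same offsets buffer with a 64-bit width (`misread`) no longer reproduces the original strings, which is why the offset dtype has to report the width the buffer was actually written with.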

Consume a `__dataframe__` object

Implement from_dataframe method

github-actions · 2022-11-09T08:19:48Z

https://issues.apache.org/jira/browse/ARROW-18152

github-actions · 2022-11-09T08:19:50Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

python/pyarrow/interchange/column.py

AlenkaF · 2022-11-09T08:34:06Z

+            # In case the chunk size is such that the resulting
+            # list has one chunk fewer than n_chunks -> append an empty chunk
+            if i == n_chunks - 1:
+                yield PyArrowColumn(pa.array([]), self._allow_copy)


Example: requesting 5 chunks for an array of length 12. If chunk_size=2, we get 6 chunks; if chunk_size=3, we get 4 chunks =) So we end up producing 4 chunks with chunk_size=3 plus an empty chunk to reach n_chunks=5.
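
The arithmetic in this example can be sketched in plain Python. The sketch assumes the chunk size is chosen as ceil(length / n_chunks), which matches the numbers in the comment but is not confirmed by the diff shown here. `split_into_n_chunks` is a hypothetical helper that works on lists rather than Arrow arrays.

```python
import math


def split_into_n_chunks(values, n_chunks):
    """Split `values` into exactly `n_chunks` lists, padding with empty ones."""
    # Assumed rule: smallest chunk size that needs at most n_chunks chunks.
    chunk_size = max(1, math.ceil(len(values) / n_chunks))
    chunks = [
        values[start:start + chunk_size]
        for start in range(0, len(values), chunk_size)
    ]
    # Pad with empty chunks so the consumer always receives n_chunks items.
    chunks.extend([] for _ in range(n_chunks - len(chunks)))
    return chunks


chunks = split_into_n_chunks(list(range(12)), n_chunks=5)
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [3, 3, 3, 3, 0]
```

Padding keeps the number of yielded chunks equal to the number requested, at the cost of handing the consumer one or more empty chunks.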

honno · 2022-11-11T09:41:33Z

FWIW we have a compliance suite for interchange protocol adopters over at data-apis/dataframe-interchange-tests. It's a bit awkward to use as you'd have to write a compatibility layer in wrappers.py, but might be interesting. (Cool to see you're working on this!)

AlenkaF · 2022-11-14T10:00:20Z

Thanks for the info @honno, the compliance suite is definitely something we have to use!
(Cool to see you continue working on the data API standards! 😉 )

AlenkaF · 2022-12-01T15:01:18Z

Will close this PR as I moved the work into another branch: #14804

github-actions bot added the Component: Python label Nov 9, 2022

AlenkaF force-pushed the ARROW-18152 branch from 157d0f5 to 2d7677b Compare November 15, 2022 10:56

This was referenced Nov 24, 2022

BUG: interchange bitmasks not supported in interchange/from_dataframe.py pandas-dev/pandas#49888

Closed

BUG: interchange categorical_column_to_series() should not accept only PandasColumn pandas-dev/pandas#49889

Closed

AlenkaF added 18 commits November 28, 2022 09:35

Initial skeleton for interchange package

854e114

Add a dataframe (PyArrowTableXchg) class methods

010d9a8

Add a subpackage for testing interchange protocol, add a test for Table.__dataframe__ and do some minor corrections

c0af309

Add column (PyArrowColumn) class methods

842ba3e

Add buffer (PyArrowBuffer) class methods, some changes and main tests

61eb00f

Make changes to buffer, column and dataframe classes

027012d

Make changes to from_dataframe.py skeleton

6f746fb

Add extra tests and make minor corrections

1669224

Run linter

473414e

Make changes to the code to make pa.Table -> pd.DataFrame work for int, float with missing values

cba4374

Correct linter error and add a check for TypedDict import

c021451

Use len(...) for the size of the pa.Array/pa.ChunkedArray

7e1e6bd

Add missing annotations import and remove TypedDict leftover

df9b24b

Remove bool bit_width check

494ffbc

Change buffer representation of boolean arrays

784d178

Remove dataframe protocol abstract classes and move the docstrings and necessary definitions to separate implementation files

33784da

Add missing changes to the class names and references

2860911

Add ColumnNullType = non nullable for columns without missing values

92a1765

AlenkaF added 11 commits November 28, 2022 09:35

Correct test error after describe_null() change

95f7f45

Change DtypeKind to be imported from column.py

964e9da

Add change for string dtype and bitmask - not sure about it though

3658088

Add a change for dictionary arrays

caefeed

Add corrections for timestamp dtype

8871d11

Change size() to size

ad9b2e8

Add schema to empty record batch and keep the number of chunks fixed to n_chunks

2b83dd8

Add offset for sliced array with a test and use datetime instead of pandas timestamp in the tests

4f150ef

Fix linter errors

1a456fe

Add a skip for the test using from_dataframe() added in pandas versions < 1.5.0

2632c55

Make changes to the from_dataframe.py skeleton

f177b15

AlenkaF force-pushed the ARROW-18152 branch from da7b4b4 to f177b15 Compare November 28, 2022 08:36

AlenkaF added a commit to AlenkaF/arrow that referenced this pull request Dec 1, 2022

Produce a __dataframe__ object - squashed commits from apache#14613

fb26f21

AlenkaF mentioned this pull request Dec 1, 2022

GH-33346: [Python] DataFrame Interchange Protocol for pyarrow Table #14804

Merged

AlenkaF closed this Dec 1, 2022

asfimport mentioned this pull request Jan 10, 2023

[Python] DataFrame Interchange Protocol for pyarrow Table #33346

Closed

AlenkaF added a commit to AlenkaF/arrow that referenced this pull request Jan 11, 2023

Produce a __dataframe__ object - squashed commits from apache#14613

ca526a7

AlenkaF deleted the ARROW-18152 branch June 5, 2023 07:04
