GH-33346: [Python] DataFrame Interchange Protocol for pyarrow Table #14804

Merged 49 commits on Jan 13, 2023
Changes from 2 commits
Commits
ca526a7
Produce a __dataframe__ object - squshed commits from #14613
AlenkaF Nov 30, 2022
d0ca2b1
Add column convert methods
AlenkaF Dec 1, 2022
c356cd1
Fix linter errors
AlenkaF Dec 1, 2022
8fb50c5
Add from_dataframe method details
AlenkaF Dec 5, 2022
4fab43b
Add tests for from_dataframe / pandas roundtrip
AlenkaF Dec 6, 2022
47ce2d6
Skip from_dataframe tests for older pandas versions
AlenkaF Dec 6, 2022
1c2955f
Add support for LargeStringArrays in Column class
AlenkaF Dec 12, 2022
0517a6d
Add test for uint and make changes to test_offset_of_sliced_array() a…
AlenkaF Dec 13, 2022
a7313fb
Prefix table metadata with pyarrow.
AlenkaF Dec 13, 2022
9583672
Update from_dataframe method
AlenkaF Dec 13, 2022
9e8733c
Try to add warnings to places where copies of data are being made
AlenkaF Dec 13, 2022
855ec8a
Update python/pyarrow/interchange/column.py
AlenkaF Dec 15, 2022
beec5aa
Expose from_dataframe in interchange/__init__.py
AlenkaF Dec 19, 2022
b11d84e
Add lost whitespace lines
AlenkaF Dec 19, 2022
6c9dce4
Revert commented categories in CategoricalDescription, column.py
AlenkaF Dec 19, 2022
8d91b67
Add _dtype attribute to __inti__ of the Column class and move all the…
AlenkaF Dec 19, 2022
4643f9b
Raise an error if nan_as_null=True
AlenkaF Dec 19, 2022
a93a46e
Linter corrections
AlenkaF Dec 19, 2022
0b231ea
Add better test coverage for test_mixed_dtypes and test_dtypes
AlenkaF Dec 19, 2022
d8ab902
Add better test coverage for test_pandas_roundtrip and add large_memo…
AlenkaF Dec 19, 2022
21af8fb
Add pyarrow roundtrip tests and make additional corrections to the co…
AlenkaF Dec 20, 2022
d6140d4
Correct large string handling and make smaller corrections in convert…
AlenkaF Dec 20, 2022
e0d1e63
Change dict arguments in protocol_df_chunk_to_pyarrow
AlenkaF Dec 21, 2022
6067fb3
Update dataframe.num_chunks() method to use to_batches
AlenkaF Dec 22, 2022
c6eb5f3
Check for sentinel values in the datetime more efficently
AlenkaF Dec 22, 2022
1a67177
Make bigger changes to how masks and arrays are constructed
AlenkaF Dec 22, 2022
51dcc49
Import from pandas.api.interchange
AlenkaF Dec 22, 2022
4879ef2
Add a check for use_nan, correct test using np.nan and put back check…
AlenkaF Dec 22, 2022
1cbd594
Add test coverage for pandas -> pyarrow conversion
AlenkaF Jan 4, 2023
a6b6e54
Rename test_extra.py to test_conversion.py
AlenkaF Jan 4, 2023
2e36185
Skip pandas -> pyarrow tests for older versions of pandas
AlenkaF Jan 4, 2023
4ca948d
Add test coverage for sliced table in pyarrow roundtrip
AlenkaF Jan 4, 2023
719ab88
Correct the handling of bitpacked booleans
AlenkaF Jan 5, 2023
91ea335
Small change in slicing parametrization
AlenkaF Jan 5, 2023
c74eb45
Add a RuntimeError for boolean and categorical columns in from_datafr…
AlenkaF Jan 5, 2023
c137337
Optimize datetime handling in from_dataframe
AlenkaF Jan 5, 2023
1e9cef9
Optimize buffers_to_array in from_dataframe.py
AlenkaF Jan 5, 2023
0c539a0
Apply suggestions from code review - Joris
AlenkaF Jan 10, 2023
6399be3
Add string column back to test_pandas_roundtrip for pandas versions 2…
AlenkaF Jan 10, 2023
9f68fe7
Fix linter error
AlenkaF Jan 10, 2023
b926066
Remove pandas specific comment for nan_as_null in dataframe.py
AlenkaF Jan 10, 2023
5c5d25e
Fix typo boolen -> categorical in categorical_column_to_dictionary
AlenkaF Jan 10, 2023
f2a65a6
Add a comment for float16 NotImplementedError in validity_buffer_nan_…
AlenkaF Jan 10, 2023
075e888
Update validity_buffer_nan_sentinel in python/pyarrow/interchange/fro…
AlenkaF Jan 10, 2023
efa12d6
Make change to the offset buffers part of buffers_to_array
AlenkaF Jan 10, 2023
858cadb
Linter correction
AlenkaF Jan 10, 2023
e937b4c
Update the handling of allow_copy keyword
AlenkaF Jan 10, 2023
1b5f248
Fix failing nightly test
AlenkaF Jan 12, 2023
9139444
Fix the fix for the failing test
AlenkaF Jan 12, 2023
46 changes: 37 additions & 9 deletions python/pyarrow/tests/interchange/test_conversion.py
@@ -171,11 +171,6 @@ def test_pandas_roundtrip(uint, int, float, np_float):
"c": pa.array(np.array(arr, dtype=np_float), type=float),
}
)
if Version(pd.__version__) >= Version("2.0"):
# See https://github.com/pandas-dev/pandas/issues/50554
table["d"] = ["a", "", "c"]
# large string is not supported by pandas implementation

from pandas.api.interchange import (
from_dataframe as pandas_from_dataframe
)
@@ -192,6 +187,34 @@ def test_pandas_roundtrip(uint, int, float, np_float):
assert table_protocol.column_names() == result_protocol.column_names()


@pytest.mark.pandas
def test_roundtrip_pandas_string():
# See https://github.com/pandas-dev/pandas/issues/50554
if Version(pd.__version__) < Version("1.6"):
pytest.skip(" Column.size() called as a method in pandas 2.0.0")

# large string is not supported by pandas implementation
table = pa.table({"a": pa.array(["a", "", "c"])})

from pandas.api.interchange import (
from_dataframe as pandas_from_dataframe
)
pandas_df = pandas_from_dataframe(table)
result = pi.from_dataframe(pandas_df)

Member

Is there an `assert table.equals(result)` missing here (like there is in the test above)?

Member Author

@AlenkaF AlenkaF Jan 12, 2023

Because pandas defines int64 offsets for what is, in our case, a normal (not large) string column, the dtype at the end of the roundtrip becomes large_string. For that reason, the values are compared with `to_pylist()` and the dtypes are asserted separately (first normal string, then large string).

assert result[0].to_pylist() == table[0].to_pylist()
assert pa.types.is_string(table[0].type)
assert pa.types.is_large_string(result[0].type)

table_protocol = table.__dataframe__()
result_protocol = result.__dataframe__()

assert table_protocol.num_columns() == result_protocol.num_columns()
assert table_protocol.num_rows() == result_protocol.num_rows()
assert table_protocol.num_chunks() == result_protocol.num_chunks()
assert table_protocol.column_names() == result_protocol.column_names()


@pytest.mark.pandas
def test_roundtrip_pandas_boolean():
if Version(pd.__version__) < Version("1.5.0"):
@@ -219,16 +242,21 @@ def test_roundtrip_pandas_boolean():
@pytest.mark.pandas
@pytest.mark.parametrize("unit", ['s', 'ms', 'us', 'ns'])
def test_roundtrip_pandas_datetime(unit):
# pandas < 2.0 always creates datetime64 in "ns"
# resolution, timezones are not yet supported in pandas

if Version(pd.__version__) < Version("1.5.0"):
pytest.skip("__dataframe__ added to pandas in 1.5.0")
from datetime import datetime as dt

# timezones not included as they are not yet supported in
# the pandas implementation
dt_arr = [dt(2007, 7, 13), dt(2007, 7, 14), dt(2007, 7, 15)]
table = pa.table({"a": pa.array(dt_arr, type=pa.timestamp(unit))})
expected = pa.table({"a": pa.array(dt_arr, type=pa.timestamp('ns'))})

if Version(pd.__version__) < Version("1.6"):
# pandas < 2.0 always creates datetime64 in "ns"
# resolution
expected = pa.table({"a": pa.array(dt_arr, type=pa.timestamp('ns'))})
else:
expected = table

from pandas.api.interchange import (
from_dataframe as pandas_from_dataframe