GH-33346: [Python] DataFrame Interchange Protocol for pyarrow Table #14804

Merged: 49 commits, Jan 13, 2023
Commits
ca526a7
Produce a __dataframe__ object - squshed commits from #14613
AlenkaF Nov 30, 2022
d0ca2b1
Add column convert methods
AlenkaF Dec 1, 2022
c356cd1
Fix linter errors
AlenkaF Dec 1, 2022
8fb50c5
Add from_dataframe method details
AlenkaF Dec 5, 2022
4fab43b
Add tests for from_dataframe / pandas roundtrip
AlenkaF Dec 6, 2022
47ce2d6
Skip from_dataframe tests for older pandas versions
AlenkaF Dec 6, 2022
1c2955f
Add support for LargeStringArrays in Column class
AlenkaF Dec 12, 2022
0517a6d
Add test for uint and make changes to test_offset_of_sliced_array() a…
AlenkaF Dec 13, 2022
a7313fb
Prefix table metadata with pyarrow.
AlenkaF Dec 13, 2022
9583672
Update from_dataframe method
AlenkaF Dec 13, 2022
9e8733c
Try to add warnings to places where copies of data are being made
AlenkaF Dec 13, 2022
855ec8a
Update python/pyarrow/interchange/column.py
AlenkaF Dec 15, 2022
beec5aa
Expose from_dataframe in interchange/__init__.py
AlenkaF Dec 19, 2022
b11d84e
Add lost whitespace lines
AlenkaF Dec 19, 2022
6c9dce4
Revert commented categories in CategoricalDescription, column.py
AlenkaF Dec 19, 2022
8d91b67
Add _dtype attribute to __inti__ of the Column class and move all the…
AlenkaF Dec 19, 2022
4643f9b
Raise an error if nan_as_null=True
AlenkaF Dec 19, 2022
a93a46e
Linter corrections
AlenkaF Dec 19, 2022
0b231ea
Add better test coverage for test_mixed_dtypes and test_dtypes
AlenkaF Dec 19, 2022
d8ab902
Add better test coverage for test_pandas_roundtrip and add large_memo…
AlenkaF Dec 19, 2022
21af8fb
Add pyarrow roundtrip tests and make additional corrections to the co…
AlenkaF Dec 20, 2022
d6140d4
Correct large string handling and make smaller corrections in convert…
AlenkaF Dec 20, 2022
e0d1e63
Change dict arguments in protocol_df_chunk_to_pyarrow
AlenkaF Dec 21, 2022
6067fb3
Update dataframe.num_chunks() method to use to_batches
AlenkaF Dec 22, 2022
c6eb5f3
Check for sentinel values in the datetime more efficently
AlenkaF Dec 22, 2022
1a67177
Make bigger changes to how masks and arrays are constructed
AlenkaF Dec 22, 2022
51dcc49
Import from pandas.api.interchange
AlenkaF Dec 22, 2022
4879ef2
Add a check for use_nan, correct test using np.nan and put back check…
AlenkaF Dec 22, 2022
1cbd594
Add test coverage for pandas -> pyarrow conversion
AlenkaF Jan 4, 2023
a6b6e54
Rename test_extra.py to test_conversion.py
AlenkaF Jan 4, 2023
2e36185
Skip pandas -> pyarrow tests for older versions of pandas
AlenkaF Jan 4, 2023
4ca948d
Add test coverage for sliced table in pyarrow roundtrip
AlenkaF Jan 4, 2023
719ab88
Correct the handling of bitpacked booleans
AlenkaF Jan 5, 2023
91ea335
Small change in slicing parametrization
AlenkaF Jan 5, 2023
c74eb45
Add a RuntimeError for boolean and categorical columns in from_datafr…
AlenkaF Jan 5, 2023
c137337
Optimize datetime handling in from_dataframe
AlenkaF Jan 5, 2023
1e9cef9
Optimize buffers_to_array in from_dataframe.py
AlenkaF Jan 5, 2023
0c539a0
Apply suggestions from code review - Joris
AlenkaF Jan 10, 2023
6399be3
Add string column back to test_pandas_roundtrip for pandas versions 2…
AlenkaF Jan 10, 2023
9f68fe7
Fix linter error
AlenkaF Jan 10, 2023
b926066
Remove pandas specific comment for nan_as_null in dataframe.py
AlenkaF Jan 10, 2023
5c5d25e
Fix typo boolen -> categorical in categorical_column_to_dictionary
AlenkaF Jan 10, 2023
f2a65a6
Add a comment for float16 NotImplementedError in validity_buffer_nan_…
AlenkaF Jan 10, 2023
075e888
Update validity_buffer_nan_sentinel in python/pyarrow/interchange/fro…
AlenkaF Jan 10, 2023
efa12d6
Make change to the offset buffers part of buffers_to_array
AlenkaF Jan 10, 2023
858cadb
Linter correction
AlenkaF Jan 10, 2023
e937b4c
Update the handling of allow_copy keyword
AlenkaF Jan 10, 2023
1b5f248
Fix failing nightly test
AlenkaF Jan 12, 2023
9139444
Fix the fix for the failing test
AlenkaF Jan 12, 2023
20 changes: 20 additions & 0 deletions python/pyarrow/interchange/__init__.py
@@ -0,0 +1,20 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# flake8: noqa

from .from_dataframe import from_dataframe
107 changes: 107 additions & 0 deletions python/pyarrow/interchange/buffer.py
@@ -0,0 +1,107 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

from __future__ import annotations
import enum

import pyarrow as pa


class DlpackDeviceType(enum.IntEnum):
    """Integer enum for device type codes matching DLPack."""

    CPU = 1
    CUDA = 2
    CPU_PINNED = 3
    OPENCL = 4
    VULKAN = 7
    METAL = 8
    VPI = 9
    ROCM = 10


class _PyArrowBuffer:
    """
    Data in the buffer is guaranteed to be contiguous in memory.

    Note that there is no dtype attribute present, a buffer can be thought of
    as simply a block of memory. However, if the column that the buffer is
    attached to has a dtype that's supported by DLPack and ``__dlpack__`` is
    implemented, then that dtype information will be contained in the return
    value from ``__dlpack__``.

    This distinction is useful to support both (a) data exchange via DLPack
    on a buffer and (b) dtypes like variable-length strings which do not have
    a fixed number of bytes per element.
    """

    def __init__(self, x: pa.Buffer, allow_copy: bool = True) -> None:
        """
        Handle PyArrow Buffers.
        """
        self._x = x

    @property
    def bufsize(self) -> int:
        """
        Buffer size in bytes.
        """
        return self._x.size

    @property
    def ptr(self) -> int:
        """
        Pointer to start of the buffer as an integer.
        """
        return self._x.address

    def __dlpack__(self):
        """
        Produce DLPack capsule (see array API standard).

        Raises:
            - TypeError : if the buffer contains unsupported dtypes.
            - NotImplementedError : if DLPack support is not implemented

        Useful to have to connect to array libraries. Support optional because
        it's not completely trivial to implement for a Python-only library.
        """
        raise NotImplementedError("__dlpack__")

    def __dlpack_device__(self) -> tuple[DlpackDeviceType, int | None]:
        """
        Device type and device ID for where the data in the buffer resides.
        Uses device type codes matching DLPack.
        Note: must be implemented even if ``__dlpack__`` is not.
        """
        if self._x.is_cpu:
            return (DlpackDeviceType.CPU, None)
        else:
            raise NotImplementedError("__dlpack_device__")

    def __repr__(self) -> str:
        return (
            "PyArrowBuffer(" +
            str(
                {
                    "bufsize": self.bufsize,
                    "ptr": self.ptr,
                    "device": self.__dlpack_device__()[0].name,
                }
            ) +
            ")"
        )
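
Because the buffer object exposes only `bufsize` and `ptr`, a consumer has to read the raw memory itself. The following is a minimal sketch of that consumer side, not code from this PR. It uses only the standard library so that it runs without pyarrow installed: `StandInBuffer` and `read_buffer_bytes` are hypothetical names introduced for illustration, and `StandInBuffer` merely mimics the two properties of `_PyArrowBuffer` on top of a `ctypes` array.

```python
import ctypes


class StandInBuffer:
    """Hypothetical stand-in exposing the same two properties as _PyArrowBuffer."""

    def __init__(self, data: bytes) -> None:
        # Copy the bytes into a ctypes array so this object owns a contiguous
        # memory block that stays alive as long as the object does.
        self._storage = (ctypes.c_ubyte * len(data)).from_buffer_copy(data)

    @property
    def bufsize(self) -> int:
        # Buffer size in bytes.
        return ctypes.sizeof(self._storage)

    @property
    def ptr(self) -> int:
        # Address of the first byte, as a plain integer.
        return ctypes.addressof(self._storage)


def read_buffer_bytes(buf) -> bytes:
    """Copy ``bufsize`` bytes starting at ``ptr`` into a Python bytes object."""
    return ctypes.string_at(buf.ptr, buf.bufsize)


payload = b"arrow"
buf = StandInBuffer(payload)
copied = read_buffer_bytes(buf)
```

Note that `ctypes.string_at` makes a copy, and that the producer must keep the underlying memory alive while the consumer reads from `ptr`. A zero-copy consumer would instead wrap the address without copying, which this sketch does not attempt.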