Skip to content

Commit f141709

Browse files
pitrouraulcd
authored andcommitted
GH-38607: [Python] Disable PyExtensionType autoload (#38608)
### Rationale for this change PyExtensionType autoload is really a misfeature. It creates PyArrow-specific extension types, though using ExtensionType is almost the same complexity while allowing deserialization from non-PyArrow software. ### What changes are included in this PR? * Disable PyExtensionType autoloading and deprecate PyExtensionType instantiation. * Update the docs to emphasize ExtensionType. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * Closes: #38607 Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
1 parent 5a37e74 commit f141709

File tree

5 files changed

+377
-245
lines changed

5 files changed

+377
-245
lines changed

docs/source/python/extending_types.rst

+58-91
Original file line numberDiff line numberDiff line change
@@ -68,34 +68,43 @@ message).
6868
See the :ref:`format_metadata_extension_types` section of the metadata
6969
specification for more details.
7070

71-
Pyarrow allows you to define such extension types from Python.
72-
73-
There are currently two ways:
74-
75-
* Subclassing :class:`PyExtensionType`: the (de)serialization is based on pickle.
76-
This is a good option for an extension type that is only used from Python.
77-
* Subclassing :class:`ExtensionType`: this allows to give a custom
78-
Python-independent name and serialized metadata, that can potentially be
79-
recognized by other (non-Python) Arrow implementations such as PySpark.
71+
Pyarrow allows you to define such extension types from Python by subclassing
72+
:class:`ExtensionType` and giving the derived class its own extension name
73+
and serialization mechanism. The extension name and serialized metadata
74+
can potentially be recognized by other (non-Python) Arrow implementations
75+
such as PySpark.
8076

8177
For example, we could define a custom UUID type for 128-bit numbers which can
82-
be represented as ``FixedSizeBinary`` type with 16 bytes.
83-
Using the first approach, we create a ``UuidType`` subclass, and implement the
84-
``__reduce__`` method to ensure the class can be properly pickled::
78+
be represented as ``FixedSizeBinary`` type with 16 bytes::
8579

86-
class UuidType(pa.PyExtensionType):
80+
class UuidType(pa.ExtensionType):
8781

8882
def __init__(self):
89-
pa.PyExtensionType.__init__(self, pa.binary(16))
83+
super().__init__(pa.binary(16), "my_package.uuid")
84+
85+
def __arrow_ext_serialize__(self):
86+
# Since we don't have a parameterized type, we don't need extra
87+
# metadata to be deserialized
88+
return b''
9089

91-
def __reduce__(self):
92-
return UuidType, ()
90+
@classmethod
91+
def __arrow_ext_deserialize__(cls, storage_type, serialized):
92+
# Sanity checks, not required but illustrate the method signature.
93+
assert storage_type == pa.binary(16)
94+
assert serialized == b''
95+
# Return an instance of this subclass given the serialized
96+
# metadata.
97+
return UuidType()
98+
99+
The special methods ``__arrow_ext_serialize__`` and ``__arrow_ext_deserialize__``
100+
define the serialization of an extension type instance. For non-parametric
101+
types such as the above, the serialization payload can be left empty.
93102

94103
This can now be used to create arrays and tables holding the extension type::
95104

96105
>>> uuid_type = UuidType()
97106
>>> uuid_type.extension_name
98-
'arrow.py_extension_type'
107+
'my_package.uuid'
99108
>>> uuid_type.storage_type
100109
FixedSizeBinaryType(fixed_size_binary[16])
101110

@@ -112,8 +121,11 @@ This can now be used to create arrays and tables holding the extension type::
112121
]
113122

114123
This array can be included in RecordBatches, sent over IPC and received in
115-
another Python process. The custom UUID type will be preserved there, as long
116-
as the definition of the class is available (the type can be unpickled).
124+
another Python process. The receiving process must explicitly register the
125+
extension type for deserialization, otherwise it will fall back to the
126+
storage type::
127+
128+
>>> pa.register_extension_type(UuidType())
117129

118130
For example, creating a RecordBatch and writing it to a stream using the
119131
IPC protocol::
@@ -129,43 +141,12 @@ and then reading it back yields the proper type::
129141
>>> with pa.ipc.open_stream(buf) as reader:
130142
... result = reader.read_all()
131143
>>> result.column('ext').type
132-
UuidType(extension<arrow.py_extension_type>)
133-
134-
We can define the same type using the other option::
135-
136-
class UuidType(pa.ExtensionType):
137-
138-
def __init__(self):
139-
pa.ExtensionType.__init__(self, pa.binary(16), "my_package.uuid")
140-
141-
def __arrow_ext_serialize__(self):
142-
# since we don't have a parameterized type, we don't need extra
143-
# metadata to be deserialized
144-
return b''
145-
146-
@classmethod
147-
def __arrow_ext_deserialize__(self, storage_type, serialized):
148-
# return an instance of this subclass given the serialized
149-
# metadata.
150-
return UuidType()
151-
152-
This is a slightly longer implementation (you need to implement the special
153-
methods ``__arrow_ext_serialize__`` and ``__arrow_ext_deserialize__``), and the
154-
extension type needs to be registered to be received through IPC (using
155-
:func:`register_extension_type`), but it has
156-
now a unique name::
157-
158-
>>> uuid_type = UuidType()
159-
>>> uuid_type.extension_name
160-
'my_package.uuid'
161-
162-
>>> pa.register_extension_type(uuid_type)
144+
UuidType(FixedSizeBinaryType(fixed_size_binary[16]))
163145

164146
The receiving application doesn't need to be Python but can still recognize
165-
the extension type as a "uuid" type, if it has implemented its own extension
166-
type to receive it.
167-
If the type is not registered in the receiving application, it will fall back
168-
to the storage type.
147+
the extension type as a "my_package.uuid" type, if it has implemented its own
148+
extension type to receive it. If the type is not registered in the receiving
149+
application, it will fall back to the storage type.
169150

170151
Parameterized extension type
171152
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -187,7 +168,7 @@ of the given frequency since 1970.
187168
# attributes need to be set first before calling
188169
# super init (as that calls serialize)
189170
self._freq = freq
190-
pa.ExtensionType.__init__(self, pa.int64(), 'my_package.period')
171+
super().__init__(pa.int64(), 'my_package.period')
191172

192173
@property
193174
def freq(self):
@@ -198,7 +179,7 @@ of the given frequency since 1970.
198179

199180
@classmethod
200181
def __arrow_ext_deserialize__(cls, storage_type, serialized):
201-
# return an instance of this subclass given the serialized
182+
# Return an instance of this subclass given the serialized
202183
# metadata.
203184
serialized = serialized.decode()
204185
assert serialized.startswith("freq=")
@@ -209,31 +190,10 @@ Here, we ensure to store all information in the serialized metadata that is
209190
needed to reconstruct the instance (in the ``__arrow_ext_deserialize__`` class
210191
method), in this case the frequency string.
211192

212-
Note that, once created, the data type instance is considered immutable. If,
213-
in the example above, the ``freq`` parameter would change after instantiation,
214-
the reconstruction of the type instance after IPC will be incorrect.
193+
Note that, once created, the data type instance is considered immutable.
215194
In the example above, the ``freq`` parameter is therefore stored in a private
216195
attribute with a public read-only property to access it.
217196

218-
Parameterized extension types are also possible using the pickle-based type
219-
subclassing :class:`PyExtensionType`. The equivalent example for the period
220-
data type from above would look like::
221-
222-
class PeriodType(pa.PyExtensionType):
223-
224-
def __init__(self, freq):
225-
self._freq = freq
226-
pa.PyExtensionType.__init__(self, pa.int64())
227-
228-
@property
229-
def freq(self):
230-
return self._freq
231-
232-
def __reduce__(self):
233-
return PeriodType, (self.freq,)
234-
235-
Also the storage type does not need to be fixed but can be parameterized.
236-
237197
Custom extension array class
238198
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
239199

@@ -252,12 +212,16 @@ the data as a 2-D Numpy array ``(N, 3)`` without any copy::
252212
return self.storage.flatten().to_numpy().reshape((-1, 3))
253213

254214

255-
class Point3DType(pa.PyExtensionType):
215+
class Point3DType(pa.ExtensionType):
256216
def __init__(self):
257-
pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))
217+
super().__init__(pa.list_(pa.float32(), 3), "my_package.Point3DType")
258218

259-
def __reduce__(self):
260-
return Point3DType, ()
219+
def __arrow_ext_serialize__(self):
220+
return b''
221+
222+
@classmethod
223+
def __arrow_ext_deserialize__(cls, storage_type, serialized):
224+
return Point3DType()
261225

262226
def __arrow_ext_class__(self):
263227
return Point3DArray
@@ -289,11 +253,8 @@ The additional methods in the extension class are then available to the user::
289253

290254

291255
This array can be sent over IPC, received in another Python process, and the custom
292-
extension array class will be preserved (as long as the definitions of the classes above
293-
are available).
294-
295-
The same ``__arrow_ext_class__`` specialization can be used with custom types defined
296-
by subclassing :class:`ExtensionType`.
256+
extension array class will be preserved (as long as the receiving process registers
257+
the extension type using :func:`register_extension_type` before reading the IPC data).
297258

298259
Custom scalar conversion
299260
~~~~~~~~~~~~~~~~~~~~~~~~
@@ -304,18 +265,24 @@ If you want scalars of your custom extension type to convert to a custom type wh
304265
For example, if we wanted the above example 3D point type to return a custom
305266
3D point class instead of a list, we would implement::
306267

268+
from collections import namedtuple
269+
307270
Point3D = namedtuple("Point3D", ["x", "y", "z"])
308271

309272
class Point3DScalar(pa.ExtensionScalar):
310273
def as_py(self) -> Point3D:
311274
return Point3D(*self.value.as_py())
312275

313-
class Point3DType(pa.PyExtensionType):
276+
class Point3DType(pa.ExtensionType):
314277
def __init__(self):
315-
pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))
278+
super().__init__(pa.list_(pa.float32(), 3), "my_package.Point3DType")
316279

317-
def __reduce__(self):
318-
return Point3DType, ()
280+
def __arrow_ext_serialize__(self):
281+
return b''
282+
283+
@classmethod
284+
def __arrow_ext_deserialize__(cls, storage_type, serialized):
285+
return Point3DType()
319286

320287
def __arrow_ext_scalar_class__(self):
321288
return Point3DScalar

python/pyarrow/tests/test_cffi.py

+40-8
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@
1616
# specific language governing permissions and limitations
1717
# under the License.
1818

19+
import contextlib
1920
import ctypes
2021
import gc
2122

@@ -51,18 +52,33 @@ def PyCapsule_IsValid(capsule, name):
5152
return ctypes.pythonapi.PyCapsule_IsValid(ctypes.py_object(capsule), name) == 1
5253

5354

54-
class ParamExtType(pa.PyExtensionType):
55+
@contextlib.contextmanager
56+
def registered_extension_type(ext_type):
57+
pa.register_extension_type(ext_type)
58+
try:
59+
yield
60+
finally:
61+
pa.unregister_extension_type(ext_type.extension_name)
62+
63+
64+
class ParamExtType(pa.ExtensionType):
5565

5666
def __init__(self, width):
5767
self._width = width
58-
pa.PyExtensionType.__init__(self, pa.binary(width))
68+
super().__init__(pa.binary(width),
69+
"pyarrow.tests.test_cffi.ParamExtType")
5970

6071
@property
6172
def width(self):
6273
return self._width
6374

64-
def __reduce__(self):
65-
return ParamExtType, (self.width,)
75+
def __arrow_ext_serialize__(self):
76+
return str(self.width).encode()
77+
78+
@classmethod
79+
def __arrow_ext_deserialize__(cls, storage_type, serialized):
80+
width = int(serialized.decode())
81+
return cls(width)
6682

6783

6884
def make_schema():
@@ -75,6 +91,12 @@ def make_extension_schema():
7591
metadata={b'key1': b'value1'})
7692

7793

94+
def make_extension_storage_schema():
95+
# Should be kept in sync with make_extension_schema
96+
return pa.schema([('ext', ParamExtType(3).storage_type)],
97+
metadata={b'key1': b'value1'})
98+
99+
78100
def make_batch():
79101
return pa.record_batch([[[1], [2, 42]]], make_schema())
80102

@@ -204,7 +226,10 @@ def test_export_import_array():
204226
pa.Array._import_from_c(ptr_array, ptr_schema)
205227

206228

207-
def check_export_import_schema(schema_factory):
229+
def check_export_import_schema(schema_factory, expected_schema_factory=None):
230+
if expected_schema_factory is None:
231+
expected_schema_factory = schema_factory
232+
208233
c_schema = ffi.new("struct ArrowSchema*")
209234
ptr_schema = int(ffi.cast("uintptr_t", c_schema))
210235

@@ -215,7 +240,7 @@ def check_export_import_schema(schema_factory):
215240
assert pa.total_allocated_bytes() > old_allocated
216241
# Delete and recreate C++ object from exported pointer
217242
schema_new = pa.Schema._import_from_c(ptr_schema)
218-
assert schema_new == schema_factory()
243+
assert schema_new == expected_schema_factory()
219244
assert pa.total_allocated_bytes() == old_allocated
220245
del schema_new
221246
assert pa.total_allocated_bytes() == old_allocated
@@ -240,7 +265,13 @@ def test_export_import_schema():
240265

241266
@needs_cffi
242267
def test_export_import_schema_with_extension():
243-
check_export_import_schema(make_extension_schema)
268+
# Extension type is unregistered => the storage type is imported
269+
check_export_import_schema(make_extension_schema,
270+
make_extension_storage_schema)
271+
272+
# Extension type is registered => the extension type is imported
273+
with registered_extension_type(ParamExtType(1)):
274+
check_export_import_schema(make_extension_schema)
244275

245276

246277
@needs_cffi
@@ -319,7 +350,8 @@ def test_export_import_batch():
319350

320351
@needs_cffi
321352
def test_export_import_batch_with_extension():
322-
check_export_import_batch(make_extension_batch)
353+
with registered_extension_type(ParamExtType(1)):
354+
check_export_import_batch(make_extension_batch)
323355

324356

325357
def _export_import_batch_reader(ptr_stream, reader_factory):

0 commit comments

Comments
 (0)