@@ -68,34 +68,43 @@ message).
 See the :ref:`format_metadata_extension_types` section of the metadata
 specification for more details.
 
-Pyarrow allows you to define such extension types from Python.
-
-There are currently two ways:
-
-* Subclassing :class:`PyExtensionType`: the (de)serialization is based on pickle.
-  This is a good option for an extension type that is only used from Python.
-* Subclassing :class:`ExtensionType`: this allows to give a custom
-  Python-independent name and serialized metadata, that can potentially be
-  recognized by other (non-Python) Arrow implementations such as PySpark.
+Pyarrow allows you to define such extension types from Python by subclassing
+:class:`ExtensionType` and giving the derived class its own extension name
+and serialization mechanism. The extension name and serialized metadata
+can potentially be recognized by other (non-Python) Arrow implementations
+such as PySpark.
 
 For example, we could define a custom UUID type for 128-bit numbers which can
-be represented as ``FixedSizeBinary`` type with 16 bytes.
-Using the first approach, we create a ``UuidType`` subclass, and implement the
-``__reduce__`` method to ensure the class can be properly pickled::
+be represented as ``FixedSizeBinary`` type with 16 bytes::
 
-    class UuidType(pa.PyExtensionType):
+    class UuidType(pa.ExtensionType):
 
         def __init__(self):
-            pa.PyExtensionType.__init__(self, pa.binary(16))
+            super().__init__(pa.binary(16), "my_package.uuid")
+
+        def __arrow_ext_serialize__(self):
+            # Since we don't have a parameterized type, we don't need extra
+            # metadata to be deserialized
+            return b''
 
-        def __reduce__(self):
-            return UuidType, ()
+        @classmethod
+        def __arrow_ext_deserialize__(cls, storage_type, serialized):
+            # Sanity checks, not required but illustrate the method signature.
+            assert storage_type == pa.binary(16)
+            assert serialized == b''
+            # Return an instance of this subclass given the serialized
+            # metadata.
+            return UuidType()
+
+The special methods ``__arrow_ext_serialize__`` and ``__arrow_ext_deserialize__``
+define the serialization of an extension type instance. For non-parametric
+types such as the above, the serialization payload can be left empty.
 
 This can now be used to create arrays and tables holding the extension type::
 
     >>> uuid_type = UuidType()
     >>> uuid_type.extension_name
-    'arrow.py_extension_type'
+    'my_package.uuid'
     >>> uuid_type.storage_type
     FixedSizeBinaryType(fixed_size_binary[16])
 
@@ -112,8 +121,11 @@ This can now be used to create arrays and tables holding the extension type::
     ]
 
 This array can be included in RecordBatches, sent over IPC and received in
-another Python process. The custom UUID type will be preserved there, as long
-as the definition of the class is available (the type can be unpickled).
+another Python process. The receiving process must explicitly register the
+extension type for deserialization, otherwise it will fall back to the
+storage type::
+
+    >>> pa.register_extension_type(UuidType())
 
 For example, creating a RecordBatch and writing it to a stream using the
 IPC protocol::
@@ -129,43 +141,12 @@ and then reading it back yields the proper type::
     >>> with pa.ipc.open_stream(buf) as reader:
     ...     result = reader.read_all()
     >>> result.column('ext').type
-    UuidType(extension<arrow.py_extension_type>)
-
-We can define the same type using the other option::
-
-    class UuidType(pa.ExtensionType):
-
-        def __init__(self):
-            pa.ExtensionType.__init__(self, pa.binary(16), "my_package.uuid")
-
-        def __arrow_ext_serialize__(self):
-            # since we don't have a parameterized type, we don't need extra
-            # metadata to be deserialized
-            return b''
-
-        @classmethod
-        def __arrow_ext_deserialize__(self, storage_type, serialized):
-            # return an instance of this subclass given the serialized
-            # metadata.
-            return UuidType()
-
-This is a slightly longer implementation (you need to implement the special
-methods ``__arrow_ext_serialize__`` and ``__arrow_ext_deserialize__``), and the
-extension type needs to be registered to be received through IPC (using
-:func:`register_extension_type`), but it has
-now a unique name::
-
-    >>> uuid_type = UuidType()
-    >>> uuid_type.extension_name
-    'my_package.uuid'
-
-    >>> pa.register_extension_type(uuid_type)
+    UuidType(FixedSizeBinaryType(fixed_size_binary[16]))
 
 The receiving application doesn't need to be Python but can still recognize
-the extension type as a "uuid" type, if it has implemented its own extension
-type to receive it.
-If the type is not registered in the receiving application, it will fall back
-to the storage type.
+the extension type as a "my_package.uuid" type, if it has implemented its own
+extension type to receive it. If the type is not registered in the receiving
+application, it will fall back to the storage type.
 
 Parameterized extension type
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -187,7 +168,7 @@ of the given frequency since 1970.
             # attributes need to be set first before calling
             # super init (as that calls serialize)
             self._freq = freq
-            pa.ExtensionType.__init__(self, pa.int64(), 'my_package.period')
+            super().__init__(pa.int64(), 'my_package.period')
 
         @property
         def freq(self):
@@ -198,7 +179,7 @@ of the given frequency since 1970.
 
         @classmethod
         def __arrow_ext_deserialize__(cls, storage_type, serialized):
-            # return an instance of this subclass given the serialized
+            # Return an instance of this subclass given the serialized
             # metadata.
             serialized = serialized.decode()
             assert serialized.startswith("freq=")
@@ -209,31 +190,10 @@ Here, we ensure to store all information in the serialized metadata that is
 needed to reconstruct the instance (in the ``__arrow_ext_deserialize__`` class
 method), in this case the frequency string.
 
-Note that, once created, the data type instance is considered immutable. If,
-in the example above, the ``freq`` parameter would change after instantiation,
-the reconstruction of the type instance after IPC will be incorrect.
+Note that, once created, the data type instance is considered immutable.
 In the example above, the ``freq`` parameter is therefore stored in a private
 attribute with a public read-only property to access it.
 
-Parameterized extension types are also possible using the pickle-based type
-subclassing :class:`PyExtensionType`. The equivalent example for the period
-data type from above would look like::
-
-    class PeriodType(pa.PyExtensionType):
-
-        def __init__(self, freq):
-            self._freq = freq
-            pa.PyExtensionType.__init__(self, pa.int64())
-
-        @property
-        def freq(self):
-            return self._freq
-
-        def __reduce__(self):
-            return PeriodType, (self.freq,)
-
-Also the storage type does not need to be fixed but can be parameterized.
-
 Custom extension array class
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
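As a rough illustration of the metadata round-trip discussed for the parameterized ``PeriodType``, the sketch below mimics the serialize/deserialize pair in plain Python. The helper names ``serialize_freq`` and ``deserialize_freq`` are hypothetical and not part of pyarrow, and the ``freq=<value>`` payload layout is an assumption inferred from the ``startswith("freq=")`` check shown in the diff.

```python
# Hypothetical helpers for illustration only; they are not pyarrow APIs.

def serialize_freq(freq: str) -> bytes:
    """Encode the type parameter as bytes (assumed ``freq=<value>`` layout)."""
    return "freq={}".format(freq).encode()


def deserialize_freq(serialized: bytes) -> str:
    """Recover the parameter from the serialized metadata."""
    text = serialized.decode()
    if not text.startswith("freq="):
        raise ValueError("unexpected extension metadata: {!r}".format(text))
    return text[len("freq="):]


payload = serialize_freq("D")
restored = deserialize_freq(payload)
```

The code comment in the diff notes that the base-class initializer calls serialize, so the payload reflects the parameter value at construction time. That fits the advice to treat the type instance as immutable.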
@@ -252,12 +212,16 @@ the data as a 2-D Numpy array ``(N, 3)`` without any copy::
             return self.storage.flatten().to_numpy().reshape((-1, 3))
 
 
-    class Point3DType(pa.PyExtensionType):
+    class Point3DType(pa.ExtensionType):
         def __init__(self):
-            pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))
+            super().__init__(pa.list_(pa.float32(), 3), "my_package.Point3DType")
 
-        def __reduce__(self):
-            return Point3DType, ()
+        def __arrow_ext_serialize__(self):
+            return b''
+
+        @classmethod
+        def __arrow_ext_deserialize__(cls, storage_type, serialized):
+            return Point3DType()
 
         def __arrow_ext_class__(self):
             return Point3DArray
@@ -289,11 +253,8 @@ The additional methods in the extension class are then available to the user::
 
 
 This array can be sent over IPC, received in another Python process, and the custom
-extension array class will be preserved (as long as the definitions of the classes above
-are available).
-
-The same ``__arrow_ext_class__`` specialization can be used with custom types defined
-by subclassing :class:`ExtensionType`.
+extension array class will be preserved (as long as the receiving process registers
+the extension type using :func:`register_extension_type` before reading the IPC data).
 
 Custom scalar conversion
 ~~~~~~~~~~~~~~~~~~~~~~~~
@@ -304,18 +265,24 @@ If you want scalars of your custom extension type to convert to a custom type wh
 For example, if we wanted the above example 3D point type to return a custom
 3D point class instead of a list, we would implement::
 
+    from collections import namedtuple
+
     Point3D = namedtuple("Point3D", ["x", "y", "z"])
 
     class Point3DScalar(pa.ExtensionScalar):
         def as_py(self) -> Point3D:
             return Point3D(*self.value.as_py())
 
-    class Point3DType(pa.PyExtensionType):
+    class Point3DType(pa.ExtensionType):
         def __init__(self):
-            pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))
+            super().__init__(pa.list_(pa.float32(), 3), "my_package.Point3DType")
 
-        def __reduce__(self):
-            return Point3DType, ()
+        def __arrow_ext_serialize__(self):
+            return b''
+
+        @classmethod
+        def __arrow_ext_deserialize__(cls, storage_type, serialized):
+            return Point3DType()
 
         def __arrow_ext_scalar_class__(self):
             return Point3DScalar
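The ``as_py`` override in ``Point3DScalar`` relies on unpacking the storage value into the named tuple. A minimal plain-Python sketch of just that step follows. The hard-coded list ``storage_value`` is a stand-in (an assumption for illustration) for what ``self.value.as_py()`` would return for a fixed-size list of three floats.

```python
from collections import namedtuple

Point3D = namedtuple("Point3D", ["x", "y", "z"])

# Stand-in for ``self.value.as_py()``: assumed here to be a plain
# Python list holding the three coordinates of one point.
storage_value = [1.0, 2.0, 3.0]

# Unpack the list positionally into the named fields x, y, z.
point = Point3D(*storage_value)
```

Because a named tuple is still a tuple, callers that expect a plain sequence continue to work, while new code can use ``point.x``, ``point.y`` and ``point.z``.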