Arrowbic technical design

Arrowbic is built on top of PyArrow extension types and extends the standard API defined by the latter. Every extension type registered in the library must implement the following minimal interface:

from typing import Any, Dict, Iterator, Type

import pyarrow as pa
import arrowbic


class MyExtensionArray(arrowbic.BaseExtensionArray):
    """A custom extension array can implement a custom interface, improving the usability of the
    extension array in common Python code.

    `BaseExtensionArray` already extends the `ExtensionArray` API from PyArrow, adding common
    methods to filter, ...
    """


@arrowbic.register_extension_type
class MyExtensionType(arrowbic.BaseExtensionType):
    def __init__(
        self,
        storage_type: pa.DataType = None,
        item_pyclass: Type[Any] = None,
        # The default is a class, not an instance, hence `Type[...]`.
        extension_array_cls: Type[pa.ExtensionArray] = MyExtensionArray,
    ):
        """Constructor of an extension type instance.

        Calling the constructor with no arguments should always return the same default extension type
        instance, which is the one used for registering in PyArrow. Instances built with non-None arguments
        are cached internally by Arrowbic, so that PyArrow/Arrowbic does not constantly re-deserialize them.

        Args:
            storage_type: Storage type to use for this instance.
            item_pyclass: Item Python class to associate with the extension type.
            extension_array_cls: Extension array class.
        """
        pass

    def __arrowbic_ext_metadata__(self) -> Dict[str, Any]:
        """Extract the metadata of the extension type.

        Returns:
            Extension type metadata dictionary.
        """
        return super().__arrowbic_ext_metadata__()

    @classmethod
    def __arrowbic_priority__(cls) -> int:
        """Defines the ordering in the Arrowbic registry, used for finding the extension type
        matching a Python class.

        Returns:
            Integer priority, used in the registry ordering. Zero by default.
        """
        return 0

    @classmethod
    def __arrowbic_is_item_pyclass_supported__(cls, item_pyclass: Type[Any]) -> bool:
        """Check whether the extension type supports (is compatible with) an item Python class.

        Args:
            item_pyclass: Item Python class.
        Returns:
            True if the class is supported by the extension type.
        """
        raise NotImplementedError()

    @classmethod
    def __arrowbic_make_item_pyclass__(cls, storage_type: pa.DataType, ext_metadata: Dict[str, Any]) -> Type[Any]:
        """Generate the Python item class from the Arrow storage type and extension metadata.

        Args:
            storage_type: Storage type used.
            ext_metadata: Extension metadata.
        Returns:
            Generated Python item class compatible with the storage type and metadata.
        """
        raise NotImplementedError()

    def __arrowbic_from_item_iterator__(self, it_items: Iterator[Any]) -> pa.ExtensionArray:
        """Build the extension array from a Python item iterator.

        Args:
            it_items: Iterator over the Python items.
        Returns:
            Extension array, with the proper data.
        """
        raise NotImplementedError()
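
The two class-level hooks `__arrowbic_priority__` and `__arrowbic_is_item_pyclass_supported__` suggest how a registry could pick an extension type for a given Python class. The following is only a minimal sketch of that idea, not Arrowbic's actual registry: `SketchExtensionType`, `IntType`, `BoolType` and `find_extension_type` are hypothetical names, and the choice to consult higher priorities first is an assumption.

```python
from typing import Any, Iterable, Type


class SketchExtensionType:
    """Hypothetical stand-in for an Arrowbic extension type (not the real base class)."""

    @classmethod
    def __arrowbic_priority__(cls) -> int:
        return 0

    @classmethod
    def __arrowbic_is_item_pyclass_supported__(cls, item_pyclass: Type[Any]) -> bool:
        raise NotImplementedError()


class IntType(SketchExtensionType):
    @classmethod
    def __arrowbic_is_item_pyclass_supported__(cls, item_pyclass: Type[Any]) -> bool:
        return issubclass(item_pyclass, int)


class BoolType(SketchExtensionType):
    @classmethod
    def __arrowbic_priority__(cls) -> int:
        # bool is a subclass of int, so this type must be consulted before IntType.
        return 10

    @classmethod
    def __arrowbic_is_item_pyclass_supported__(cls, item_pyclass: Type[Any]) -> bool:
        return issubclass(item_pyclass, bool)


def find_extension_type(
    registry: Iterable[Type[SketchExtensionType]], item_pyclass: Type[Any]
) -> Type[SketchExtensionType]:
    """Return the first registered type that supports `item_pyclass`.

    Assumption: types with a higher priority are consulted first.
    """
    ordered = sorted(registry, key=lambda ext: ext.__arrowbic_priority__(), reverse=True)
    for ext_cls in ordered:
        if ext_cls.__arrowbic_is_item_pyclass_supported__(item_pyclass):
            return ext_cls
    raise KeyError(f"No extension type supports {item_pyclass!r}")


registry = [IntType, BoolType]
bool_match = find_extension_type(registry, bool)
int_match = find_extension_type(registry, int)
```

Without the priority, `IntType` would also claim `bool` items, since `issubclass(bool, int)` is true; the ordering is what lets a more specific extension type win.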

This additional interface is required for Arrowbic to provide the following functionality:

  • Generic arrowbic.array factory method: this method must be able to associate a Python class with an extension type in order to build the appropriate array. The extension type must then know how to actually build the extension array from a collection of Python items;
  • Deserialization of Arrowbic arrays serialized by a different package and/or service. It is fairly common for data to be serialized, either to a file format or to the raw IPC format, and then deserialized in a service or program which has no knowledge of the item Python classes registered and used in the first place. Hence, Arrowbic extension types are required to be able to regenerate a valid Python item class from the Arrow storage type and the extension metadata.
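
The second requirement, regenerating an item class on a consumer that never imported the original one, can be illustrated with the standard library alone. This is a hedged sketch, not Arrowbic's implementation: the struct storage type is stood in for by a plain `{field name: Arrow type name}` dictionary rather than a real `pa.DataType`, and the `item_pyclass_name` metadata key, the JSON encoding and the `make_item_pyclass` helper are all assumptions made for the example.

```python
import dataclasses
import json
from typing import Any, Dict, Type

# Assumption: a small mapping from Arrow type names to Python types.
_ARROW_TO_PYTHON: Dict[str, type] = {"int64": int, "double": float, "string": str, "bool": bool}


def make_item_pyclass(storage_fields: Dict[str, str], ext_metadata: Dict[str, Any]) -> Type[Any]:
    """Rebuild a dataclass from a struct-like storage description and extension metadata.

    `storage_fields` is a stand-in for an Arrow struct storage type.
    """
    # Hypothetical metadata key carrying the original class name.
    class_name = ext_metadata.get("item_pyclass_name", "GeneratedItem")
    fields = [(name, _ARROW_TO_PYTHON[arrow_type]) for name, arrow_type in storage_fields.items()]
    return dataclasses.make_dataclass(class_name, fields)


# Producer side: the extension metadata travels with the data as bytes.
payload = json.dumps({"item_pyclass_name": "Point"}).encode("utf-8")

# Consumer side: no `Point` class was ever imported here.
metadata = json.loads(payload.decode("utf-8"))
Point = make_item_pyclass({"x": "double", "y": "double"}, metadata)
point = Point(x=1.0, y=2.0)
```

The regenerated class is not the producer's original class, only a structurally compatible one; methods defined on the original are lost, which is why the storage type and metadata together must carry everything needed to rebuild the fields.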