
[Python] Define a Dataset protocol based on Substrait and C Data Interface #37504

Open

wjones127 opened this issue Sep 1, 2023 · 8 comments

@wjones127 (Member) commented Sep 1, 2023

Describe the enhancement requested

Based on discussion in the 2023-08-30 Arrow community meeting. This is a continuation of #35568 and #33986.

We'd like to have a protocol for sharing unmaterialized datasets that:

  1. Can be consumed as one or more streams of Arrow data
  2. Can have projections and filters pushed down to the scanner

This would provide an extensible connection between scanners and query engines. Data formats might include Iceberg, Delta Lake, Lance, and PyArrow datasets (Parquet, JSON, CSV). Query engines could include DuckDB, DataFusion, Polars, PyVelox, PySpark, Ray, and Dask. Such a connection would let end users employ their preferred query engine to load any supported dataset. From their perspective, usage might look like:

from deltalake import DeltaTable
table = DeltaTable("path/to/table")

import duckdb
duckdb.sql("SELECT y FROM table WHERE x > 3")

The protocol is largely invisible to the user. Behind the scenes, duckdb would call __arrow_scanner__() on table to get a scannable object. It would then push down the column selection ['y'] and the filter x > 3 to the scanner, and use the resulting data stream as input to the query.
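
For illustration, the consumer-side flow might look roughly like the following. This is a hedged sketch using the method names proposed below; to_substrait is a hypothetical helper that encodes the filter as a Substrait expression.

# Hypothetical consumer-side flow inside the query engine.
scanner = table.__arrow_scanner__()

schema_capsule = scanner.get_schema()      # ArrowSchema PyCapsule
stream_capsule = scanner.get_stream(
    columns=["y"],                         # projection pushdown
    filter=to_substrait("x > 3"),          # hypothetical Substrait encoder
)
# The engine then imports the stream via the C Data Interface and feeds
# it into query execution.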

Shape of the protocol

The overall shape would look roughly like:

from abc import ABC, abstractmethod


class AbstractArrowScannable(ABC):
    @abstractmethod
    def __arrow_scanner__(self) -> "AbstractArrowScanner":
        ...


class AbstractArrowScanner(ABC):
    @abstractmethod
    def get_schema(self) -> capsule[ArrowSchema]:
        ...

    @abstractmethod
    def get_stream(
        self,
        columns: list[str],
        filter: SubstraitExpression,
    ) -> capsule[ArrowArrayStream]:
        ...

    @abstractmethod
    def get_partitions(self, filter: SubstraitExpression) -> list["AbstractArrowScanner"]:
        ...

Data and schema are returned as C Data Interface objects (see: #35531). Predicates are passed as Substrait extended expressions.
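
For reference, PyArrow already exposes a helper for producing Substrait extended expressions. A minimal sketch of serializing the x > 3 filter, assuming a recent PyArrow (14+) built with Substrait support:

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.substrait as pas

schema = pa.schema([("x", pa.int64()), ("y", pa.string())])

# Encode `x > 3` as a serialized Substrait ExtendedExpression message,
# bound to the dataset's schema.
filter_buf = pas.serialize_expressions(
    [pc.field("x") > pc.scalar(3)], ["filter"], schema
)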

Component(s)

Python

@paleolimbot (Member)

Is schema negotiation outside the scope of this protocol? If get_schema() contains a Utf8View, for example, is it the consumer's responsibility to do the cast, or can the consumer pass a schema with Utf8View columns as Utf8 to get_stream() (or another method)?

@wjones127 (Member, Author)

> Is schema negotiation outside the scope of this protocol?

I think we can include that. I'd like to design that as part of the PyCapsule API first, so we match the semantics there.
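
For comparison, the PyCapsule interface's __arrow_c_stream__ takes an optional requested_schema capsule; get_stream could plausibly mirror that. A sketch, not a settled design:

def get_stream(
    self,
    columns: list[str],
    filter: SubstraitExpression,
    requested_schema=None,  # ArrowSchema PyCapsule, as in __arrow_c_stream__
) -> capsule[ArrowArrayStream]:
    # A producer could cast to the requested types (e.g. Utf8View -> Utf8)
    # when possible, and raise if the request can't be honored.
    ...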

@wjones127 (Member, Author)

Haven't had time to work on this, but I wanted to note that a current pain point for users of the dataset API is the lack of table statistics the caller can access, which leads to bad join orders. Some mentions of this here:

https://twitter.com/mim_djo/status/1740542585410814393
delta-io/delta-rs#1838

@pitrou (Member) commented Feb 27, 2024

Are we sure a blocking API like this would be palatable for existing execution engines such as Acero, DuckDB...?

Of course, at worst the various method/function calls can be offloaded to a dedicated thread pool.
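
As a rough illustration of that offloading idea (scanner and filter_expr are assumed to exist; this is not part of the proposed protocol):

from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

# Run the potentially slow, blocking call off the engine's main thread.
partitions_future = executor.submit(scanner.get_partitions, filter_expr)
# ... the engine continues planning, then collects the result when needed:
partitions = partitions_future.result()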

@wjones127 (Member, Author)

> Are we sure a blocking API like this would be palatable?

Are you referring to the fact they would have to acquire the GIL to call these methods? Or something else?

Ideally all these methods are brief.

I haven't discussed this in depth with implementers of query engines, though; I'd be curious to hear their thoughts.

@pitrou (Member) commented Feb 27, 2024

> > Are we sure a blocking API like this would be palatable?
>
> Are you referring to the fact they would have to acquire the GIL to call these methods? Or something else?

No, to the fact that these functions are synchronous.

> Ideally all these methods are brief.

I'm not sure. get_partitions will typically have to walk a filesystem, which can take a while, especially on large datasets or remote filesystems.

@paleolimbot (Member)

Perhaps get_partitions(...) -> Iterable[AbstractArrowScanner] would do it? Not sure if anybody is interested in asyncio for this, but an async iterator might work.

@pitrou (Member) commented Feb 27, 2024

An Iterable would probably be better, indeed. It would not solve the async use case directly, but it would at least allow producing results without blocking on the entire filesystem walk.
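
A generator would satisfy that shape naturally. A minimal sketch, where MyScanner and _list_fragments are hypothetical:

from typing import Iterable

class MyScanner:
    def __init__(self, fragment):
        self._fragment = fragment

    def get_partitions(self, filter) -> Iterable["MyScanner"]:
        # Yield scanners one at a time as the listing proceeds, so a consumer
        # can begin scanning before the full filesystem walk completes.
        for fragment in self._list_fragments(filter):  # hypothetical helper
            yield MyScanner(fragment)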
