
RFC: Standardizing Python database connector PyArrow interfaces #2

jonashaag opened this issue Sep 4, 2023 · 32 comments

jonashaag (Owner) commented Sep 4, 2023

Hi all,

motivated by a Polars PR that adds support for various Arrow-compatible Python database drivers, I would like to create a standard for retrieving Arrow data from Python database drivers. Consider this an extension to PEP 249 that adds the following methods:

  • Cursor.arrow_fetchall()
  • Cursor.arrow_fetchmany()

I've created an initial draft and would love everyone's feedback on this initiative and the proposal.
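
For illustration, here is a rough sketch of what these methods could look like layered over an existing PEP 249 cursor (a hypothetical mixin with a naive rows-to-Table conversion; a real driver would of course build the Arrow data natively):

```python
import pyarrow as pa

class ArrowFetchMixin:
    """Hypothetical mixin over a PEP 249 cursor, for illustration only."""

    def _rows_to_table(self, rows):
        # naive fallback conversion; real drivers would produce Arrow natively
        names = [d[0] for d in self.description]
        columns = list(zip(*rows)) if rows else [() for _ in names]
        return pa.table({n: list(c) for n, c in zip(names, columns)})

    def arrow_fetchall(self):
        # proposed: fetch all remaining rows as a pyarrow.Table
        return self._rows_to_table(self.fetchall())

    def arrow_fetchmany(self, size=None):
        # proposed: fetch the next `size` rows (default: cursor.arraysize)
        return self._rows_to_table(self.fetchmany(size or self.arraysize))
```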

jonashaag changed the title from "RFC: Standardizing Python database driver interfaces" to "RFC: Standardizing Python database connector PyArrow interfaces" on Sep 4, 2023
WillAyd commented Sep 4, 2023

Why do you think these methods need to be namespaced with arrow_? And how do you see them being functionally different from the fetchall/fetchmany methods that the adbc-driver-manager already implements?

jonashaag (Owner) commented:

According to my understanding of ADBC's current implementation, the only changes that are required are:

  • Aliasing fetchallarrow() to arrow_fetchall()
  • Adding an arrow_fetchmany() method.

Why do you think these methods need to be namespaced with arrow_?

Because the non-prefixed ones are already defined by PEP 249. Not sure I get what you mean.
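
For instance, the alias itself could be this small (a hedged sketch; fetchallarrow() is the Turbodbc-style method the ADBC DB-API layer already provides, arrow_fetchall is the proposed name):

```python
import pyarrow as pa
from adbc_driver_manager import dbapi

class Cursor(dbapi.Cursor):  # illustrative subclass, not how ADBC wires cursors
    def arrow_fetchall(self) -> pa.Table:
        # delegate to the method ADBC already exposes
        return self.fetchallarrow()
```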

WillAyd commented Sep 4, 2023

I was coming from the perspective that fetchmany / fetchall could return an arrow object for a particular driver if they wanted to, but I guess that would be incompatible with the current PEP.

Do other CPython PEPs standardize access to non-standard library objects?

jonashaag (Owner) commented:

Do other CPython PEPs standardize access to non-standard library objects?

Good question, and that's also why I'm unsure this should be a PEP or a less "official" document.

Do you have any feedback on the contents of the document? Do you think the ADBC driver would be willing to implement the proposal as currently specified?

lidavidm commented Sep 4, 2023

Why not standardize on the methods that DuckDB and ADBC already implement?

Also:

  • We should expose a way to get the Arrow schema.
  • We should expose a way to get an Arrow RecordBatchReader.
  • Would a PEP really talk about a third party library like this? It might be better to have this as an ecosystem proposal akin to the dataframe interchange project or similar? (+Could you reach out to dev@arrow.apache.org?)
  • Is arrow_fetchmany actually useful? (And wouldn't it make more sense to return a RecordBatch?)
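
To make the first two bullets concrete, a consumer might look roughly like this (arrow_schema and arrow_reader are placeholder names, not settled API):

```python
import pyarrow as pa

def read_all(cur) -> pa.Table:
    """Hypothetical consumer of the two suggested accessors."""
    cur.execute("SELECT * FROM t")
    schema = cur.arrow_schema()   # Arrow schema without fetching any rows
    reader = cur.arrow_reader()   # a pyarrow.RecordBatchReader
    # the reader yields RecordBatches lazily, one at a time
    return pa.Table.from_batches(reader, schema=schema)
```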

jonashaag (Owner) commented:

Why not standardize on the methods that DuckDB and ADBC already implement?

My proposed names are consistent with PEP 249 plus a prefix. Happy to pick any other names though.

We should expose a way to get the Arrow schema.

You mean without actually fetching any data? Good idea, although I'm not sure every driver will be able to support this?

We should expose a way to get an Arrow RecordBatchReader.

Can you explain the benefits of getting RecordBatches vs. Tables? I'm not familiar with record batches.

Would a PEP really talk about a third party library like this? It might be better to have this as an ecosystem proposal akin to the dataframe interchange project or similar?

Agreed, that might be a better place for this.

Is arrow_fetchmany actually useful?

Good question; it's exposed by Databricks, and similar methods are exposed by other drivers, so I added it to the spec.

lidavidm commented Sep 4, 2023

You mean without actually fetching any data? Good idea, although I'm not sure every driver will be able to support this?

I just mean as an analogy to Cursor.description

Can you explain the benefits of getting RecordBatches vs. Tables? I'm not familiar with record batches.

A RecordBatch is contiguous in memory
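
For example (plain pyarrow, runnable as-is):

```python
import pyarrow as pa

b1 = pa.RecordBatch.from_pydict({"x": [1, 2]})
b2 = pa.RecordBatch.from_pydict({"x": [3, 4]})

# a Table column is a ChunkedArray and may span several separate buffers
table = pa.Table.from_batches([b1, b2])
print(table.column("x").num_chunks)                   # 2

# combine_chunks() copies everything into one contiguous chunk per column
print(table.combine_chunks().column("x").num_chunks)  # 1
```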

jonashaag (Owner) commented:

I just mean as an analogy to Cursor.description

I was thinking about this as well, but I think we don't need description. I think the purpose of description is:

  1. To provide a list of column names
  2. To provide a list of the original (SQL) types, e.g. to distinguish TEXT from VARCHAR (both are represented as strings)
  3. The other 5 fields in the 7-tuple are typically unused.

We can determine both from a table (from a record batch as well?).

(Btw, the built-in Python sqlite3 connector does not fully implement PEP 249: it omits 2. from the list above.)
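
Concretely, column names and Arrow-level types are one attribute away on a table's schema (plain pyarrow; the original SQL type name from 2. is not recoverable from the Arrow type alone):

```python
import pyarrow as pa

table = pa.table({"id": [1, 2], "name": ["a", "b"]})
print(table.schema.names)   # ['id', 'name']
print(table.schema.types)   # [int64, string]
# a RecordBatch exposes .schema as well, so this works for batches too
```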

lidavidm commented Sep 4, 2023

I think it's useful to get the Arrow schema without having to actually read any data (even if you have to execute the query first).

SQLite doesn't technically have types, so that's not surprising.

jonashaag (Owner) commented:

@lidavidm do you think there is value in having an extra interface for receiving record batches? If all you want is contiguous data, isn't that covered by the table API already if you implement it in the connector so that the table only has contiguous arrays? Or do you think all connectors should provide an API that guarantees contiguous array results? (In this case, what would you expect connectors to do if they are unable to receive contiguous data from upstream?)

lidavidm commented Sep 4, 2023

Arrow IPC data is always contiguous (there is no such thing as a table in IPC), so I'm not sure how you would get non-contiguous data there.
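
To see that concretely, a round trip through the IPC stream format comes back as individual contiguous RecordBatches (runnable sketch):

```python
import io

import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.Table.from_batches([pa.RecordBatch.from_pydict({"x": [1, 2]}),
                               pa.RecordBatch.from_pydict({"x": [3, 4]})])

buf = io.BytesIO()
with ipc.new_stream(buf, table.schema) as writer:
    writer.write_table(table)

buf.seek(0)
with ipc.open_stream(buf) as reader:
    for batch in reader:
        # each element is a contiguous RecordBatch; there is no "table" in IPC
        print(type(batch).__name__, batch.num_rows)
```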

alexander-beedie commented Sep 5, 2023

I'd actually suggest suffixing _arrow rather than prefixing arrow_. This collects the various fetch* methods together such that they are more discoverable, rather than having related methods live at a (lexical) distance from each other.

  • Cursor.fetchall()
  • Cursor.fetchmany()
  • Cursor.fetchall_arrow()
  • Cursor.fetchmany_arrow()

jonashaag (Owner) commented:

My idea with the prefix was to have better autocomplete support, but I'm fine either way.

alexander-beedie commented Sep 5, 2023

Catching up in realtime here; let me add some more colour... ;)

Why not standardize on the methods that DuckDB and ADBC already implement?

Could do, but it's already a very heterogeneous space, and consistency with PEP 249 would suggest fetchall_arrow & fetchmany_arrow (as implemented by Databricks).

To make it more concrete why standardising would be beneficial, here are the backends I've already had to inspect (and add indirection for) in order to properly implement the polars read_database feature (pola-rs/polars#10649) that inspired @jonashaag to initiate this RFC (these can also be found in the "interfaces.md" file):

ADBC:

  • fetch_arrow_table
  • fetch_record_batch (coming soon)

Databricks:

  • fetchall_arrow
  • fetchmany_arrow (supports exact batch size)

DuckDB:

  • arrow
  • fetch_arrow_table
  • fetch_record_batch (supports exact batch size)

Snowflake driver:

  • fetch_arrow_all
  • fetch_arrow_batches (batch size determined by backend)

Turbodbc:

  • fetchallarrow
  • fetcharrowbatches (batch size determined by backend)

Google BigQuery Client (coming soon):

  • to_arrow, no batch method; client object is used as a connection.

arrow-odbc-py (coming soon):

  • read_arrow_batches_from_odbc (most non-standard of all, only returns batches)

I'm considering breaking out the relevant code as a small helper library later to assist other projects that need to do the same (a sketch of the dispatch logic follows below), as otherwise everyone wanting to receive Arrow data from database queries will end up duplicating this work.
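
For illustration, the core of that indirection is just method-name dispatch over the list above (a simplified sketch; the real code also has to normalise batch handling and return types):

```python
# simplified sketch of the indirection: try each known driver-specific
# "fetch everything as Arrow" method in turn
_FETCH_ALL_CANDIDATES = (
    "fetch_arrow_table",  # ADBC, DuckDB
    "fetchall_arrow",     # Databricks
    "fetch_arrow_all",    # Snowflake
    "fetchallarrow",      # Turbodbc
    "arrow",              # DuckDB
    "to_arrow",           # Google BigQuery client
)

def fetch_arrow(cursor):
    """Return an Arrow table via whichever method this driver exposes."""
    for name in _FETCH_ALL_CANDIDATES:
        method = getattr(cursor, name, None)
        if callable(method):
            return method()
    raise AttributeError("no known Arrow fetch method on this cursor/client")
```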

Also, we're leaving Flight drivers out of this picture (InfluxDB, Dremio, etc), as they are their own thing (a wrapper that could translate Flight drivers into a more familiar fetch* style might be a plausible value-add later, but it's out of scope here ;)


It might be better to have this as an ecosystem proposal akin to the dataframe interchange project or similar?

Yes an ecosystem proposal is probably the way to go, given that Arrow data doesn't otherwise exist in Python core. (@MarcoGorelli, you seem to have a number of such proposals underway already, maybe we can tap your experience here!)

Is arrow_fetchmany actually useful?

Yes, for essentially all the same reasons that the existing fetchmany is useful (and it looks like the next update to ADBC will support a fetch_arrow_batches). Whether we expect this to return distinct Table objects or a series of RecordBatches is up in the air; I'm agnostic here.

jonashaag (Owner) commented:

@alexander-beedie added BigQuery and arrow-odbc-py to interfaces.md.

jonashaag (Owner) commented:

I'm considering breaking out the relevant code as a small helper library

I'd actually argue we should instead push this proposal forward and implement a unified interface everywhere.

alexander-beedie commented Sep 5, 2023

I'd actually argue we should instead push this proposal forward and implement a unified interface everywhere.

Indeed; essentially it would serve both as a stopgap and as an example of why some standardisation would be welcome ;)

lidavidm commented Sep 5, 2023

Also, we're leaving Flight drivers out of this picture (InfluxDB, Dremio, etc), as they are their own thing (a wrapper that could translate Flight drivers into a more familiar fetch* style might be a plausible value-add later, but it's out of scope here ;)

Already a thing: https://pypi.org/project/adbc-driver-flightsql/

(Not sure why Dremio/InfluxDB won't use it. It is integration-tested against Dremio.)

alexander-beedie commented Sep 5, 2023

Already a thing: https://pypi.org/project/adbc-driver-flightsql/

I must be going blind; it's literally right there, haha... very nice. I'll add a note about it in the polars read_database docs as I'm sure that will be useful for anyone else (like me) who somehow failed to notice that it exists.

pitrou commented Sep 6, 2023

Already a thing: https://pypi.org/project/adbc-driver-flightsql/

You might add a description and also provide a documentation link that points directly to the driver's documentation instead of the Arrow homepage :-)

pitrou commented Sep 6, 2023

A general design question: should those APIs be specified to return PyArrow objects (which is quite specific) or objects implementing the dataframe protocol?

Since PyArrow implements the dataframe protocol, returning a PyArrow table from those methods would still be possible. Conversely, the caller of these methods could easily convert their result to a PyArrow table (and if the data is already in the Arrow format, the conversion is presumably zero-copy? @AlenkaF).

However, the dataframe protocol supports fewer datatypes than Arrow does (for example, no nested types), so being strictly conforming would impose more limitations.

Also, the dataframe protocol spec is not finalized and still seems to be undergoing changes, which might make it an unsuitable target for now.

pitrou commented Sep 6, 2023

Another design option is for those APIs to return PyCapsule objects to the C Data Interface. However, this would first require apache/arrow#34031 to be implemented @jorisvandenbossche :-)

Also, returning a PyCapsule object may make the API lower-level and less easy to use.

alexander-beedie commented Sep 6, 2023

A general design question: should those APIs be specified to return PyArrow objects (which is quite specific) or objects implementing the dataframe protocol?

Definitely Arrow; not everything that consumes Arrow data is a DataFrame, so the additional indirection isn't going to be helpful in many cases. Meanwhile it's trivial to go from Arrow to a DataFrame (via the interchange protocol or otherwise, typically zero-copy) if that is the intended end state.

PyCapsule seems like something that adbc_driver_manager would be taking advantage of behind the scenes, rather than exposing front & centre to the caller? (Who likely just wants a Table or RecordBatch :)

pitrou commented Sep 6, 2023

Meanwhile it's trivial to go from Arrow to a DataFrame (via the interchange protocol or otherwise, and typically zero-copy) if that is the intended end state

Sure, but users have to pay the cost of the PyArrow dependency even if they don't use it. That may stifle adoption.

Also, depending on what your standardization goal is, a library-agnostic protocol may be an easier sell than an API returning PyArrow objects.

alexander-beedie commented Sep 6, 2023

Also, depending on what your standardization goal is, a library-agnostic protocol may be an easier sell than an API returning PyArrow objects.

Indeed, a fetch_dataframe method could be a nice extension that wouldn't require pyarrow if it's interchange-protocol compliant. But the db-connectivity libraries referenced above already have baked-in pyarrow dependencies (they explicitly offer Table and/or RecordBatch returns), so standardising the methods by which they return that data to Python wouldn't introduce any fresh dependency or overhead.

pitrou commented Sep 6, 2023

As I said, it depends on what the goal is. If it's meant to be an informal spec for DB backends already producing PyArrow data, then returning PyArrow objects is obviously fine and we merely need to agree on API names and semantic details.

If you want this to be more official and perhaps encourage other backends to adopt it, including backends developed by people who don't care about PyArrow, then you may get better adoption by making the spec library-agnostic.

AlenkaF commented Sep 6, 2023

Since PyArrow implements the dataframe protocol, returning a PyArrow table from those methods would still be possible. Conversely, the caller of these methods could easily convert their result to a PyArrow table (and if the data is already in the Arrow format, the conversion is presumably zero-copy? @AlenkaF).

There is actually no conversion in this case. If the object being consumed is a pyarrow.Table/RecordBatch, then pyarrow's dataframe interchange protocol implementation returns the pyarrow object itself:

https://github.com/apache/arrow/blob/ad7f6ef3a97be6a3eb6fe37d0df1b1634db057d2/python/pyarrow/interchange/from_dataframe.py#L103-L106

Adding the correct link to the docs: https://arrow.apache.org/docs/dev/python/interchange_protocol.html

However, the dataframe protocol supports fewer datatypes than Arrow does (for example, no nested types), so being strictly conforming would impose more limitations.

Agreed. These limitations are quite significant for the case being discussed here, in my opinion.
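
Illustrating the no-conversion path from above (runnable; the identity return is per the linked source):

```python
import pyarrow as pa
from pyarrow.interchange import from_dataframe

table = pa.table({"x": [1, 2, 3]})
# no copy and no conversion: the very same object comes back
assert from_dataframe(table) is table
```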

jonashaag (Owner) commented Sep 6, 2023

Thanks for the valuable input @pitrou.

If it's meant to be an informal spec for DB backends already producing PyArrow data

Not exactly the goal of this proposal, but motivated by those (incompatible) DB backends.

The goal is to have a standardized interface between "Arrow in Python" and Python database connectors.

IMO "Arrow in Python" doesn't necessarily mean it must be PyArrow. Although currently that's the only viable option to directly work on Table/BatchRecord objects in Python if I am informed correctly? Polars can consume Arrow data without PyArrow but you can't really interact with the Arrow data without having PyArrow installed.

Having PyCapsule output instead of pyarrow.* would be fine IMO, iff it's relatively easy to convert to pyarrow.*.

pitrou commented Sep 6, 2023

Having PyCapsule output instead of pyarrow.* would be fine IMO, iff it's relatively easy to convert to pyarrow.*.

We can probably keep that in mind while working out that API and how PyArrow interacts with it. But it's difficult to say until it's actually implemented :-)

jonashaag (Owner) commented Sep 6, 2023

I've made up my mind on this. I want to standardize something now, and I'm happy to pay the price of it being less universal and less "official". So, I'd like to go forward with this proposal with pyarrow.* output types.

wjones127 commented:

FYI just want to chime in that I am pushing the PyCapsule work forward now, in case that changes your feelings about that option. 😄

apache/arrow#37797

jonashaag (Owner) commented:

@alexander-beedie could you please have a look at the PyCapsule interface? Is this something Polars would be willing to support for data loading? If yes, it might be a better alternative to requiring PyArrow for this RFC.
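
For reference, a rough sketch of how a driver-side object might expose the capsule-based stream protocol without the spec ever naming pyarrow (everything here is illustrative: pyarrow is only used to fake the producer side, and consuming via pa.table() assumes a pyarrow version that implements the protocol):

```python
import pyarrow as pa

class CursorResult:
    """Hypothetical driver result object exposing the Arrow PyCapsule
    stream protocol instead of returning pyarrow objects directly."""

    def __init__(self, table: pa.Table):
        self._table = table  # stand-in for the driver's native Arrow data

    def __arrow_c_stream__(self, requested_schema=None):
        # delegate to pyarrow's implementation purely for illustration;
        # a real driver would export its own C stream here
        return self._table.__arrow_c_stream__(requested_schema)

result = CursorResult(pa.table({"x": [1, 2, 3]}))
# a protocol-aware consumer can turn any such object into a Table
print(pa.table(result))
```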
