
RFC: Standardizing Python database connector PyArrow interfaces #2

jonashaag opened this issue Sep 4, 2023 · 32 comments

jonashaag (Owner) commented Sep 4, 2023

Hi all,

motivated by a Polars PR that adds support for various Arrow-compatible Python database drivers, I would like to create a standard for retrieving Arrow data from Python database drivers. Consider this an extension to PEP 249 that adds the following methods:

  • Cursor.arrow_fetchall()
  • Cursor.arrow_fetchmany()

I've created an initial draft and would love everyone's feedback on this initiative and the proposal.
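
For illustration, here is a rough sketch of what these methods could look like layered over an existing PEP 249 cursor (a hypothetical mixin with a naive rows-to-Table conversion; a real driver would of course build the Arrow data natively):

```python
import pyarrow as pa

class ArrowFetchMixin:
    """Hypothetical mixin over a PEP 249 cursor, for illustration only."""

    def _rows_to_table(self, rows):
        # naive fallback conversion; real drivers would produce Arrow natively
        names = [d[0] for d in self.description]
        columns = list(zip(*rows)) if rows else [() for _ in names]
        return pa.table({n: list(c) for n, c in zip(names, columns)})

    def arrow_fetchall(self):
        # proposed: fetch all remaining rows as a pyarrow.Table
        return self._rows_to_table(self.fetchall())

    def arrow_fetchmany(self, size=None):
        # proposed: fetch the next `size` rows (default: cursor.arraysize)
        return self._rows_to_table(self.fetchmany(size or self.arraysize))
```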

jonashaag changed the title from "RFC: Standardizing Python database driver interfaces" to "RFC: Standardizing Python database connector PyArrow interfaces" on Sep 4, 2023
WillAyd commented Sep 4, 2023

Why do you think these methods need to be namespaced with arrow_? And how do you see them being functionally different from the fetchall/fetchmany methods that the adbc-driver-manager already implements?

jonashaag (Owner) commented:

According to my understanding of ADBC's current implementation, the only changes that are required are:

  • Aliasing fetchallarrow() to arrow_fetchall()
  • Adding an arrow_fetchmany() method.

Why do you think these methods need to be namespaced with arrow_?

Because the non-prefixed ones are already defined by PEP 249. Not sure I get what you mean.
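
For instance, the alias itself could be this small (a hedged sketch; fetchallarrow() is the Turbodbc-style method the ADBC DB-API layer already provides, arrow_fetchall is the proposed name):

```python
import pyarrow as pa
from adbc_driver_manager import dbapi

class Cursor(dbapi.Cursor):  # illustrative subclass, not how ADBC wires cursors
    def arrow_fetchall(self) -> pa.Table:
        # delegate to the method ADBC already exposes
        return self.fetchallarrow()
```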

WillAyd commented Sep 4, 2023

I was coming from the perspective that fetchmany / fetchall could return an arrow object for a particular driver if they wanted to, but I guess that would be incompatible with the current PEP.

Do other CPython PEPs standardize access to non-standard library objects?

jonashaag (Owner) commented:

Do other CPython PEPs standardize access to non-standard library objects?

Good question, and that's also why I'm unsure this should be a PEP or a less "official" document.

Do you have any feedback on the contents of the document? Do you think the ADBC driver would be willing to implement the proposal as currently specified?

lidavidm commented Sep 4, 2023

Why not standardize on the methods that DuckDB and ADBC already implement?

Also:

  • We should expose a way to get the Arrow schema.
  • We should expose a way to get an Arrow RecordBatchReader.
  • Would a PEP really talk about a third party library like this? It might be better to have this as an ecosystem proposal akin to the dataframe interchange project or similar? (+Could you reach out to dev@arrow.apache.org?)
  • Is arrow_fetchmany actually useful? (And wouldn't it make more sense to return a RecordBatch?)
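
To make the first two bullets concrete, a consumer might look roughly like this (arrow_schema and arrow_reader are placeholder names, not settled API):

```python
import pyarrow as pa

def read_all(cur) -> pa.Table:
    """Hypothetical consumer of the two suggested accessors."""
    cur.execute("SELECT * FROM t")
    schema = cur.arrow_schema()   # Arrow schema without fetching any rows
    reader = cur.arrow_reader()   # a pyarrow.RecordBatchReader
    # the reader yields RecordBatches lazily, one at a time
    return pa.Table.from_batches(reader, schema=schema)
```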

jonashaag (Owner) commented:

Why not standardize on the methods that DuckDB and ADBC already implement?

My proposed names are consistent with PEP 249 plus a prefix. Happy to pick any other names though.

We should expose a way to get the Arrow schema.

You mean without actually fetching any data? Good idea, although I'm not sure every driver will be able to support this?

We should expose a way to get an Arrow RecordBatchReader.

Can you explain the benefits of getting RecordBatches vs. Tables? I'm not familiar with record batches.

Would a PEP really talk about a third party library like this? It might be better to have this as an ecosystem proposal akin to the dataframe interchange project or similar?

Agreed, that might be a better place for this.

Is arrow_fetchmany actually useful?

Good question; it's exposed by Databricks, and similar methods are exposed by other drivers, so I added it to the spec.

lidavidm commented Sep 4, 2023

You mean without actually fetching any data? Good idea, although I'm not sure every driver will be able to support this?

I just mean as an analogy to Cursor.description

Can you explain the benefits of getting RecordBatches vs. Tables? I'm not familiar with record batches.

A RecordBatch is contiguous in memory
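
For example (plain pyarrow, runnable as-is):

```python
import pyarrow as pa

b1 = pa.RecordBatch.from_pydict({"x": [1, 2]})
b2 = pa.RecordBatch.from_pydict({"x": [3, 4]})

# a Table column is a ChunkedArray and may span several separate buffers
table = pa.Table.from_batches([b1, b2])
print(table.column("x").num_chunks)                   # 2

# combine_chunks() copies everything into one contiguous chunk per column
print(table.combine_chunks().column("x").num_chunks)  # 1
```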

jonashaag (Owner) commented:

I just mean as an analogy to Cursor.description

I was thinking about this as well, but I think we don't need description. I think the purpose of description is:

  1. To provide a list of column names
  2. To provide a list of the original (SQL) types, e.g. to distinguish TEXT from VARCHAR (both are represented as strings)
  3. The other 5 fields in the 7-tuple are typically unused.

We can determine both from a table (from a record batch as well?).

(Btw, the built-in Python sqlite3 connector does not fully implement PEP 249: it omits 2. from the list above.)
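
Concretely, column names and Arrow-level types are one attribute away on a table's schema (plain pyarrow; the original SQL type name from 2. is not recoverable from the Arrow type alone):

```python
import pyarrow as pa

table = pa.table({"id": [1, 2], "name": ["a", "b"]})
print(table.schema.names)   # ['id', 'name']
print(table.schema.types)   # [int64, string]
# a RecordBatch exposes .schema as well, so this works for batches too
```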

lidavidm commented Sep 4, 2023

I think it's useful to get the Arrow schema without having to actually read any data (even if you have to execute the query first).

SQLite doesn't technically have types, so that's not surprising.

jonashaag (Owner) commented:

@lidavidm do you think there is value in having an extra interface for receiving record batches? If all you want is contiguous data, isn't that covered by the table API already if you implement it in the connector so that the table only has contiguous arrays? Or do you think all connectors should provide an API that guarantees contiguous array results? (In this case, what would you expect connectors to do if they are unable to receive contiguous data from upstream?)

lidavidm commented Sep 4, 2023

Arrow IPC data is always contiguous (there is no such thing as a table in IPC), so I'm not sure how you would get non-contiguous data there.
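
To see that concretely, a round trip through the IPC stream format comes back as individual contiguous RecordBatches (runnable sketch):

```python
import io

import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.Table.from_batches([pa.RecordBatch.from_pydict({"x": [1, 2]}),
                               pa.RecordBatch.from_pydict({"x": [3, 4]})])

buf = io.BytesIO()
with ipc.new_stream(buf, table.schema) as writer:
    writer.write_table(table)

buf.seek(0)
with ipc.open_stream(buf) as reader:
    for batch in reader:
        # each element is a contiguous RecordBatch; there is no "table" in IPC
        print(type(batch).__name__, batch.num_rows)
```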

alexander-beedie commented Sep 5, 2023

I'd actually suggest suffixing _arrow rather than prefixing arrow_. This collects the various fetch* methods together such that they are more discoverable, rather than having related methods live at a (lexical) distance from each other.

  • Cursor.fetchall()
  • Cursor.fetchmany()
  • Cursor.fetchall_arrow()
  • Cursor.fetchmany_arrow()

jonashaag (Owner) commented:

My idea with the prefix was to have better autocomplete support, but I'm fine either way.

alexander-beedie commented Sep 5, 2023

Catching up in realtime here; let me add some more colour... ;)

Why not standardize on the methods that DuckDB and ADBC already implement?

Could do, but it's already a very heterogeneous space, and consistency with PEP 249 would suggest fetchall_arrow & fetchmany_arrow (as implemented by Databricks).

To make it more concrete why standardising would be beneficial, here are the backends I've already had to inspect (and add indirection for) in order to properly implement the polars read_database feature (pola-rs/polars#10649) that inspired @jonashaag to initiate this RFC (these can also be found in the "interfaces.md" file):

ADBC:

  • fetch_arrow_table
  • fetch_record_batch (coming soon)

Databricks:

  • fetchall_arrow
  • fetchmany_arrow (supports exact batch size)

DuckDB:

  • arrow
  • fetch_arrow_table
  • fetch_record_batch (supports exact batch size)

Snowflake driver:

  • fetch_arrow_all
  • fetch_arrow_batches (batch size determined by backend)

Turbodbc:

  • fetchallarrow
  • fetcharrowbatches (batch size determined by backend)

Google BigQuery Client (coming soon):

  • to_arrow, no batch method; client object is used as a connection.

arrow-odbc-py (coming soon):

  • read_arrow_batches_from_odbc (most non-standard of all, only returns batches)

I'm considering breaking out the relevant code as a small helper library later to assist other projects that need to do the same (a sketch of the dispatch logic follows below), as otherwise everyone wanting to receive Arrow data from database queries will end up duplicating this work.
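
For illustration, the core of that indirection is just method-name dispatch over the list above (a simplified sketch; the real code also has to normalise batch handling and return types):

```python
# simplified sketch of the indirection: try each known driver-specific
# "fetch everything as Arrow" method in turn
_FETCH_ALL_CANDIDATES = (
    "fetch_arrow_table",  # ADBC, DuckDB
    "fetchall_arrow",     # Databricks
    "fetch_arrow_all",    # Snowflake
    "fetchallarrow",      # Turbodbc
    "arrow",              # DuckDB
    "to_arrow",           # Google BigQuery client
)

def fetch_arrow(cursor):
    """Return an Arrow table via whichever method this driver exposes."""
    for name in _FETCH_ALL_CANDIDATES:
        method = getattr(cursor, name, None)
        if callable(method):
            return method()
    raise AttributeError("no known Arrow fetch method on this cursor/client")
```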

Also, we're leaving Flight drivers out of this picture (InfluxDB, Dremio, etc), as they are their own thing (a wrapper that could translate Flight drivers into a more familiar fetch* style might be a plausible value-add later, but it's out of scope here ;)


It might be better to have this as an ecosystem proposal akin to the dataframe interchange project or similar?

Yes an ecosystem proposal is probably the way to go, given that Arrow data doesn't otherwise exist in Python core. (@MarcoGorelli, you seem to have a number of such proposals underway already, maybe we can tap your experience here!)

Is arrow_fetchmany actually useful?

Yes, for essentially all the same reasons that the existing fetchmany is useful (and it looks like the next update to ADBC will support a fetch_arrow_batches). Whether we expect this to return distinct Table objects or a series of RecordBatches is up in the air; I'm agnostic here.

jonashaag (Owner) commented:

@alexander-beedie added BigQuery and arrow-odbc-py to interfaces.md.

jonashaag (Owner) commented:

I'm considering breaking out the relevant code as a small helper library

I'd actually argue we should instead push this proposal forward and implement a unified interface everywhere.

alexander-beedie commented Sep 5, 2023

I'd actually argue we should instead push this proposal forward and implement a unified interface everywhere.

Indeed; essentially it would serve both as a stopgap and as an example of why some standardisation would be welcome ;)

lidavidm commented Sep 5, 2023

Also, we're leaving Flight drivers out of this picture (InfluxDB, Dremio, etc), as they are their own thing (a wrapper that could translate Flight drivers into a more familiar fetch* style might be a plausible value-add later, but it's out of scope here ;)

Already a thing: https://pypi.org/project/adbc-driver-flightsql/

(Not sure why Dremio/InfluxDB won't use it. It is integration-tested against Dremio.)

alexander-beedie commented Sep 5, 2023

Already a thing: https://pypi.org/project/adbc-driver-flightsql/

I must be going blind; it's literally right there, haha... very nice. I'll add a note about it in the polars read_database docs as I'm sure that will be useful for anyone else (like me) who somehow failed to notice that it exists.

pitrou commented Sep 6, 2023

Already a thing: https://pypi.org/project/adbc-driver-flightsql/

You might add a description and also provide a documentation link that points directly to the driver's documentation instead of the Arrow homepage :-)

pitrou commented Sep 6, 2023

A general design question: should those APIs be specified to return PyArrow objects (which is quite specific) or objects implementing the dataframe protocol?

Since PyArrow implements the dataframe protocol, returning a PyArrow table from those methods would still be possible. Conversely, the caller of these methods could easily convert their result to a PyArrow table (and if the data is already in the Arrow format, the conversion is presumably zero-copy? @AlenkaF).

However, the dataframe protocol supports fewer datatypes than Arrow does (for example, no nested types), so being strictly conforming would impose more limitations.

Also, the dataframe protocol spec is not finalized and still seems to be undergoing changes, which might make it an unsuitable target for now.

pitrou commented Sep 6, 2023

Another design option is for those APIs to return PyCapsule objects to the C Data Interface. However, this would first require apache/arrow#34031 to be implemented @jorisvandenbossche :-)

Also, returning a PyCapsule object may make the API lower-level and less easy to use.

alexander-beedie commented Sep 6, 2023

A general design question: should those APIs be specified to return PyArrow objects (which is quite specific) or objects implementing the dataframe protocol?

Definitely Arrow; not everything that consumes Arrow data is a DataFrame, so the additional indirection isn't going to be helpful in many cases. Meanwhile it's trivial to go from Arrow to a DataFrame (via the interchange protocol or otherwise, typically zero-copy) if that is the intended end state.

PyCapsule seems like something that adbc_driver_manager would be taking advantage of behind the scenes, rather than exposing front & centre to the caller? (Who likely just wants a Table or RecordBatch :)

pitrou commented Sep 6, 2023

Meanwhile it's trivial to go from Arrow to a DataFrame (via the interchange protocol or otherwise, and typically zero-copy) if that is the intended end state

Sure, but users have to pay the cost of the PyArrow dependency even if they don't use it. That may stifle adoption.

Also, depending on what your standardization goal is, a library-agnostic protocol may be an easier sell than an API returning PyArrow objects.

alexander-beedie commented Sep 6, 2023

Also, depending on what your standardization goal is, a library-agnostic protocol may be an easier sell than an API returning PyArrow objects.

Indeed, a fetch_dataframe method could be a nice extension that wouldn't require pyarrow if it's interchange-protocol compliant. But the db-connectivity libraries referenced above already have baked-in pyarrow dependencies (they explicitly offer Table and/or RecordBatch returns), so standardising the methods by which they return that data to Python wouldn't introduce any fresh dependency or overhead.

pitrou commented Sep 6, 2023

As I said, it depends on what the goal is. If it's meant to be an informal spec for DB backends already producing PyArrow data, then returning PyArrow objects is obviously fine and we merely need to agree on API names and semantic details.

If you want this to be more official and perhaps encourage other backends to adopt it, including backends developed by people who don't care about PyArrow, then you may get better adoption by making the spec library-agnostic.

AlenkaF commented Sep 6, 2023

Since PyArrow implements the dataframe protocol, returning a PyArrow table from those methods would still be possible. Conversely, the caller of these methods could easily convert their result to a PyArrow table (and if the data is already in the Arrow format, the conversion is presumably zero-copy? @AlenkaF).

There is actually no conversion in this case. If the object being consumed is a pyarrow.Table/RecordBatch, then pyarrow's dataframe interchange protocol implementation returns the pyarrow object itself:

https://github.com/apache/arrow/blob/ad7f6ef3a97be6a3eb6fe37d0df1b1634db057d2/python/pyarrow/interchange/from_dataframe.py#L103-L106

Adding the correct link to the docs: https://arrow.apache.org/docs/dev/python/interchange_protocol.html

However, the dataframe protocol supports fewer datatypes than Arrow does (for example, no nested types), so being strictly conforming would impose more limitations.

Agreed. These limitations are quite significant for the case being discussed here, in my opinion.
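
Illustrating the no-conversion path from above (runnable; the identity return is per the linked source):

```python
import pyarrow as pa
from pyarrow.interchange import from_dataframe

table = pa.table({"x": [1, 2, 3]})
# no copy and no conversion: the very same object comes back
assert from_dataframe(table) is table
```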

jonashaag (Owner) commented Sep 6, 2023

Thanks for the valuable input @pitrou.

If it's meant to be an informal spec for DB backends already producing PyArrow data

Not exactly the goal of this proposal, but motivated by those (incompatible) DB backends.

The goal is to have a standardized interface between "Arrow in Python" and Python database connectors.

IMO "Arrow in Python" doesn't necessarily mean it must be PyArrow. Although currently that's the only viable option to directly work on Table/BatchRecord objects in Python if I am informed correctly? Polars can consume Arrow data without PyArrow but you can't really interact with the Arrow data without having PyArrow installed.

Having PyCapsule output instead of pyarrow.* would be fine IMO, iff it's relatively easy to convert to pyarrow.*.

pitrou commented Sep 6, 2023

Having PyCapsule output instead of pyarrow.* would be fine IMO, iff it's relatively easy to convert to pyarrow.*.

We can probably keep that in mind while working out that API and how PyArrow interacts with it. But it's difficult to say until it's actually implemented :-)

jonashaag (Owner) commented Sep 6, 2023

I've made up my mind on this. I want to standardize something now, and I'm happy to pay the price of it being less universal and less "official". So, I'd like to go forward with this proposal with pyarrow.* output types.

wjones127 commented:

FYI just want to chime in that I am pushing the PyCapsule work forward now, in case that changes your feelings about that option. 😄

apache/arrow#37797

jonashaag (Owner) commented:

@alexander-beedie could you please have a look at the PyCapsule interface? Is this something Polars would be willing to support for data loading? If yes, it might be a better alternative to requiring PyArrow for this RFC.
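
For reference, a rough sketch of how a driver-side object might expose the capsule-based stream protocol without the spec ever naming pyarrow (everything here is illustrative: pyarrow is only used to fake the producer side, and consuming via pa.table() assumes a pyarrow version that implements the protocol):

```python
import pyarrow as pa

class CursorResult:
    """Hypothetical driver result object exposing the Arrow PyCapsule
    stream protocol instead of returning pyarrow objects directly."""

    def __init__(self, table: pa.Table):
        self._table = table  # stand-in for the driver's native Arrow data

    def __arrow_c_stream__(self, requested_schema=None):
        # delegate to pyarrow's implementation purely for illustration;
        # a real driver would export its own C stream here
        return self._table.__arrow_c_stream__(requested_schema)

result = CursorResult(pa.table({"x": [1, 2, 3]}))
# a protocol-aware consumer can turn any such object into a Table
print(pa.table(result))
```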
