RFC: Standardizing Python database connector PyArrow interfaces #2
Comments
Why do you think these methods need to be namespaced with arrow_? And how do you see them being functionally different from the fetchall/fetchmany methods that the adbc-driver-manager already implements? |
According to my understanding of ADBC's current implementation, the only changes that are required are:
Because the non-prefixed ones are already defined by PEP 249. Not sure I get what you mean. |
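For illustration, a minimal sketch (hypothetical class, not taken from the draft) of how prefixed methods would sit alongside the PEP 249 ones, whose return types are fixed by that spec:

```python
import pyarrow as pa

class Cursor:
    """Hypothetical PEP 249 cursor extended with Arrow methods."""

    def fetchall(self) -> list[tuple]:
        # PEP 249 mandates a sequence of rows here, so this method
        # cannot simply start returning Arrow data.
        ...

    def arrow_fetchall(self) -> pa.Table:
        # The prefixed name adds an Arrow-native path without
        # touching the PEP 249 contract.
        ...
```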
I was coming from the perspective that fetchmany / fetchall could return an Arrow object for a particular driver if they wanted to, but I guess that would be incompatible with the current PEP. Do other CPython PEPs standardize access to objects from outside the standard library? |
Good question, and that's also why I'm unsure this should be a PEP or a less "official" document. Do you have any feedback on the contents of the document? Do you think the ADBC driver would be willing to implement the proposal as currently specified? |
Why not standardize on the methods that DuckDB and ADBC already implement? Also:
|
My proposed names are consistent with PEP 249 plus a prefix. Happy to pick any other names though.
You mean without actually fetching any data? Good idea, although I'm not sure every driver will be able to support this?
Can you explain the benefits of getting the schema separately?
Agreed, that might be a better place for this.
Good question; it is exposed by Databricks and similar things are exposed by other drivers so I added it to the spec. |
I just mean as an analogy to
|
I was thinking about this as well, but I think we don't need
We can determine both from a table (from a record batch as well?). (Btw, the built-in Python sqlite3 connector does not properly implement PEP 249: it misses 2.) from above.) |
I think it's useful to get the Arrow schema without having to actually read any data (even if you have to execute the query first). SQLite doesn't technically have types, so that's not surprising. |
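For example (the accessor name below is hypothetical, purely to illustrate the benefit): with the schema available up front, a consumer can build an empty, correctly-typed result or set up downstream processing before any rows are transferred:

```python
import pyarrow as pa

# `cursor` is assumed to be a driver cursor as in the thread's examples.
cursor.execute("SELECT id, name FROM users")

# Hypothetical accessor: the schema is known once the query has
# executed, even though no row data has been fetched yet.
schema: pa.Schema = cursor.arrow_fetch_schema()

# e.g. materialize an empty result with the right types up front:
empty = pa.Table.from_batches([], schema=schema)
```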
@lidavidm do you think there is value in having an extra interface for receiving record batches? If all you want is contiguous data, isn't that covered by the table API already if you implement it in the connector so that the table only has contiguous arrays? Or do you think all connectors should provide an API that guarantees contiguous array results? (In this case, what would you expect connectors to do if they are unable to receive contiguous data from upstream?) |
Arrow IPC data is always contiguous (there is no such thing as a table in IPC), so I'm not sure how you would get non-contiguous data there. |
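To illustrate the distinction being discussed (method names hypothetical): a table call materializes the whole result, while a batch-oriented call lets the consumer stream contiguous chunks:

```python
import pyarrow as pa

# Materialize everything at once; the resulting Table may hold
# chunked (non-contiguous) columns:
table: pa.Table = cursor.arrow_fetchall()  # hypothetical name

# Stream the result instead; each RecordBatch is always contiguous,
# which is also what Arrow IPC produces:
for batch in cursor.arrow_fetch_batches():  # hypothetical name
    assert isinstance(batch, pa.RecordBatch)
```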
I'd actually suggest suffixing rather than prefixing.
|
My idea with the prefix was to have better autocomplete support, but I'm fine either way. |
Catching up in real time here; let me add some more colour... ;)
Could do, but it's already a very heterogeneous space, and consistency with PEP 249 would suggest names that still start with fetch. To make it more concrete why it would be beneficial to standardise, here are the backends I've already had to inspect (and add indirection for) in order to properly implement the polars ADBC integration:
- Databricks: fetchall_arrow / fetchmany_arrow
- DuckDB: fetch_arrow_table / fetch_record_batch
- Snowflake driver: fetch_arrow_all / fetch_arrow_batches
- Turbodbc: fetchallarrow / fetcharrowbatches
- Google BigQuery Client (coming soon): to_arrow on the query result
- arrow-odbc-py (coming soon): read_arrow_batches_from_odbc
I'm considering breaking out the relevant code as a small helper library later to assist other projects that need to do the same thing, as otherwise everyone wanting to receive Arrow data from database queries will have to reinvent this dispatch (see the sketch below). Also, we're leaving Flight drivers out of this picture (InfluxDB, Dremio, etc.), as they are their own thing (though a wrapper that could translate Flight drivers into a more familiar cursor-based interface might be interesting).
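A rough sketch (not the actual polars code) of the kind of per-driver dispatch that this heterogeneity forces on every consumer today:

```python
import pyarrow as pa

# Method names the backends listed above actually expose:
ARROW_FETCH_METHODS = (
    "fetchall_arrow",     # Databricks
    "fetch_arrow_table",  # DuckDB
    "fetch_arrow_all",    # Snowflake
    "fetchallarrow",      # Turbodbc
)

def fetch_arrow_table(cursor) -> pa.Table:
    """Try each known driver-specific Arrow fetch method in turn."""
    for name in ARROW_FETCH_METHODS:
        fetch = getattr(cursor, name, None)
        if callable(fetch):
            return fetch()
    raise NotImplementedError(
        f"no known Arrow fetch method on {type(cursor).__name__!r}"
    )
```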
Yes, an ecosystem proposal is probably the way to go, given that Arrow data doesn't otherwise exist in Python core. (@MarcoGorelli, you seem to have a number of such proposals underway already; maybe we can tap your experience here!)
Yes, for essentially all the same reasons that the existing batch-oriented APIs exist. |
@alexander-beedie added BigQuery and arrow-odbc-py to interfaces.md. |
I'd actually argue we should instead push this proposal forward and implement a unified interface everywhere. |
Indeed; essentially it would serve both as a stopgap and as an example of why some standardisation would be welcome ;) |
Already a thing: https://pypi.org/project/adbc-driver-flightsql/ (Not sure why Dremio/InfluxDB won't use it. It is integration-tested against Dremio.) |
I must be going blind; it's literally right there, haha... very nice. I'll add a note about it in the polars documentation. |
You might add a description, and also a link that points directly to the driver's documentation instead of the Arrow homepage :-) |
A general design question: should those APIs be specified to return PyArrow objects (which is quite specific) or objects implementing the dataframe protocol? Since PyArrow implements the dataframe protocol, returning a PyArrow table from those methods would still be possible. Conversely, the caller of these methods could easily convert their result to a PyArrow table (and if the data is already in the Arrow format, the conversion is presumably zero-copy? @AlenkaF). However, the dataframe protocol supports fewer datatypes than Arrow does (for example, no nested types), so being strictly conforming would impose more limitations. Also, the dataframe protocol spec is not finalized and still seems to be undergoing changes, which might make it an unsuitable target for now. |
Another design option is for those APIs to return PyCapsule objects wrapping the C Data Interface structures. However, this would first require apache/arrow#34031 to be implemented @jorisvandenbossche :-) Also, returning a PyCapsule object may make the API lower-level and less easy to use. |
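For context, a sketch of what the capsule-based option could look like on the consumer side, assuming the interface proposed in apache/arrow#34031 (still a draft at the time of this thread, so the names here are assumptions):

```python
import pyarrow as pa

# Assumption: the driver returns an object exposing the Arrow C
# stream via the (then-draft) __arrow_c_stream__ protocol method.
stream_obj = cursor.arrow_fetchall()  # hypothetical capsule-based variant

# Recent PyArrow versions can import such a stream directly:
reader = pa.RecordBatchReader.from_stream(stream_obj)
table = reader.read_all()
```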
Definitely Arrow; not everything that consumes Arrow data is a DataFrame, so the additional indirection isn't going to be helpful in many cases. Meanwhile, it's trivial to go from Arrow to a DataFrame (via the interchange protocol or otherwise, typically zero-copy) if that is the intended end state. PyCapsule seems like something that could be layered on later. |
Sure, but users have to pay the cost of the PyArrow dependency even if they don't use it. That may stifle adoption. Also, depending on what your standardization goal is, a library-agnostic protocol may be an easier sell than an API returning PyArrow objects. |
Indeed, a |
As I said, it depends on what the goal is. If it's meant to be an informal spec for DB backends already producing PyArrow data, then returning PyArrow objects is obviously fine and we merely need to agree on API names and semantic details. If you want this to be more official and perhaps encourage other backends to adopt it, including backends developed by people who don't care about PyArrow, then you may get better adoption by making the spec library-agnostic. |
There is actually no conversion in this case. If the object being consumed is a PyArrow Table, the data is simply passed through. Adding the correct link to the docs: https://arrow.apache.org/docs/dev/python/interchange_protocol.html
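For reference, the pass-through behaviour described above, using PyArrow's implementation of the interchange protocol:

```python
import pyarrow as pa
from pyarrow.interchange import from_dataframe

table = pa.table({"a": [1, 2, 3]})

# Consuming an object that already holds Arrow data involves no
# conversion; the underlying buffers are reused.
same = from_dataframe(table)
assert same.equals(table)
```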
Agree. These limitations are quite strong for the case being discussed here, in my opinion. |
Thanks for the valuable input, @pitrou.
Not exactly the goal of this proposal, but motivated by those (incompatible) DB backends. The goal is to have a standardized interface between "Arrow in Python" and Python database connectors. IMO "Arrow in Python" doesn't necessarily mean it must be PyArrow, although currently that's the only viable option to directly work on Table/RecordBatch objects in Python, if I'm informed correctly? Polars can consume Arrow data without PyArrow, but you can't really interact with the Arrow data without having PyArrow installed. Having the proposed PyCapsule interface available might change that, though. |
We can probably keep that in mind while working out that API and how PyArrow interacts with it. But it's difficult to say until it's actually implemented :-) |
I've made up my mind on this. I want to standardize something now, and I'm happy to pay the price of it being less universal and less "official". So, I'd like to go forward with this proposal with PyArrow objects as the return types. |
FYI just want to chime in that I am pushing the PyCapsule work forward now, in case that changes your feelings about that option. 😄 |
@alexander-beedie could you please have a look at the PyCapsule interface? Is this something Polars would be willing to support for data loading? If yes, it might be a better alternative to requiring PyArrow for this RFC. |
Hi all,
motivated by a Polars PR that adds support for various Arrow-compatible Python database drivers, I would like to create a standard for retrieving Arrow data from Python database drivers. Consider this an extension to PEP 249 that adds the following methods:
Cursor.arrow_fetchall()
Cursor.arrow_fetchmany()
I've created an initial draft and would love everyone's feedback on this initiative and the proposal.
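A sketch of how the proposed methods would be used with a conforming driver (the connection setup is a placeholder, and I'm assuming arrow_fetchmany takes a size argument by analogy with PEP 249):

```python
import pyarrow as pa

con = driver.connect(...)  # placeholder: any conforming PEP 249 driver
cur = con.cursor()

cur.execute("SELECT * FROM some_table")
table: pa.Table = cur.arrow_fetchall()  # full result as one Table

cur.execute("SELECT * FROM some_table")
# Assumption: arrow_fetchmany(size) returns an empty table once the
# result set is exhausted, mirroring fetchmany's empty sequence.
while (chunk := cur.arrow_fetchmany(1024)).num_rows:
    ...  # process each Arrow chunk
```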