access data after load as dataframes with ibis #1095
Comments
hi @rudolfix, I'm working on Ibis and we were just discussing
we'd be happy to help move this along, particularly if there are any questions we can answer. in my cursory look at |
Hi to you both! I recently spent a decent amount of time with dlt + Ibis and I think there's a very clean abstraction to hand off dlt to Ibis.
dlt side
From the dlt perspective, users pass credentials to create a connection to their pipeline.sql_client().open_connection()
ibis side
In the upcoming Ibis major release, backends are assigned a
integration
To hand off the connection from dlt to Ibis, I got this working
import ibis
import dlt
pipeline = dlt.pipeline(destination="duckdb", ...)
ibis.set_backend("duckdb")
ibis_connection = ibis.get_backend()  # will return non-connected backend
ibis_connection.con = pipeline.sql_client().open_connection()
ibis_connection.list_tables()  # will successfully read data
TODO
with pipeline.ibis_client() as client:
    client.list_tables()
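The `pipeline.ibis_client()` call in the TODO does not exist yet. Below is a minimal sketch of the shape it could take, assuming a context manager that opens the destination connection on entry and closes it on exit. `FakePipeline` and `TableClient` are hypothetical names, not dlt or Ibis APIs, and the stdlib `sqlite3` module stands in for the real destination so the sketch runs without either library installed.

```python
import sqlite3
from contextlib import contextmanager


class TableClient:
    """Hypothetical stand-in for an Ibis backend wrapping an open connection."""

    def __init__(self, con):
        self.con = con

    def list_tables(self):
        rows = self.con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]


class FakePipeline:
    """Hypothetical stand-in for a dlt pipeline; this is not the dlt API."""

    def __init__(self, database=":memory:"):
        self.database = database

    @contextmanager
    def ibis_client(self):
        con = sqlite3.connect(self.database)  # opened by the pipeline side
        try:
            yield TableClient(con)
        finally:
            con.close()  # released even if the body of the `with` raises


pipeline = FakePipeline()
with pipeline.ibis_client() as client:
    client.con.execute("CREATE TABLE items (id INTEGER)")
    print(client.list_tables())  # ['items']
```

The point of the context manager is that the pipeline side keeps control of the connection lifecycle, so the dataframe client never has to know how credentials were resolved.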
|
@lostmygithubaccount @zilto integrating ibis via What about reading parquet files? There's a way to register a parquet file and query it. Are we able to register parquet files with My goal here is to use ibis as the dataframe engine :) and expose it as I imagined in the initial post. So whenever users want to interact with a dataset via dataframes, they get an ibis client; if they want to interact via sql, they get (more or less) a dbapi connection. The interface is partially inspired by what duckdb does. What is your take on this? Maybe I go too far with hiding what is really under the hood. |
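The idea of registering a file and then querying it can be sketched with the standard library alone. This is a stand-in under stated assumptions, not the dlt or Ibis implementation: a JSON Lines file replaces Parquet (DuckDB can read Parquet directly, the stdlib cannot), an in-memory `sqlite3` database replaces the temporary duckdb database, and `register_jsonl` is a hypothetical helper name.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def register_jsonl(con, path, table_name):
    """Load a JSON Lines file into `table_name` on the open connection `con`.

    Column names are taken from the first record; table and column names are
    assumed to be trusted identifiers.
    """
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    if not rows:
        raise ValueError(f"{path} contains no records")
    columns = list(rows[0])
    col_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    con.execute(f'CREATE TABLE "{table_name}" ({col_sql})')
    con.executemany(
        f'INSERT INTO "{table_name}" ({col_sql}) VALUES ({placeholders})',
        [tuple(row.get(c) for c in columns) for row in rows],
    )
    return table_name


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "items.jsonl"
    path.write_text('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n', encoding="utf-8")
    con = sqlite3.connect(":memory:")  # temporary database, discarded on close
    register_jsonl(con, path, "items")
    print(con.execute("SELECT COUNT(*) FROM items").fetchone()[0])  # 2
    con.close()
```

Once the file is registered as a table, the same open connection can back both access styles discussed here: plain SQL through the dbapi connection, or a dataframe client layered on top of it.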
here we could do two things:
|
Doing ELT
With Extract, Transform, Load ( To make this possible:
^This is where there's immediate value, just needs a bit of coordination
Doing ETL
I'm now more familiar with the dlt internals (extract, normalize, transform). Using the Ibis code is primarily about building "expressions" until an "execution operation" (e.g., insert data, return as dataframe). To start defining expressions, Ibis needs a
The dlt schema evolution / data contract + the Ibis and Substrait relational algebra could provide full lineage and a granular "diff" and visibility over breaking changes |
I have experimented a bit with this here: #1491. There is no proper way to hand over native connections to ibis backends at the moment. For the moment I am getting the backends and just setting the .con property, but this does not work for most destinations, so there'd have to be some work on the ibis project to get this to work. |
@lostmygithubaccount are there any plans to allow sharing of an open connection with ibis? You can see in my code that I am just setting the |
hi @sh-rp, let me try to pull in one of the more experienced engineers on the team -- some initial answers:
I don't know if using this while using Ibis at the same time is well-defined behavior.
Then there is an open issue w/ this ask: ibis-project/ibis#8877 |
@lostmygithubaccount Ah yes, thanks for pointing me to that issue, that is exactly what I'd need. I'll comment there. |
Background
ibis (https://github.com/ibis-project/ibis) is a library that translates dataframe expressions into SQL statements and then executes them in the destination. They do a nice job of compiling the final SQL statement with sqlglot (so the resulting SQL is probably quite optimized).
We have a large overlap in destinations, and we were looking for a decent dataframe -> sql thing from the very start. It seems this is it: we can easily build a helper that exposes any dlt dataset as a dataframe, shares credentials, etc.
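To make the "dataframe expression -> sql" idea concrete, here is a toy illustration. It is not Ibis code and does not use sqlglot: `Expr` is a hypothetical class, predicates are plain SQL strings rather than typed expressions, and `sqlite3` plays the role of the destination. It only shows the pattern of building a query lazily, compiling it to SQL, and executing it where the data lives.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """A lazy query description; nothing runs until `execute` is called."""

    table: str
    columns: tuple = ("*",)
    predicate: str = ""

    def select(self, *columns):
        return Expr(self.table, columns, self.predicate)

    def filter(self, predicate):
        return Expr(self.table, self.columns, predicate)

    def to_sql(self):
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicate:
            sql += f" WHERE {self.predicate}"
        return sql

    def execute(self, con):
        # The destination does the work; only the result rows come back.
        return con.execute(self.to_sql()).fetchall()


expr = Expr("items").filter("price > 10").select("name")
print(expr.to_sql())  # SELECT name FROM items WHERE price > 10

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (name TEXT, price REAL)")
con.executemany("INSERT INTO items VALUES (?, ?)", [("a", 5.0), ("b", 20.0)])
print(expr.execute(con))  # [('b',)]
```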
Implementation
We can integrate deeply or via a helper scheme. In case of a helper, we allow users to get an ibis connection from a dlt destination and/or pipeline. The UX will be similar to the dbt helper.
Deep integration means that we expose the loaded data from the Pipeline, DltSource and DltResource instances. ie.
Implementation is straightforward for sql-like destinations. We won't support vector databases.
It would be really interesting to support the filesystem destination as above, ie. by registering the json and parquet files in a temporary duckdb database and then exposing the database for the ibis and sql access methods.
Ibis Connection sharing
We are discussing a connection sharing approach with ibis here: ibis-project/ibis#8877. As mentioned in the comments there, we could build it in a way that we manage the connection and ibis provides backends that accept an open connection and DO NOT need any additional dependencies.
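A minimal sketch of what "a backend that accepts an open connection" could look like, assuming the caller owns the connection and the backend only borrows it. `SharedConnectionBackend` and `from_connection` are hypothetical names for illustration, not the Ibis API, and `sqlite3` again stands in for a real destination driver.

```python
import sqlite3  # only the caller needs the driver; the class below never references it


class SharedConnectionBackend:
    """Hypothetical backend that borrows a DB-API connection it did not open."""

    def __init__(self, con):
        self._con = con

    @classmethod
    def from_connection(cls, con):
        # Duck-typed check only, so no driver import is needed on the backend side.
        if not hasattr(con, "cursor"):
            raise TypeError("expected an open DB-API connection")
        return cls(con)

    def sql(self, query, params=()):
        cur = self._con.cursor()
        try:
            cur.execute(query, params)
            return cur.fetchall()
        finally:
            cur.close()

    def disconnect(self):
        # The caller owns the connection, so drop the reference without closing it.
        self._con = None


con = sqlite3.connect(":memory:")  # opened and managed by the caller
con.execute("CREATE TABLE items (id INTEGER)")
con.executemany("INSERT INTO items VALUES (?)", [(1,), (2,)])

backend = SharedConnectionBackend.from_connection(con)
print(backend.sql("SELECT COUNT(*) FROM items"))  # [(2,)]
backend.disconnect()
print(con.execute("SELECT 1").fetchone())  # (1,) the connection is still usable
```

The design choice illustrated is ownership: whoever opens the connection closes it, which avoids the current workaround of assigning to a backend's `.con` attribute.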