Support a query_arrow_stream API that produces a pyarrow.RecordBatchReader #155
Comments
An iterator of
This would be great for batch processing, and it looks like it could reduce memory consumption if the client processes each batch while still reading from the
Just a note that I'm not ignoring this -- when we initially added Arrow support by leveraging the ClickHouse Arrow format, I couldn't get the ArrowStream variant to work at all. It might be related to trying to do it over HTTP, it might be something else that I couldn't figure out, or it might be that something isn't as expected in the back-end ClickHouse support. It may be a while before I can dig into it, but of course community contributions are always welcome.
I'm using the snippet below to feed record batches into the polars.from_arrow function. It appears to produce a DataFrame 1.2-1.5x faster than client.query_arrow() and yields almost the same throughput as curl with the ArrowStream format.
Nice, thanks for the code snippet!
Closed with PR #321
Is your feature request related to a problem? Please describe.
My feature request is not related to a problem.
Describe the solution you'd like
I'd like to be able to run the following code:
Describe alternatives you've considered
The alternatives are:
- query_arrow. This defeats the purpose of streaming results back.
- query_df_stream. This is again wasteful and has the additional downside of being lossy, because pandas has poor support for data types, especially nullable data types.