
Support a query_arrow_stream API that produces a pyarrow.RecordBatchReader #155

Closed
cpcloud opened this issue Mar 24, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

cpcloud commented Mar 24, 2023

Is your feature request related to a problem? Please describe.

My feature request is not related to a problem.

Describe the solution you'd like

I'd like to be able to run the following code:

>>> import clickhouse_connect
>>> import pyarrow as pa
>>> con = clickhouse_connect.get_client(...)
>>> with con.query_arrow_stream(...) as stream:
...     assert isinstance(stream, pa.RecordBatchReader)

Describe alternatives you've considered

The alternatives are:

  1. Building a record batch reader from column blocks. This has the overhead of converting from ClickHouse Native to Python objects and then back to Arrow, which is extremely wasteful and inefficient.
  2. Building a record batch reader from row blocks. This is strictly worse than alternative 1, as it requires additional work to convert the data into columnar form.
  3. Building a record batch reader from query_arrow. This defeats the purpose of streaming results back.
  4. Building a record batch reader from query_df_stream. This is again wasteful, and has the additional downside of being lossy, because pandas has poor support for some data types, especially nullable ones.
cpcloud added the enhancement label Mar 24, 2023
cpcloud commented Mar 24, 2023

An iterator of pa.RecordBatch would also be acceptable, though slightly less efficient.

rbeeli commented Apr 12, 2023

This would be great for batch processing, and it looks like it could reduce memory consumption if the client fetches batches on demand as they are read from the RecordBatchReader instance (I'm not sure this is currently possible with the ClickHouse Arrow implementation).

genzgd commented Apr 12, 2023

Just a note that I'm not ignoring this -- when we initially added Arrow support by leveraging the ClickHouse Arrow format, I couldn't get the ArrowStream variant to work at all. It might be related to doing it over HTTP, it might be something else I couldn't figure out, or it might be that something isn't as expected in the ClickHouse back end support. It may be a while before I can dig into it, but of course community contributions are always welcome.

bepec commented Sep 20, 2023

I'm using the snippet below to feed record batches into the polars.from_arrow function. It produces a dataframe 1.2-1.5x faster than client.query_arrow(), and yields almost the same throughput as curl with FORMAT ArrowStream:

import http.client

import pyarrow
import pyarrow.ipc

def arrow_from_query(query):
    # AVT_CLICKHOUSE_HOST / AVT_CLICKHOUSE_PORT are module-level constants
    conn = http.client.HTTPConnection(AVT_CLICKHOUSE_HOST, port=AVT_CLICKHOUSE_PORT)
    query += " FORMAT ArrowStream SETTINGS output_format_arrow_string_as_string=1"
    conn.request("POST", url="/", body=query)
    with conn.getresponse() as resp:
        with pyarrow.ipc.open_stream(resp) as sin:
            result = list(sin)  # sin yields record batches incrementally
        resp.read()  # drain the response so the server can close the connection gracefully
    conn.close()
    return result

genzgd commented Sep 20, 2023

Nice, thanks for the code snippet!

NotSimone mentioned this issue Mar 24, 2024
genzgd commented Mar 24, 2024

Closed with PR #321

genzgd closed this as completed Mar 24, 2024