
Support a query_arrow_stream API that produces a pyarrow.RecordBatchReader #155

Closed
cpcloud opened this issue Mar 24, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

cpcloud commented Mar 24, 2023

Is your feature request related to a problem? Please describe.

My feature request is not related to a problem.

Describe the solution you'd like

I'd like to be able to run the following code:

>>> import clickhouse_connect
>>> import pyarrow as pa
>>> con = clickhouse_connect.get_client(...)
>>> with con.query_arrow_stream(...) as stream:
...     assert isinstance(stream, pa.RecordBatchReader)

Describe alternatives you've considered

The alternatives are:

  1. Building a record batch reader from column blocks. This has the overhead of converting from ClickHouse Native to Python objects and then back to Arrow, which is extremely wasteful and inefficient.
  2. Building a record batch reader from row blocks. This is strictly worse than alternative 1, as it requires additional work to convert the data into columnar form.
  3. Building a record batch reader from query_arrow. This defeats the purpose of streaming results back.
  4. Building a record batch reader from query_df_stream. This is again wasteful, and has the additional downside of being lossy, because pandas has poor support for some data types, especially nullable ones.
cpcloud added the enhancement label Mar 24, 2023
cpcloud commented Mar 24, 2023

An iterator of pa.RecordBatch would also be acceptable, though slightly less efficient.

rbeeli commented Apr 12, 2023

This would be great for batch processing, and it looks like it could reduce memory consumption if the client fetches batches on demand as they are read from the RecordBatchReader instance (I'm not sure this is currently possible with the ClickHouse Arrow implementation).

genzgd commented Apr 12, 2023

Just a note that I'm not ignoring this -- when we initially added Arrow support by leveraging the ClickHouse Arrow format, I couldn't get the ArrowStream variant to work at all. It might be related to doing it over HTTP, it might be something else I couldn't figure out, or it might be that something isn't as expected in the ClickHouse back end support. It may be a while before I can dig into it, but of course community contributions are always welcome.

bepec commented Sep 20, 2023

I'm using the snippet below to feed record batches into the polars.from_arrow function. It produces a dataframe 1.2-1.5x faster than client.query_arrow(), and yields almost the same throughput as curl with FORMAT ArrowStream:

import http.client

import pyarrow
import pyarrow.ipc

def arrow_from_query(query):
    # AVT_CLICKHOUSE_HOST / AVT_CLICKHOUSE_PORT are module-level constants
    conn = http.client.HTTPConnection(AVT_CLICKHOUSE_HOST, port=AVT_CLICKHOUSE_PORT)
    query += " FORMAT ArrowStream SETTINGS output_format_arrow_string_as_string=1"
    conn.request("POST", url="/", body=query)
    with conn.getresponse() as resp:
        with pyarrow.ipc.open_stream(resp) as sin:
            result = list(sin)  # sin yields record batches incrementally
        resp.read()  # drain the response so the server can close the connection gracefully
    conn.close()
    return result

genzgd commented Sep 20, 2023

Nice, thanks for the code snippet!

NotSimone mentioned this issue Mar 24, 2024
genzgd commented Mar 24, 2024

Closed with PR #321

genzgd closed this as completed Mar 24, 2024