Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to Raw Insert compressed CSV? #223

Closed
Shlomixg opened this issue Jul 18, 2023 · 5 comments · Fixed by #224
Closed

How to Raw Insert compressed CSV? #223

Shlomixg opened this issue Jul 18, 2023 · 5 comments · Fixed by #224
Labels
bug Something isn't working

Comments

@Shlomixg
Copy link

Hi,
I'm trying to insert CSV that is compressed with gzip, but I'm getting this error:

  File "/usr/local/lib/python3.11/site-packages/clickhouse_connect/driver/httpclient.py", line 276, in raw_insert
    response = self._raw_request(insert_block, params, headers,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_connect/driver/httpclient.py", line 425, in _raw_request
    self._error_handler(response)
  File "/usr/local/lib/python3.11/site-packages/clickhouse_connect/driver/httpclient.py", line 350, in _error_handler
    raise OperationalError(err_str) if retried else DatabaseError(err_str) from None
clickhouse_connect.driver.exceptions.DatabaseError: :HTTPDriver for https://<server-url>:443 returned response code 500)
 Code: 354. DB::Exception: inflate failed: data error. (ZLIB_INFLATE_FAILED) (version 23.3.1.2823 (official build))

When I decompress the data using Python and insert it, the insert works flawlessly.

Steps to reproduce

ClickHouse version: 23.3.1.2823
clickhouse-connect version: 0.6.6

Inserting into ReplacingMergeTree engine table with the client:

ch_client = clickhouse_connect.get_client(
    host=HOST,
    port=int(PORT),
    username=USER,
    password=PASSWORD,
    database=DEFAULT_DB,
    compress="gzip",
)

and the command:

query_result = main.ch_client.raw_insert(
    table=f"{database_name}.{table_name}",
    insert_block=csv_data_gzipped,
    settings=settings,
    fmt="CSVWithNames",
    compression="gzip",
)

Also, is it possible to insert gzipped compressed data using the command method? If it is, how?

Thanks

@Shlomixg Shlomixg added the bug Something isn't working label Jul 18, 2023
@Shlomixg
Copy link
Author

For example, inserting into T002 table:

-- snapshots.T002 definition
CREATE TABLE snapshots.T002
(
    `SPRAS` String,
    `LASPEZ` Nullable(String),
    `LAHQ` Nullable(String),
    `LAISO` Nullable(String),
    `UPDATED_ON` DateTime64(3) DEFAULT toDateTime64(now(), 3)
)
ENGINE = ReplacingMergeTree(UPDATED_ON)
ORDER BY SPRAS
SETTINGS index_granularity = 8192;

with uncompressed data:

SPRAS;LASPEZ;LAHQ;LAISO
"0";"S";"0";"SR"
"1";"D";"0";"ZH"
"2";"M";"0";"TH"
"3";"D";"0";"KO"
"4";"S";"0";"RO"
"5";"S";"0";"SL"

and compressed:

b'\x1f\x8b\x08\x00\xd1\x82\xb6d\x02\xff\x0b\x0e\x08r\x0c\xb6\xf6q\x0c\x0ep\x8d\x02R\x1e\x81@\xc23\xd8\x9fK\xc9@\xc9Z)\x18\x88\xc1t\x90\x12\x97\x92!\x90\xe1\x02\x15\x88\xf2\x00\n\x18\x01\x19\xbeP\x81\x10\x90\x801\x92\no\x7f\xa0\x80\t\x92\x19A \x01SdC}\x94\x00\x1e\xcf\t.}\x00\x00\x00'

@genzgd genzgd added enhancement New feature or request bug Something isn't working and removed bug Something isn't working enhancement New feature or request labels Jul 18, 2023
@genzgd
Copy link
Collaborator

genzgd commented Jul 18, 2023

My apologies, this was recently broken by the PR that moved the query from an HTTP query param to the POST body. When inserting binary data, the query itself was not properly compressed. I'm working on releasing a fix.

@genzgd genzgd linked a pull request Jul 18, 2023 that will close this issue
2 tasks
@genzgd
Copy link
Collaborator

genzgd commented Jul 18, 2023

This should work correctly now in the new 0.6.8 version. Here's the new test case for reference:

data_file = f'{Path(__file__).parent}/movies.csv.gz'
with open(data_file, mode='rb') as movies_file:
    data = movies_file.read()
with table_context('test_gzip_movies', ['movie String', 'year UInt16', 'rating Decimal32(3)']):
    insert_result = test_client.raw_insert('test_gzip_movies', None, data, fmt='CSV', compression='gzip',
                                           settings={'input_format_allow_errors_ratio': .2,
                                                     'input_format_allow_errors_num': 5}
                                           )
    assert 248 == insert_result.written_rows

I also added a compression parameter to the tools.insert_file method, so if you are inserting directly from the file system, this should work:

import clickhouse_connect 
from clickhouse_connect.driver.tools import insert_file

client = clickhouse_connect.get_client()
data_file = 'movies.csv.gz'
insert_result = insert_file(test_client, 'movies_table',  data_file)

The insert_file will infer a gzip compression method if it sees the .gz or .gzip file suffix.

Thanks for the report and the detailed information to reproduce.

@Shlomixg
Copy link
Author

Great, it's working now :)

Is it possible to insert gzipped compressed data using the command method?
We are performing a SELECT from INPUT(), so the raw insert is not detailed enough for this use case.

@genzgd
Copy link
Collaborator

genzgd commented Jul 19, 2023

I forgot about that question. You can insert data using the command function by the same mechanism that the raw_insert method uses, by creating an INSERT INTO query and putting the data into the POST body, but there's no reason to use the command method instead of raw_insert. The command function also doesn't allow setting compression, but that's fairly easy to add if the POST data is compressed -- but again, the main use case for that is covered by raw_insert.

You cannot use SELECT FROM input in clickhouse-connect. SELECT FROM input leverages the full power of the clickhouse-client to read from stdin, applies the input table function which actually executes on the client and reads from stdin, and streams the results over the Native protocol to the ClickHouse server. This is currently not possible over HTTP, and would require leveraging much of the internal ClickHouse C++ code to get the full power of the input table function. (The C++ clickhouse-client and clickhouse-server use the same executable, which is several hundred megabytes in size).

You can achieve somewhat similar results using external data with the command function and sending the data to the ClickHouse server as HTTP multi-part form data for processing. Unfortunately, the clickhouse-connect ExternalData code doesn't currently support compression, and compressing multipart form data only works for the entire message, so you can't just send the compressed CSV data separately. It would have to be unzipped and clickhouse-connect would have to recompress the entire HTTP request. That's doable, but it require some code changes and in most cases probably is not be worth the effort depending on your reasons for compression.

Otherwise you might want to stick with clickhouse-client for that particular application if you really need the SELECT FROM input functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants