LZ4-compressed PQ files unreadable by Pandas and ClickHouse #3433

Closed
marklit opened this issue Jan 3, 2023 · 1 comment
Labels
bug parquet Changes to the parquet crate

Comments

@marklit

marklit commented Jan 3, 2023

Versions:

  • json2parquet 0.6.0 with the following Cargo packages:
    • parquet = "29.0.0" (as pinned on the main branch, although the file metadata reports 23.0.0 for some reason)
    • arrow = "29.0.0"
    • arrow-schema = { version = "29.0.0", features = ["serde"] }
  • PyArrow 10.0.1
  • ClickHouse 22.13.1.1119
$ vi test.jsonl
{"area": 123, "geom": "", "centroid_x": -86.86346599122807, "centroid_y": 34.751296108771925, "h3_7": "872649315ffffff", "h3_8": "882649315dfffff", "h3_9": "892649315cbffff"}
$ json2parquet -c lz4 test.jsonl lz4.pq
$ ls -lth lz4.pq # 2.5K
$ hexdump -C lz4.pq | head; echo; hexdump -C lz4.pq | tail
00000000  50 41 52 31 15 00 15 1c  15 42 2c 15 02 15 00 15  |PAR1.....B,.....|
00000010  06 15 06 1c 58 08 7b 00  00 00 00 00 00 00 18 08  |....X.{.........|
00000020  7b 00 00 00 00 00 00 00  00 00 00 04 22 4d 18 44  |{..........."M.D|
00000030  40 5e 0e 00 00 80 02 00  00 00 02 01 7b 00 00 00  |@^..........{...|
00000040  00 00 00 00 00 00 00 00  e4 c0 1d d2 15 04 19 25  |...............%|
00000050  00 06 19 18 04 61 72 65  61 15 0a 16 02 16 6a 16  |.....area.....j.|
00000060  90 01 26 08 3c 58 08 7b  00 00 00 00 00 00 00 18  |..&.<X.{........|
00000070  08 7b 00 00 00 00 00 00  00 00 00 15 00 15 1c 15  |.{..............|
00000080  42 2c 15 02 15 00 15 06  15 06 1c 58 08 19 62 dc  |B,.........X..b.|
00000090  06 43 b7 55 c0 18 08 19  62 dc 06 43 b7 55 c0 00  |.C.U....b..C.U..|

00000970  41 42 51 41 45 41 41 4f  41 41 38 41 42 41 41 41  |ABQAEAAOAA8ABAAA|
00000980  41 41 67 41 45 41 41 41  41 42 67 41 41 41 41 67  |AAgAEAAAABgAAAAg|
00000990  41 41 41 41 41 41 41 42  41 68 77 41 41 41 41 49  |AAAAAAABAhwAAAAI|
000009a0  41 41 77 41 42 41 41 4c  41 41 67 41 41 41 42 41  |AAwABAALAAgAAABA|
000009b0  41 41 41 41 41 41 41 41  41 51 41 41 41 41 41 45  |AAAAAAAAAQAAAAAE|
000009c0  41 41 41 41 59 58 4a 6c  59 51 41 41 41 41 41 3d  |AAAAYXJlYQAAAAA=|
000009d0  00 18 19 70 61 72 71 75  65 74 2d 72 73 20 76 65  |...parquet-rs ve|
000009e0  72 73 69 6f 6e 20 32 33  2e 30 2e 30 00 fe 04 00  |rsion 23.0.0....|
000009f0  00 50 41 52 31                                    |.PAR1|
000009f5
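
The tail of the dump can be decoded by hand: a Parquet file ends with a 4-byte little-endian footer length followed by the `PAR1` magic, and the writer string sits in plain text inside the footer. Below is a minimal sketch of that check using only the Python standard library. The function name `read_parquet_trailer` is hypothetical, and the regular expression is a heuristic that assumes the writer identifies itself as `parquet-rs version X.Y.Z`; it does not parse the Thrift-encoded footer properly.

```python
import re
import struct

PARQUET_MAGIC = b"PAR1"


def read_parquet_trailer(data: bytes) -> dict:
    """Return the footer length and writer string found in a Parquet file's bytes.

    Hypothetical helper for illustration; it does not decode the Thrift footer.
    """
    if len(data) < 12 or data[:4] != PARQUET_MAGIC or data[-4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file: missing PAR1 magic")

    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    if footer_len > len(data) - 12:
        raise ValueError("footer length is larger than the file")

    footer = data[len(data) - 8 - footer_len:len(data) - 8]

    # Heuristic (assumption): look for the plain-text writer string in the footer.
    match = re.search(rb"parquet-rs version [0-9]+(?:\.[0-9]+)*", footer)
    created_by = match.group(0).decode("ascii") if match else None
    return {"footer_len": footer_len, "created_by": created_by}


if __name__ == "__main__":
    with open("lz4.pq", "rb") as fh:
        print(read_parquet_trailer(fh.read()))
```

For the file above, the last eight bytes `fe 04 00 00 50 41 52 31` give a footer length of 0x04fe (1278 bytes), and the writer string in the footer reads `parquet-rs version 23.0.0`.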
$ ipython
In [1]: import pyarrow.parquet as pq

In [2]: pf = pq.ParquetFile('lz4.pq')

In [3]: pf
Out[3]: <pyarrow.parquet.core.ParquetFile at 0x10ca1cd90>

In [4]: pf.schema
Out[4]:
<pyarrow._parquet.ParquetSchema object at 0x10e74b280>
required group field_id=-1 arrow_schema {
  optional int64 field_id=-1 area;
  optional double field_id=-1 centroid_x;
  optional double field_id=-1 centroid_y;
  optional binary field_id=-1 geom (String);
  optional binary field_id=-1 h3_7 (String);
  optional binary field_id=-1 h3_8 (String);
  optional binary field_id=-1 h3_9 (String);
}

In [6]: pf.read()

# OSError: Corrupt Lz4 compressed data.
$ clickhouse client
CREATE TABLE pq_test (
    area Nullable(Int64),
    centroid_x Nullable(Float64),
    centroid_y Nullable(Float64),
    geom Nullable(String),
    h3_7 Nullable(String),
    h3_8 Nullable(String),
    h3_9 Nullable(String))
ENGINE = "Log";
$ clickhouse client \
    --query='INSERT INTO pq_test FORMAT Parquet' \
    < lz4.pq
Code: 33. DB::ParsingException: Error while reading Parquet data: IOError: Corrupt Lz4 compressed data.: While executing ParquetBlockInputFormat: data for INSERT was parsed from stdin: (in query: INSERT INTO pq_test FORMAT Parquet). (CANNOT_READ_ALL_DATA)
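
One detail in the hexdump may be relevant to both errors: the bytes `04 22 4d 18` at offset 0x2b are the magic number of the LZ4 frame format (0x184D2204, little-endian). A possible explanation, which is an assumption and not confirmed anywhere in this thread, is that the writer emitted LZ4 frames while the readers expected a different LZ4 layout. The sketch below only locates that magic number in a file's bytes. The helper name `find_lz4_frame_magic` is hypothetical, and a match is a hint, not proof, since the same four bytes can also occur by chance in column data.

```python
from typing import List

# LZ4 frame format magic number 0x184D2204, stored little-endian on disk.
LZ4_FRAME_MAGIC = bytes.fromhex("04224d18")


def find_lz4_frame_magic(data: bytes) -> List[int]:
    """Return every byte offset at which the LZ4 frame magic number occurs."""
    offsets = []
    start = data.find(LZ4_FRAME_MAGIC)
    while start != -1:
        offsets.append(start)
        start = data.find(LZ4_FRAME_MAGIC, start + 1)
    return offsets


if __name__ == "__main__":
    with open("lz4.pq", "rb") as fh:
        hits = find_lz4_frame_magic(fh.read())
    print([hex(offset) for offset in hits])
```

Running it against both the failing file and one written by the rebuilt binary would show whether the page layout changed between the two versions.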
@marklit
Author

marklit commented Jan 3, 2023

The project bumped its arrow version from 23 to 29 two days ago but hadn't yet published a new release. I've built the project from its main branch, and the steps above now work without issue.

@marklit marklit closed this as completed Jan 3, 2023
@tustvold tustvold added the parquet Changes to the parquet crate label Jan 4, 2023