Update matrix for parquet-cpp and parquet-java #100
base: production
Changes from 1 commit
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -29,14 +29,14 @@ Implementations: | |
|
||
| Data type | C++ | Java | Go | Rust | cuDF | | ||
| ----------------------------------------- | ----- | ----- | ----- | ----- | ----- | | ||
| BOOLEAN | | | | | ✅ | | ||
| INT32 | | | | | ✅ | | ||
| INT64 | | | | | ✅ | | ||
| INT96 (1) | | | | | ✅ | | ||
| FLOAT | | | | | ✅ | | ||
| DOUBLE | | | | | ✅ | | ||
| BYTE_ARRAY | | | | | ✅ | | ||
| FIXED_LEN_BYTE_ARRAY | | | | | ✅ | | ||
| BOOLEAN | ✅ | ✅ | | | ✅ | | ||
| INT32 | ✅ | ✅ | | | ✅ | | ||
| INT64 | ✅ | ✅ | | | ✅ | | ||
| INT96 (1) | ✅ | ✅ | | | ✅ | | ||
| FLOAT | ✅ | ✅ | | | ✅ | | ||
| DOUBLE | ✅ | ✅ | | | ✅ | | ||
| BYTE_ARRAY | ✅ | ✅ | | | ✅ | | ||
| FIXED_LEN_BYTE_ARRAY | ✅ | ✅ | | | ✅ | | ||
|
||
* \(1) This type is deprecated, but as of 2024 it's common in currently produced parquet files | ||
|
||
|
@@ -45,64 +45,63 @@ Implementations: | |
|
||
| Data type | C++ | Java | Go | Rust | cuDF | | ||
| ----------------------------------------- | ----- | ----- | ----- | ----- | ----- | | ||
| STRING | | | | | ✅ | | ||
| ENUM | | | | | ❌ | | ||
| UUID | | | | | ❌ | | ||
| 8, 16, 32, 64 bit signed and unsigned INT | | | | | ✅ | | ||
| DECIMAL (INT32) | | | | | ✅ | | ||
| DECIMAL (INT64) | | | | | ✅ | | ||
| DECIMAL (BYTE_ARRAY) | | | | | ✅ | | ||
| DECIMAL (FIXED_LEN_BYTE_ARRAY) | | | | | ✅ | | ||
| DATE | | | | | ✅ | | ||
| TIME (INT32) | | | | | ✅ | | ||
| TIME (INT64) | | | | | ✅ | | ||
| TIMESTAMP (INT64) | | | | | ✅ | | ||
| INTERVAL | | | | | ❌ | | ||
| JSON | | | | | ❌ | | ||
| BSON | | | | | ❌ | | ||
| LIST | | | | | ✅ | | ||
| MAP | | | | | ✅ | | ||
| UNKNOWN (always null) | | | | | ✅ | | ||
| FLOAT16 | | | | | ✅ | | ||
| STRING | ✅ | ✅ | | | ✅ | | ||
| ENUM | ❌ | ✅ | | | ❌ | | ||
| UUID | ❌ | ✅ | | | ❌ | | ||
| 8, 16, 32, 64 bit signed and unsigned INT | ✅ | ✅ | | | ✅ | | ||
| DECIMAL (INT32) | ✅ | ✅ | | | ✅ | | ||
| DECIMAL (INT64) | ✅ | ✅ | | | ✅ | | ||
| DECIMAL (BYTE_ARRAY) | ✅ | ✅ | | | ✅ | | ||
| DECIMAL (FIXED_LEN_BYTE_ARRAY) | ✅ | ✅ | | | ✅ | | ||
| DATE | ✅ | ✅ | | | ✅ | | ||
| TIME (INT32) | ✅ | ✅ | | | ✅ | | ||
| TIME (INT64) | ✅ | ✅ | | | ✅ | | ||
| TIMESTAMP (INT64) | ✅ | ✅ | | | ✅ | | ||
| INTERVAL | ✅ | ✅ | | | ❌ | | ||
| JSON | ✅ | ✅ | | | ❌ | | ||
| BSON | ❌ | ✅ | | | ❌ | | ||
| LIST | ✅ | ✅ | | | ✅ | | ||
| MAP | ✅ | ✅ | | | ✅ | | ||
| UNKNOWN (always null) | ✅ | ✅ | | | ✅ | | ||
wgtmac marked this conversation as resolved.
|
||
| FLOAT16 | ✅ | ✅ | | | ✅ | | ||
What do we mean by logical types are supported? Is it like we won't throw an exception and return something at least, or we expect to actually represent the related value correctly?
I don't think won't throw should be considered as supported. Let me remove Just to confirm that
Yes, But,
I think
So, that's why I asked what supporting a logical type mean in terms of
Yes, that's why I marked all logical types as supported in the initial commit. Developers are able to implement their own bindings to write and read such types by extending the
I'm not sure. Maybe it is a minimum requirement for implementation?
@wgtmac, If we do not want to distinguish the support level of e.g.
OK, fixed.
You should then add a note or asterisk to point out that "supported" means the physical type is returned.
Sounds good. Added asterisk to explain that. |
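To make the distinction in this thread concrete, here is a minimal Python sketch of "the physical type is returned" versus "the logical value is represented". It is not taken from parquet-cpp or parquet-java. The helper names `decimal_from_int` and `uuid_from_fixed_bytes` and the sample values are hypothetical. It assumes the Parquet format's definitions that a DECIMAL is an unscaled integer times 10^-scale and that a UUID is a 16-byte big-endian FIXED_LEN_BYTE_ARRAY.

```python
import uuid
from decimal import Decimal


def decimal_from_int(unscaled: int, scale: int) -> Decimal:
    """Hypothetical helper: interpret an INT32/INT64 physical value as DECIMAL.

    The logical value is unscaled * 10^-scale.
    """
    return Decimal(unscaled).scaleb(-scale)


def uuid_from_fixed_bytes(raw: bytes) -> uuid.UUID:
    """Hypothetical helper: interpret a 16-byte FIXED_LEN_BYTE_ARRAY as a UUID."""
    if len(raw) != 16:
        raise ValueError("a UUID needs exactly 16 bytes")
    # uuid.UUID(bytes=...) expects big-endian byte order.
    return uuid.UUID(bytes=raw)


# What a reader that only "returns the physical type" would hand back
# (made-up sample values, not read from a real file).
physical_decimal = 12345
physical_uuid = bytes(range(16))

# What a reader that "represents the logical value" would hand back.
logical_decimal = decimal_from_int(physical_decimal, scale=2)
logical_uuid = uuid_from_fixed_bytes(physical_uuid)

print(physical_decimal, "->", logical_decimal)
print(physical_uuid.hex(), "->", logical_uuid)
```

Under the first reading, a ✅ only promises that the raw integer or byte string comes back. Under the second, the caller gets the converted value without writing such helpers.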
||
|
||
### Encodings | ||
|
||
| Encoding | C++ | Java | Go | Rust | cuDF | | ||
| ----------------------------------------- | ----- | ----- | ----- | ----- | ----- | | ||
| PLAIN | | | | | ✅ | | ||
| PLAIN_DICTIONARY | | | | | ✅ | | ||
| RLE_DICTIONARY | | | | | ✅ | | ||
| RLE | | | | | ✅ | | ||
| BIT_PACKED (deprecated) | | | | | (R) | | ||
| DELTA_BINARY_PACKED | | | | | ✅ | | ||
| DELTA_LENGTH_BYTE_ARRAY | | | | | ✅ | | ||
| DELTA_BYTE_ARRAY | | | | | ✅ | | ||
| BYTE_STREAM_SPLIT | | | | | ✅ | | ||
| PLAIN | ✅ | ✅ | | | ✅ | | ||
| PLAIN_DICTIONARY | ✅ | ✅ | | | ✅ | | ||
| RLE_DICTIONARY | ✅ | ✅ | | | ✅ | | ||
| RLE | ✅ | ✅ | | | ✅ | | ||
| BIT_PACKED (deprecated) | ✅ | ✅ | | | (R) | | ||
| DELTA_BINARY_PACKED | ✅ | ✅ | | | ✅ | | ||
| DELTA_LENGTH_BYTE_ARRAY | ✅ | ✅ | | | ✅ | | ||
| DELTA_BYTE_ARRAY | ✅ | ✅ | | | ✅ | | ||
| BYTE_STREAM_SPLIT | ✅ | ✅ | | | ✅ | | ||
|
||
### Compressions | ||
|
||
| Compression | C++ | Java | Go | Rust | cuDF | | ||
| ----------------------------------------- | ----- | ----- | ----- | ----- | ----- | | ||
| UNCOMPRESSED | | | | | ✅ | | ||
| BROTLI | | | | | (R) | | ||
| GZIP | | | | | (R) | | ||
| LZ4 (deprecated) | | | | | ❌ | | ||
| LZ4_RAW | | | | | ✅ | | ||
| LZO | | | | | ❌ | | ||
| SNAPPY | | | | | ✅ | | ||
| ZSTD | | | | | ✅ | | ||
| UNCOMPRESSED | ✅ | ✅ | | | ✅ | | ||
| GZIP | ✅ | ✅ | | | (R) | | ||
| LZ4 (deprecated) | ✅ | ✅ | | | ❌ | | ||
| LZ4_RAW | ✅ | ✅ | | | ✅ | | ||
| LZO | ❌ | ✅ | | | ❌ | | ||
| SNAPPY | ✅ | ✅ | | | ✅ | | ||
| ZSTD | ✅ | ✅ | | | ✅ | | ||
wgtmac marked this conversation as resolved.
|
||
|
||
### Other format level features | ||
|
||
| | C++ | Java | Go | Rust | cuDF | | ||
| ----------------------------------------- | ----- | ----- | ----- | ----- | ----- | | ||
| xxHash-based bloom filters | | | | | (R) | | ||
| Bloom filter length (1) | | | | | (R) | | ||
| Statistics min_value, max_value | | | | | ✅ | | ||
| Page index | | | | | ✅ | | ||
| Page CRC32 checksum | | | | | ❌ | | ||
| Modular encryption | | | | | ❌ | | ||
| Size statistics (2) | | | | | ✅ | | ||
| xxHash-based bloom filters | (R) | ✅ | | | (R) | | ||
| Bloom filter length (1) | (R) | ✅ | | | (R) | | ||
| Statistics min_value, max_value | ✅ | ✅ | | | ✅ | | ||
| Page index | ✅ | ✅ | | | ✅ | | ||
| Page CRC32 checksum | ✅ | ✅ | | | ❌ | | ||
| Modular encryption | ✅ | ✅ | | | ❌ | | ||
| Size statistics (2) | ✅ | ✅ | | | ✅ | | ||
|
||
|
||
* \(1) In parquet.thrift: ColumnMetaData->bloom_filter_length | ||
|
@@ -113,12 +112,12 @@ Implementations: | |
|
||
| Format | C++ | Java | Go | Rust | cuDF | | ||
| -------------------------------------------- | ----- | ----- | ----- | ----- | ----- | | ||
| External column data (1) | | | | | (W) | | ||
| Row group "Sorting column" metadata (2) | | | | | (W) | | ||
| Row group pruning using statistics | | | | | ✅ | | ||
| Row group pruning using bloom filter | | | | | ✅ | | ||
| Reading select columns only | | | | | ✅ | | ||
| Page pruning using statistics | | | | | ❌ | | ||
| External column data (1) | ❌ | ✅ | | | (W) | | ||
wgtmac marked this conversation as resolved.
|
||
| Row group "Sorting column" metadata (2) | ✅ | ❌ | | | (W) | | ||
| Row group pruning using statistics | ✅ | ✅ | | | ✅ | | ||
wgtmac marked this conversation as resolved.
|
||
| Row group pruning using bloom filter | ❌ | ✅ | | | ✅ | | ||
| Reading select columns only | ✅ | ✅ | | | ✅ | | ||
| Page pruning using statistics | ❌ | ✅ | | | ❌ | | ||
|
||
|
||
* \(1) In parquet.thrift: ColumnChunk->file_path | ||
|
Should we split this out on the unit as well?
Why is that? Any unsupported unit in parquet-java?