Update matrix for parquet-cpp and parquet-java #100
116 changes: 59 additions & 57 deletions content/en/docs/File Format/implementationstatus.md
@@ -29,14 +29,14 @@ Implementations:

| Data type | C++ | Java | Go | Rust | cuDF |
| ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| BOOLEAN | | | | | ✅ |
-| INT32 | | | | | ✅ |
-| INT64 | | | | | ✅ |
-| INT96 (1) | | | | | ✅ |
-| FLOAT | | | | | ✅ |
-| DOUBLE | | | | | ✅ |
-| BYTE_ARRAY | | | | | ✅ |
-| FIXED_LEN_BYTE_ARRAY | | | | | ✅ |
+| BOOLEAN | | | | | ✅ |
+| INT32 | | | | | ✅ |
+| INT64 | | | | | ✅ |
+| INT96 (1) | | | | | ✅ |
+| FLOAT | | | | | ✅ |
+| DOUBLE | | | | | ✅ |
+| BYTE_ARRAY | | | | | ✅ |
+| FIXED_LEN_BYTE_ARRAY | | | | | ✅ |

* \(1) This type is deprecated, but as of 2024 it's common in currently produced parquet files

@@ -45,64 +45,66 @@ Implementations:

| Data type | C++ | Java | Go | Rust | cuDF |
| ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| STRING | | | | | ✅ |
-| ENUM | | | | | ❌ |
-| UUID | | | | | ❌ |
-| 8, 16, 32, 64 bit signed and unsigned INT | | | | | ✅ |
-| DECIMAL (INT32) | | | | | ✅ |
-| DECIMAL (INT64) | | | | | ✅ |
-| DECIMAL (BYTE_ARRAY) | | | | | ✅ |
-| DECIMAL (FIXED_LEN_BYTE_ARRAY) | | | | | ✅ |
-| DATE | | | | | ✅ |
-| TIME (INT32) | | | | | ✅ |
-| TIME (INT64) | | | | | ✅ |
-| TIMESTAMP (INT64) | | | | | ✅ |
-| INTERVAL | | | | | ❌ |
-| JSON | | | | | ❌ |
-| BSON | | | | | ❌ |
-| LIST | | | | | ✅ |
-| MAP | | | | | ✅ |
-| UNKNOWN (always null) | | | | | ✅ |
-| FLOAT16 | | | | | ✅ |
+| STRING | ✅ | ✅ | | | ✅ |
+| ENUM | ❌ | ✅ | | | ❌ |
+| UUID | ❌ | ✅ | | | ❌ |
+| 8, 16, 32, 64 bit signed and unsigned INT | ✅ | ✅ | | | ✅ |
+| DECIMAL (INT32) | ✅ | ✅ | | | ✅ |
+| DECIMAL (INT64) | ✅ | ✅ | | | ✅ |
+| DECIMAL (BYTE_ARRAY) | ✅ | ✅ | | | ✅ |
+| DECIMAL (FIXED_LEN_BYTE_ARRAY) | ✅ | ✅ | | | ✅ |
+| DATE | ✅ | ✅ | | | ✅ |
+| TIME (INT32) | ✅ | ✅ | | | ✅ |
+| TIME (INT64) | ✅ | ✅ | | | ✅ |
+| TIMESTAMP (INT64) | ✅ | ✅ | | | ✅ |

Contributor:

Should we split this out on the unit as well?

Suggested change
-| TIMESTAMP (INT64) | ✅ | ✅ | | | ✅ |
+| TIMESTAMP (INT64, MICROS) | ✅ | ✅ | | | ✅ |

Member Author:

Why is that? Any unsupported unit in parquet-java?

+| INTERVAL | ✅ | ✅(*)| | | ❌ |
+| JSON | ✅ | ✅(*)| | | ❌ |
+| BSON | ❌ | ✅(*)| | | ❌ |
+| LIST | ✅ | ✅ | | | ✅ |
+| MAP | ✅ | ✅ | | | ✅ |
+| UNKNOWN (always null) | ✅ | ✅ | | | ✅ |
+| FLOAT16 | ✅ | ✅(*)| | | ✅ |
+
+(*): Only supported to use its annotated physical type

### Encodings

| Encoding | C++ | Java | Go | Rust | cuDF |
| ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| PLAIN | | | | | ✅ |
-| PLAIN_DICTIONARY | | | | | ✅ |
-| RLE_DICTIONARY | | | | | ✅ |
-| RLE | | | | | ✅ |
-| BIT_PACKED (deprecated) | | | | | (R) |
-| DELTA_BINARY_PACKED | | | | | ✅ |
-| DELTA_LENGTH_BYTE_ARRAY | | | | | ✅ |
-| DELTA_BYTE_ARRAY | | | | | ✅ |
-| BYTE_STREAM_SPLIT | | | | | ✅ |
+| PLAIN | | | | | ✅ |
+| PLAIN_DICTIONARY | | | | | ✅ |
+| RLE_DICTIONARY | | | | | ✅ |
+| RLE | | | | | ✅ |
+| BIT_PACKED (deprecated) | | | | | (R) |
+| DELTA_BINARY_PACKED | | | | | ✅ |
+| DELTA_LENGTH_BYTE_ARRAY | | | | | ✅ |
+| DELTA_BYTE_ARRAY | | | | | ✅ |
+| BYTE_STREAM_SPLIT | | | | | ✅ |

### Compressions

| Compression | C++ | Java | Go | Rust | cuDF |
| ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| UNCOMPRESSED | | | | | ✅ |
-| BROTLI | | | | | (R) |
-| GZIP | | | | | (R) |
-| LZ4 (deprecated) | | | | | ❌ |
-| LZ4_RAW | | | | | ✅ |
-| LZO | | | | | ❌ |
-| SNAPPY | | | | | ✅ |
-| ZSTD | | | | | ✅ |
+| UNCOMPRESSED | | | | | ✅ |
+| BROTLI | | | | | (R) |
+| GZIP | | | | | (R) |
+| LZ4 (deprecated) | | | | | ❌ |
+| LZ4_RAW | | | | | ✅ |
+| LZO | | | | | ❌ |
+| SNAPPY | | | | | ✅ |
+| ZSTD | | | | | ✅ |

### Other format level features

| | C++ | Java | Go | Rust | cuDF |
| ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| xxHash-based bloom filters | | | | | (R) |
-| Bloom filter length (1) | | | | | (R) |
-| Statistics min_value, max_value | | | | | ✅ |
-| Page index | | | | | ✅ |
-| Page CRC32 checksum | | | | | ❌ |
-| Modular encryption | | | | | ❌ |
-| Size statistics (2) | | | | | ✅ |
+| xxHash-based bloom filters | (R) | | | | (R) |
+| Bloom filter length (1) | (R) | | | | (R) |
+| Statistics min_value, max_value | | | | | ✅ |
+| Page index | | | | | ✅ |
+| Page CRC32 checksum | | | | | ❌ |
+| Modular encryption | | | | | ❌ |
+| Size statistics (2) | | | | | ✅ |


* \(1) In parquet.thrift: ColumnMetaData->bloom_filter_length
@@ -113,12 +115,12 @@ Implementations:

| Format | C++ | Java | Go | Rust | cuDF |
| -------------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| External column data (1) | | | | | (W) |
-| Row group "Sorting column" metadata (2) | | | | | (W) |
-| Row group pruning using statistics | | | | | ✅ |
-| Row group pruning using bloom filter | | | | | ✅ |
-| Reading select columns only | | | | | ✅ |
-| Page pruning using statistics | | | | | ❌ |
+| External column data (1) | | | | | (W) |
+| Row group "Sorting column" metadata (2) | | | | | (W) |
+| Row group pruning using statistics | | | | | ✅ |
+| Row group pruning using bloom filter | | | | | ✅ |
+| Reading select columns only | | | | | ✅ |
+| Page pruning using statistics | | | | | ❌ |


* \(1) In parquet.thrift: ColumnChunk->file_path