Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update matrix for parquet-cpp and parquet-java #100

Open
wants to merge 5 commits into
base: production
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
113 changes: 56 additions & 57 deletions content/en/docs/File Format/implementationstatus.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,14 +29,14 @@ Implementations:

| Data type | C++ | Java | Go | Rust | cuDF |
| ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
| BOOLEAN | | | | | ✅ |
| INT32 | | | | | ✅ |
| INT64 | | | | | ✅ |
| INT96 (1) | | | | | ✅ |
| FLOAT | | | | | ✅ |
| DOUBLE | | | | | ✅ |
| BYTE_ARRAY | | | | | ✅ |
| FIXED_LEN_BYTE_ARRAY | | | | | ✅ |
| BOOLEAN | | | | | ✅ |
| INT32 | | | | | ✅ |
| INT64 | | | | | ✅ |
| INT96 (1) | | | | | ✅ |
| FLOAT | | | | | ✅ |
| DOUBLE | | | | | ✅ |
| BYTE_ARRAY | | | | | ✅ |
| FIXED_LEN_BYTE_ARRAY | | | | | ✅ |

* \(1) This type is deprecated, but as of 2024 it's common in currently produced parquet files

Expand All @@ -45,64 +45,63 @@ Implementations:

| Data type | C++ | Java | Go | Rust | cuDF |
| ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
| STRING | | | | | ✅ |
| ENUM | | | | | ❌ |
| UUID | | | | | ❌ |
| 8, 16, 32, 64 bit signed and unsigned INT | | | | | ✅ |
| DECIMAL (INT32) | | | | | ✅ |
| DECIMAL (INT64) | | | | | ✅ |
| DECIMAL (BYTE_ARRAY) | | | | | ✅ |
| DECIMAL (FIXED_LEN_BYTE_ARRAY) | | | | | ✅ |
| DATE | | | | | ✅ |
| TIME (INT32) | | | | | ✅ |
| TIME (INT64) | | | | | ✅ |
| TIMESTAMP (INT64) | | | | | ✅ |
| INTERVAL | | | | | ❌ |
| JSON | | | | | ❌ |
| BSON | | | | | ❌ |
| LIST | | | | | ✅ |
| MAP | | | | | ✅ |
| UNKNOWN (always null) | | | | | ✅ |
| FLOAT16 | | | | | ✅ |
| STRING | | | | | ✅ |
| ENUM | | | | | ❌ |
| UUID | | | | | ❌ |
| 8, 16, 32, 64 bit signed and unsigned INT | | | | | ✅ |
| DECIMAL (INT32) | | | | | ✅ |
| DECIMAL (INT64) | | | | | ✅ |
| DECIMAL (BYTE_ARRAY) | | | | | ✅ |
| DECIMAL (FIXED_LEN_BYTE_ARRAY) | | | | | ✅ |
| DATE | | | | | ✅ |
| TIME (INT32) | | | | | ✅ |
| TIME (INT64) | | | | | ✅ |
| TIMESTAMP (INT64) | | | | | ✅ |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we split this out on the unit as well?

Suggested change
| TIMESTAMP (INT64) ||| | ||
| TIMESTAMP (INT64, MICROS) ||| | ||

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that? Any unsupported unit in parquet-java?

| INTERVAL | | | | | ❌ |
| JSON | | | | | ❌ |
| BSON | | | | | ❌ |
| LIST | | | | | ✅ |
| MAP | | | | | ✅ |
| UNKNOWN (always null) | | | | | ✅ |
wgtmac marked this conversation as resolved.
Show resolved Hide resolved
| FLOAT16 | | | | | ✅ |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we mean by logical types are supported? Is it like we won't throw an exception and return something at least, or we expect to actually represent the related value correctly?
Since parquet-java does not have its own representation type system (except the one in the example package that mainly handles the primitive types only), the question is whether at least one binding supports the related type. I am not aware that any binding supports INTERVAL or BSON. JSON is mostly handle as string without any syntax check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think won't throw should be considered as supported. Let me remove INTERVAL and BSON.

Just to confirm that ENUM and UUID are supported by the Avro binding, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, UUID is mapped to the Avro UUID type, ENUM is mapped to string.

But, JSON and FLOAT16 do not seem to be supported either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think FLOAT16 should be considered as supported: apache/parquet-java#1142

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, that's why I asked what supporting a logical type mean in terms of parquet-java where we do not have our own representation type system. For FLOAT16, we return the containing 2 bytes as a Binary in our example package. We do not map this type in any of our bindings.
If we consider FLOAT16 as supported, then all the others are supported as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we consider FLOAT16 as supported, then all the others are supported as well.

Yes, that's why I marked all logical types as supported in the initial commit. Developers are able to implement their own bindings to write and read such types by extending the parquet-java library.

Is there value in adding another option here (P perhaps) that means the implementation at least supports the logical type enum value but will return the physical type?

I'm not sure. Maybe it is a minimum requirement for implementation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac, If we do not want to distinguish the support level of e.g. FLOAT16 and UUID, I'm fine having all marked as supported.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's why I marked all logical types as supported in the initial commit. Developers are able to implement their own bindings to write and read such types by extending the parquet-java library.

You should then add a note or asterisk to point out that "supported" means the physical type is returned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Added asterisk to explain that.


### Encodings

| Encoding | C++ | Java | Go | Rust | cuDF |
| ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
| PLAIN | | | | | ✅ |
| PLAIN_DICTIONARY | | | | | ✅ |
| RLE_DICTIONARY | | | | | ✅ |
| RLE | | | | | ✅ |
| BIT_PACKED (deprecated) | | | | | (R) |
| DELTA_BINARY_PACKED | | | | | ✅ |
| DELTA_LENGTH_BYTE_ARRAY | | | | | ✅ |
| DELTA_BYTE_ARRAY | | | | | ✅ |
| BYTE_STREAM_SPLIT | | | | | ✅ |
| PLAIN | | | | | ✅ |
| PLAIN_DICTIONARY | | | | | ✅ |
| RLE_DICTIONARY | | | | | ✅ |
| RLE | | | | | ✅ |
| BIT_PACKED (deprecated) | | | | | (R) |
| DELTA_BINARY_PACKED | | | | | ✅ |
| DELTA_LENGTH_BYTE_ARRAY | | | | | ✅ |
| DELTA_BYTE_ARRAY | | | | | ✅ |
| BYTE_STREAM_SPLIT | | | | | ✅ |

### Compressions

| Compression | C++ | Java | Go | Rust | cuDF |
| ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
| UNCOMPRESSED | | | | | ✅ |
| BROTLI | | | | | (R) |
| GZIP | | | | | (R) |
| LZ4 (deprecated) | | | | | ❌ |
| LZ4_RAW | | | | | ✅ |
| LZO | | | | | ❌ |
| SNAPPY | | | | | ✅ |
| ZSTD | | | | | ✅ |
| UNCOMPRESSED | ✅ | ✅ | | | ✅ |
| GZIP | ✅ | ✅ | | | (R) |
| LZ4 (deprecated) | ✅ | ✅ | | | ❌ |
| LZ4_RAW | ✅ | ✅ | | | ✅ |
| LZO | ❌ | ✅ | | | ❌ |
| SNAPPY | ✅ | ✅ | | | ✅ |
| ZSTD | ✅ | ✅ | | | ✅ |
wgtmac marked this conversation as resolved.
Show resolved Hide resolved

### Other format level features

| | C++ | Java | Go | Rust | cuDF |
| ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
| xxHash-based bloom filters | | | | | (R) |
| Bloom filter length (1) | | | | | (R) |
| Statistics min_value, max_value | | | | | ✅ |
| Page index | | | | | ✅ |
| Page CRC32 checksum | | | | | ❌ |
| Modular encryption | | | | | ❌ |
| Size statistics (2) | | | | | ✅ |
| xxHash-based bloom filters | (R) | | | | (R) |
| Bloom filter length (1) | (R) | | | | (R) |
| Statistics min_value, max_value | | | | | ✅ |
| Page index | | | | | ✅ |
| Page CRC32 checksum | | | | | ❌ |
| Modular encryption | | | | | ❌ |
| Size statistics (2) | | | | | ✅ |


* \(1) In parquet.thrift: ColumnMetaData->bloom_filter_length
Expand All @@ -113,12 +112,12 @@ Implementations:

| Format | C++ | Java | Go | Rust | cuDF |
| -------------------------------------------- | ----- | ----- | ----- | ----- | ----- |
| External column data (1) | | | | | (W) |
| Row group "Sorting column" metadata (2) | | | | | (W) |
| Row group pruning using statistics | | | | | ✅ |
| Row group pruning using bloom filter | | | | | ✅ |
| Reading select columns only | | | | | ✅ |
| Page pruning using statistics | | | | | ❌ |
| External column data (1) | | | | | (W) |
wgtmac marked this conversation as resolved.
Show resolved Hide resolved
| Row group "Sorting column" metadata (2) | | | | | (W) |
| Row group pruning using statistics | | | | | ✅ |
wgtmac marked this conversation as resolved.
Show resolved Hide resolved
| Row group pruning using bloom filter | | | | | ✅ |
| Reading select columns only | | | | | ✅ |
| Page pruning using statistics | | | | | ❌ |


* \(1) In parquet.thrift: ColumnChunk->file_path
Expand Down