Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update matrix for parquet-cpp and parquet-java #100

Open
wants to merge 5 commits into
base: production
Choose a base branch
from

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Feb 5, 2025

No description provided.

@wgtmac
Copy link
Member Author

wgtmac commented Feb 5, 2025

Comment on lines 48 to 66
| STRING | | | | | ✅ |
| ENUM | | | | | ❌ |
| UUID | | | | | ❌ |
| 8, 16, 32, 64 bit signed and unsigned INT | | | | | ✅ |
| DECIMAL (INT32) | | | | | ✅ |
| DECIMAL (INT64) | | | | | ✅ |
| DECIMAL (BYTE_ARRAY) | | | | | ✅ |
| DECIMAL (FIXED_LEN_BYTE_ARRAY) | | | | | ✅ |
| DATE | | | | | ✅ |
| TIME (INT32) | | | | | ✅ |
| TIME (INT64) | | | | | ✅ |
| TIMESTAMP (INT64) | | | | | ✅ |
| INTERVAL | | | | | ❌ |
| JSON | | | | | ❌ |
| BSON | | | | | ❌ |
| LIST | | | | | ✅ |
| MAP | | | | | ✅ |
| UNKNOWN (always null) | | | | | ✅ |
| FLOAT16 | | | | | ✅ |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we mean by logical types are supported? Is it like we won't throw an exception and return something at least, or we expect to actually represent the related value correctly?
Since parquet-java does not have its own representation type system (except the one in the example package that mainly handles the primitive types only), the question is whether at least one binding supports the related type. I am not aware that any binding supports INTERVAL or BSON. JSON is mostly handle as string without any syntax check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think won't throw should be considered as supported. Let me remove INTERVAL and BSON.

Just to confirm that ENUM and UUID are supported by the Avro binding, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, UUID is mapped to the Avro UUID type, ENUM is mapped to string.

But, JSON and FLOAT16 do not seem to be supported either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think FLOAT16 should be considered as supported: apache/parquet-java#1142

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, that's why I asked what supporting a logical type mean in terms of parquet-java where we do not have our own representation type system. For FLOAT16, we return the containing 2 bytes as a Binary in our example package. We do not map this type in any of our bindings.
If we consider FLOAT16 as supported, then all the others are supported as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we consider FLOAT16 as supported, then all the others are supported as well.

Yes, that's why I marked all logical types as supported in the initial commit. Developers are able to implement their own bindings to write and read such types by extending the parquet-java library.

Is there value in adding another option here (P perhaps) that means the implementation at least supports the logical type enum value but will return the physical type?

I'm not sure. Maybe it is a minimum requirement for implementation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac, If we do not want to distinguish the support level of e.g. FLOAT16 and UUID, I'm fine having all marked as supported.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's why I marked all logical types as supported in the initial commit. Developers are able to implement their own bindings to write and read such types by extending the parquet-java library.

You should then add a note or asterisk to point out that "supported" means the physical type is returned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Added asterisk to explain that.

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ part LGTM

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some small comments, but this looks great @wgtmac Thanks for updating this 👍

| DATE | ✅ | ✅ | | | ✅ |
| TIME (INT32) | ✅ | ✅ | | | ✅ |
| TIME (INT64) | ✅ | ✅ | | | ✅ |
| TIMESTAMP (INT64) | ✅ | ✅ | | | ✅ |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we split this out on the unit as well?

Suggested change
| TIMESTAMP (INT64) ||| | ||
| TIMESTAMP (INT64, MICROS) ||| | ||

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that? Any unsupported unit in parquet-java?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants