PARQUET-758: Add Float16/Half-float logical type #184

anjakefala · 2022-08-26T21:21:14Z

On the mailing list, I proposed the addition of a Half Float (float16) type in Parquet: https://lists.apache.org/thread/03vmcj7ygwvsbno764vd1hr954p62zr5

This type is becoming increasingly popular in Machine Learning, and there are a bundle of issues requesting its support in Parquet:
https://issues.apache.org/jira/browse/PARQUET-1647
https://issues.apache.org/jira/browse/PARQUET-758
https://issues.apache.org/jira/browse/ARROW-17464
apache/arrow#2691

This is my first logical type proposal! I followed this PR as inspiration, but really welcome feedback from the community.

Implementation PRs:

C++ GH-36036: [C++][Python][Parquet] Implement Float16 logical type arrow#36073

LogicalTypes.md

src/main/thrift/parquet.thrift

pitrou · 2022-08-29T09:02:31Z

@anjakefala You need to add to the LogicalType union, not to the Type enum (which is for physical types).

Also cc @emkornfield

Type involves a trade-off of reduced precision, in exchange for more efficient storage.

emkornfield · 2022-08-30T06:05:34Z

We should probably specify that the Byte Split Encodings can be used for this type as well?

Also, in general, if possible try to avoid force pushing, as it makes it harder to compare iterative changes (this might not be the style in this repo, though, so if you found instructions elsewhere on this, please ignore).

emkornfield · 2022-08-30T06:08:18Z

It isn't clear to me if this should be a logical type or a physical type. We would need to understand if there is different handling for forward compatibility purposes (what do we want the desired behavior to be). I think C++ might be lenient here, but don't know about parquet-mr. @gszadovszky thoughts?

emkornfield · 2022-08-30T06:09:58Z

src/main/thrift/parquet.thrift

@@ -232,6 +232,7 @@ struct MapType {}     // see LogicalTypes.md
 struct ListType {}    // see LogicalTypes.md
 struct EnumType {}    // allowed for BINARY, must be encoded with UTF-8
 struct DateType {}    // allowed for INT32
+struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes


why not allow bit splitting?

@emkornfield What do you mean here?

Ah, perhaps you mean the BYTE_STREAM_SPLIT encoding?

yes. BYTE_STREAM_SPLIT

Well, I guess it wouldn't cost much to allow it (implementations would not support it at the start anyway).
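A minimal sketch of what BYTE_STREAM_SPLIT would mean for 2-byte values, assuming the usual definition of the encoding (byte i of every value goes to stream i, and the streams are concatenated). The function names are hypothetical and the little-endian layout of the sample values is an assumption, not something stated in this thread.

```python
import struct

def byte_stream_split_encode(values: list[bytes], width: int = 2) -> bytes:
    """Scatter byte i of every value into stream i, then concatenate the streams."""
    streams = [bytearray() for _ in range(width)]
    for value in values:
        if len(value) != width:
            raise ValueError("every value must be exactly `width` bytes")
        for i in range(width):
            streams[i].append(value[i])
    return b"".join(bytes(s) for s in streams)

def byte_stream_split_decode(data: bytes, width: int = 2) -> list[bytes]:
    """Inverse of the encoder: re-interleave the streams into fixed-width values."""
    count = len(data) // width
    return [bytes(data[s * count + n] for s in range(width)) for n in range(count)]

# Three half floats packed as little-endian 2-byte values (struct format "e").
raw = [struct.pack("<e", x) for x in (1.0, 1.5, -2.0)]
encoded = byte_stream_split_encode(raw)
```

With only two streams per value there is less structure to separate than for FLOAT or DOUBLE, so whether this helps compression is something to measure rather than assume.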

gszadovszky · 2022-08-30T07:53:11Z

It isn't clear to me if this should be a logical type or a physical type. We would need to understand if there is different handling for forward compatibility purposes (what do we want the desired behavior to be). I think C++ might be lenient here, but don't know about parquet-mr. @gszadovszky thoughts?

I think the basic idea behind having physical and logical types is to support forward compatibility, since we can always represent (somehow) a long-existing physical type while logical types are getting extended. Parquet-mr should work fine with "unknown" logical types by reading them back as un-annotated physical values (a Binary with two bytes in this case).
So, if the community supports having a half-precision floating point type I would vote on specifying it as a logical type.
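To make the forward-compatibility argument concrete, here is a hedged sketch of the two reader behaviours for a single 2-byte cell. `read_fixed2` and its `float16_aware` flag are hypothetical names for illustration only, and the little-endian IEEE 754 binary16 layout is an assumption.

```python
import struct

def read_fixed2(raw: bytes, float16_aware: bool):
    """Return what a reader would surface for one FIXED_LEN_BYTE_ARRAY(2) cell."""
    if len(raw) != 2:
        raise ValueError("expected exactly 2 bytes")
    if float16_aware:
        # Assumption: the two bytes hold an IEEE 754 binary16 value, little-endian.
        return struct.unpack("<e", raw)[0]
    # A reader that does not know the annotation falls back to the physical value.
    return raw

cell = bytes([0x00, 0x3C])  # bit pattern 0x3C00
```

An older reader still gets the bytes back unchanged; only the interpretation is lost.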

The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types, we still need to care about the logical types for the ordering (min/max values in the statistics). It would not be too easy to implement the half-precision floating point comparison logic since Java does not have such a primitive type. (BTW the sorting order of floating point numbers is still an open issue: PARQUET-1222)

pitrou · 2022-08-30T08:00:58Z

It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type.

While not effortless, it should be relatively easy to adapt one of the routines that's available from other open source projects, such as Numpy:
https://github.com/numpy/numpy/blob/8a0859835d3e6002858b9ffd9a232b059cf9ea6c/numpy/core/src/npymath/halffloat.c#L169-L190
(npy_half is just an unsigned 16-bit integer in this context)
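A rough sketch of a less-than comparison done directly on 16-bit patterns, in the spirit of the idea above. It is not a port of the linked NumPy routine; `half_lt`, `is_nan`, and `bits` are hypothetical helpers, and treating any comparison involving NaN as false is an assumption made here.

```python
import struct

def is_nan(h: int) -> bool:
    """NaN: exponent bits all ones and a non-zero mantissa."""
    return (h & 0x7C00) == 0x7C00 and (h & 0x03FF) != 0

def half_lt(a: int, b: int) -> bool:
    """Return True if half-float bit pattern a is strictly less than b."""
    if is_nan(a) or is_nan(b):
        return False
    a_neg, b_neg = bool(a & 0x8000), bool(b & 0x8000)
    a_mag, b_mag = a & 0x7FFF, b & 0x7FFF
    if a_neg and b_neg:
        # Both negative: the larger magnitude is the smaller value.
        return a_mag > b_mag
    if a_neg:
        # Negative vs. non-negative: less, unless both are zero (-0.0 == +0.0).
        return not (a_mag == 0 and b_mag == 0)
    if b_neg:
        return False
    return a_mag < b_mag

def bits(x: float) -> int:
    """Bit pattern of x rounded to half precision, as an unsigned 16-bit integer."""
    return struct.unpack("<H", struct.pack("<e", x))[0]
```

Because the comparison works on integers only, the same logic carries over to a language without a native half-float type.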

gszadovszky · 2022-08-30T08:21:44Z

It would not be too easy to implement the half-precision floating point comparison logic since java does not have such a primitive type.

While not effortless, it should be relatively easy to adapt one of the routines that's available from other open source projects, such as Numpy: https://github.com/numpy/numpy/blob/8a0859835d3e6002858b9ffd9a232b059cf9ea6c/numpy/core/src/npymath/halffloat.c#L169-L190 (npy_half is just an unsigned 16-bit integer in this context)

It is not that trivial. For half-precision floating point numbers we have no native support in either cpp or java, so we can define the total ordering as we want. But we should do the same for the existing floating point types, for which most languages have native support. Even though they follow the same standard, the total ordering either does not exist or has different implementations. See PARQUET-1222 for details.

emkornfield · 2022-08-31T03:35:04Z

It is not that trivial. For half-precision floating point numbers we have no native support in either cpp or java, so we can define the total ordering as we want. But we should do the same for the existing floating point types, for which most languages have native support. Even though they follow the same standard, the total ordering either does not exist or has different implementations. See PARQUET-1222 for details.

I think these are orthogonal. I might be missing something, but it seems like it would not be too hard to cast float16 to float in java/cpp, do the comparison in that space, and cast it back down. This might not be the most efficient implementation but would be straightforward? I am probably missing something. It would be nice to resolve PARQUET-1222 so the same semantics would apply to all floating point numbers.
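The widen, compare, narrow approach can be sketched as below. The helper names are hypothetical, the 2-byte cells are assumed to be little-endian binary16, and NaN and signed-zero handling are deliberately left out because that is the PARQUET-1222 question.

```python
import struct

def widen(raw: bytes) -> float:
    """Decode a 2-byte half float into a native (wider) float."""
    return struct.unpack("<e", raw)[0]

def narrow(x: float) -> bytes:
    """Encode a native float back into 2 half-float bytes."""
    return struct.pack("<e", x)

def min_max(cells: list[bytes]) -> tuple[bytes, bytes]:
    """Compare in the wide type, then cast the results back down."""
    values = [widen(c) for c in cells]
    return narrow(min(values)), narrow(max(values))

cells = [narrow(x) for x in (0.5, -3.0, 2.0)]
lo, hi = min_max(cells)
```

Every half-precision value is exactly representable in the wider type, so the round trip back to 2 bytes loses nothing.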

The tricky thing will be the implementations. Even though parquet-mr does not really care about converting the values according to their logical types we still need to care about the logical types at the ordering (min/max values in the statistics).

It seems this would require parquet implementations to null out statistics for logical types that they don't support, does parquet-mr do that today?

gszadovszky · 2022-08-31T05:49:21Z

I brought up this ordering issue because we specify it for every logical type. (Unfortunately we don't do this for primitive types.) Therefore, I would expect the order to be specified for this new logical type as well, which is not trivial and requires solving PARQUET-1222. At least we should add a note about this issue.

It seems this would require parquet implementations to null out statistics for logical types that they don't support, does parquet-mr do that today?

I do not have the proper environment to test it but based on the code we do not handle unknown logical types well in parquet-mr. I think it handles unknown logical types as if they were not there at all which is fine from the data point of view but we would blindly use the statistics which may cause data loss. Created PARQUET-2182 to track this.
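One way a reader could avoid trusting statistics it cannot interpret, as a purely illustrative sketch. `KNOWN_LOGICAL_TYPES` and `usable_statistics` are hypothetical and do not reflect the actual parquet-mr or Parquet C++ code.

```python
# Hypothetical set of annotations this reader understands; FLOAT16 is absent on purpose.
KNOWN_LOGICAL_TYPES = {"STRING", "DATE", "DECIMAL"}

def usable_statistics(logical_type, statistics):
    """Return the statistics only when the reader knows how the column is ordered."""
    if logical_type is not None and logical_type not in KNOWN_LOGICAL_TYPES:
        # Unknown annotation: the min/max may follow an ordering we cannot reproduce.
        return None
    return statistics

stats = {"min": b"\x00\xc0", "max": b"\x00\x3c"}
```

Dropping the statistics costs some filtering efficiency but avoids skipping row groups based on a min/max compared under the wrong order.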

emkornfield · 2022-09-04T05:34:53Z

I think Parquet C++ probably has the same issue. Let me reread PARQUET-1222 to see what the current state is and if we can push it forward.

pitrou · 2022-09-07T11:44:51Z

I agree with @emkornfield that ordering issues seem largely orthogonal, as they also affect FLOAT32 and FLOAT64 types.

anjakefala · 2022-09-29T22:26:00Z

@pitrou @emkornfield @gszadovszky

Is there anything I can do to move this addition forward? Can I help with any code?

In terms of design, my understanding from reading the comments is that @gszadovszky brought up an ordering concern (valid, but not a blocker?), and that a decision needs to be made on whether float16 would be implemented as a logical or physical type?

emkornfield · 2022-09-30T05:23:17Z

Sorry for the delay, it sounds like PARQUET-1222 is a blocker. Let me make a proposal there and see if we can at least come to consensus on an approach, and maybe this feature can be the first test case for it.

emkornfield · 2022-12-07T17:55:26Z

Sorry for the delay but PARQUET-1222 has now been merged, so I think this is unblocked.

anjakefala · 2022-12-07T22:07:47Z

Thanks so much for the update @emkornfield!

What is the next step I can take?

emkornfield · 2022-12-07T22:48:54Z

@anjakefala IIUC, the prior objection was about not sorting floating point values correctly for statistics. I think the main thing is to update the specification to say this requires the same sorting logic as float32 and float64. I need to re-review the current state of things, but then I think we can probably try to vote on the mailing list to see if this type is acceptable. I'm not sure of the exact process here (I don't know if implementations are required before a vote). @gszadovszky thoughts?

anjakefala · 2022-12-14T22:20:19Z

Thank you @emkornfield! I added the sort order to the spec.
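A hedged sketch of what "the same sorting logic as float32 and float64" could look like when writing min/max statistics. It assumes the FLOAT/DOUBLE writer rules are: never write NaN, write a zero minimum as -0.0, and write a zero maximum as +0.0. Those rules are not spelled out in this thread, and `float16_stats` is a hypothetical helper over little-endian 2-byte cells.

```python
import math
import struct

def float16_stats(cells: list[bytes]):
    """Return (min_bytes, max_bytes) for half-float cells, or None if only NaNs."""
    values = [struct.unpack("<e", c)[0] for c in cells]
    ordered = [v for v in values if not math.isnan(v)]  # NaNs never enter min/max
    if not ordered:
        return None
    lo, hi = min(ordered), max(ordered)
    if lo == 0.0:
        lo = -0.0  # a zero minimum is written as -0.0
    if hi == 0.0:
        hi = 0.0   # a zero maximum is written as +0.0
    return struct.pack("<e", lo), struct.pack("<e", hi)

zeros = [struct.pack("<e", x) for x in (float("nan"), 0.0, -0.0)]
```

Writing the widest possible zero on each side means a reader that distinguishes -0.0 from +0.0 will not wrongly exclude a row group.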

anjakefala · 2023-01-09T19:53:14Z

Hey @emkornfield! Is it reasonable for me to send a proposal to the mailing list for a vote? It seems @gszadovszky is not available for insight; is there anyone else that can provide it?

emkornfield · 2023-02-01T17:48:02Z

@shangxinli are there guidelines for what needs to happen to accept this addition?

pitrou · 2023-02-01T17:50:36Z

@shangxinli are there guidelines for what needs to happen to accept this addition?

I suppose it needs a discussion and then a formal vote on the ML?

shangxinli · 2023-02-01T18:27:40Z

As @julienledem mentioned in the email discussion, let's have the corresponding PRs for support in the Java and C++ implementations ready before we merge this PR. We would like to have implementation support when the new type is released.

anjakefala · 2023-10-04T17:13:07Z

It seems that both the Java implementation and the C++ implementation are in a state of readiness.

Has the vote started? Can anyone with visibility update me on the status?

benibus · 2023-10-04T17:25:50Z

@anjakefala Agreed that everything seems to be in place. I'll be starting the vote on the ML later today.

anjakefala · 2023-10-05T17:37:40Z

@pitrou @emkornfield @gszadovszky @JFinis @julienledem @shangxinli

The vote has been started by @benibus here: https://lists.apache.org/thread/gyvqcx9ssxkjlrwogqwy7n4z6ofdm871 Please chime in! I would also appreciate anyone forwarding the vote to the private listserv.

benibus · 2023-10-16T18:34:45Z

@pitrou @gszadovszky @julienledem

Given that the vote for this has just passed, I believe we should be good to merge this now? (pending a final review pass, of course)

wgtmac · 2023-10-17T01:08:10Z

@pitrou @gszadovszky @julienledem

Given that the vote for this has just passed, I believe we should be good to merge this now? (pending a final review pass, of course)

Should we merge the PR in parquet-format first? My point is that it would be weird if this change commits with an unreleased and even uncommitted change of parquet.thrift.

anjakefala · 2023-10-17T02:58:13Z

@wgtmac

Should we merge the PR in parquet-format first? My point is that it would be weird if this change commits with an unreleased and even uncommitted change of parquet.thrift.

This is the parquet-format PR!

There are too many PRs. xD

wgtmac · 2023-10-17T03:22:40Z

@wgtmac

Should we merge the PR in parquet-format first? My point is that it would be weird if this change commits with an unreleased and even uncommitted change of parquet.thrift.

This is the parquet-format PR!

There are too many PRs. xD

My bad! I got lost in these PRs.

gszadovszky

I've suggested the name FLOAT_16 in the vote, like we already have logical types INT_8 etc. But this is not a strong opinion; I am fine with it as is.

I agree with @emkornfield that we should allow the encoding BYTE_STREAM_SPLIT to be used for this new logical type. It is fine to handle it separately, though.

pitrou · 2023-10-17T07:02:24Z

I agree with @emkornfield that we should allow the encoding BYTE_STREAM_SPLIT to be used for this new logical type. It is fine to handle it separately, though.

I would contend that perhaps BYTE_STREAM_SPLIT wouldn't yield very interesting results on FLOAT16. It would be interesting to get numbers.

benibus · 2023-10-17T19:38:53Z

I've suggested the name FLOAT_16 in the vote like we already have logical types INT_8 etc

I think this was only the convention for legacy ConvertedType enums. We could theoretically deviate from that since there are no sized integral logical types (they're all rolled into INTEGER/IntType).

As for BYTE_STREAM_SPLIT, my feeling is that we'll probably want it (for parity with FLOAT/DOUBLE, at least), but it could be added as a follow-up - along with an implementation + benchmarks, if necessary. There's also some ambiguity about whether to support BSS for FixedLenByteArray generally, which may warrant a separate discussion.

anjakefala · 2023-10-24T21:13:19Z

@gszadovszky What is the merging process once it has approval and passed voting? =)

gszadovszky · 2023-10-25T07:04:51Z

@benibus, could you officially close the vote on the mailing list so it is clear that it has passed?
@anjakefala, since we have 3 approvals already on this PR, any committer can push it. I would wait for the official closing of the vote, though.

benibus · 2023-10-26T19:32:38Z

For the record, I've announced the vote's passing in the original ML thread itself (apologies if the [RESULT] thread wasn't sufficient).

gszadovszky · 2023-10-27T07:00:19Z

Sorry, @benibus. My bad. Thank you for managing the vote!
I'm pushing this PR...

gszadovszky · 2023-10-27T07:06:05Z

@anjakefala, do you have a jira user so I can assign it to you?

anjakefala · 2023-10-27T18:03:17Z

I really appreciate everyone who took time out of their lives to give this PR attention! :)) Thanks for the final merge @gszadovszky! And yes, my apache arrow JIRA handle is the same as github @anjakefala.

### Rationale for this change

There is an active proposal for a Float16 logical type in Parquet (apache/parquet-format#184) with C++/Python implementations in progress (#36073), so we should add one for Go as well.

### What changes are included in this PR?

- [x] Adds `LogicalType` definitions and methods for `Float16`
- [x] Adds support for `Float16` column statistics and comparators
- [x] Adds support for interchange between Parquet and Arrow's half-precision float

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes

* Closes: #37582

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>

anjakefala mentioned this pull request Aug 26, 2022

ARROW-17464: [C++] Store/Read float16 values in Parquet as FixedSizeBinary apache/arrow#13947

Closed