Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use float to string kernel #9470

Merged
merged 13 commits into from
Dec 14, 2023
Merged

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented Oct 18, 2023

This PR updates the float to string casting to a new kernel in JNI. It won't match Spark's behavior exactly, but the results are closer than the current version.

The JNI kernel is part of the format_number support, so I split it out as a subtask.

This PR uses Ryū: fast float-to-string conversion (PLDI'18) as the solution for casting float/double to string. The results differ from the output of Spark's in some cases: sometimes the output is shorter (which is arguably more accurate) and sometimes the output may differ in the precise digits output (e.g., see ulfjack/ryu#83).

In most cases, the result will match Spark's results, and in the cases where it does not, the values will match when we cast them back to float.

Depends on: NVIDIA/spark-rapids-jni#1508

There is some related discussion in format_number issue #9173.

Closes #4204

performance test results

50000000 random number generated by BigDataGen:

test code:

spark.time(df.selectExpr("COUNT(cast(a as string)) as a", "COUNT(cast(a as string)) as b", "COUNT(cast(a as string)) as c", "COUNT(cast(a as string)) as d", "COUNT(cast(a as string)) as e").show())

spark.time(df.selectExpr("COUNT(cast(b as string)) as a", "COUNT(cast(b as string)) as b", "COUNT(cast(b as string)) as c", "COUNT(cast(b as string)) as d", "COUNT(cast(b as string)) as e").show())
Data Type GPU Time (ms) CPU Time (ms) Speed up
float 282 5,221 18.51x
double 379 14,638 38.62x

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven self-assigned this Nov 6, 2023
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven marked this pull request as ready for review November 6, 2023 10:36
@thirtiseven thirtiseven marked this pull request as draft November 6, 2023 10:37
@thirtiseven thirtiseven changed the title [WIP] Use float to string kernel Use float to string kernel Nov 6, 2023
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven marked this pull request as ready for review November 7, 2023 10:22
@thirtiseven thirtiseven changed the base branch from branch-23.12 to branch-24.02 November 22, 2023 13:40
@sameerz sameerz added the feature request New feature or request label Dec 5, 2023
@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven thirtiseven requested a review from revans2 December 12, 2023 09:04
@revans2
Copy link
Collaborator

revans2 commented Dec 12, 2023

I filed a PR to fix the markdown failure #10029

The GPU will use different precision than Java's toString method when converting floating-point data
types to strings. The GPU uses a lowercase `e` prefix for an exponent while Spark uses uppercase
`E`. As a result the computed string can differ from the default behavior in Spark.
The GPU use [ryu](https://github.com/ulfjack/ryu) as the solution when converting floating-point data
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The Rapids Accelerator for Apache Spark uses a method based on ryu when converting floating point data type to string. ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -304,7 +304,22 @@ def test_cast_array_to_string(data_gen, legacy):
_assert_cast_to_string_equal(
data_gen,
{"spark.sql.legacy.castComplexTypesToString.enabled": legacy})


def test_cast_float_to_string():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not need to be an approximate float? How many DATAGEN_SEEDS have you tested with this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested 10 more times with length=200000 and they all passed.

The difference between ryu and jdk's float to string is that jdk sometimes keeps more precision, but ryu keeps it as short as possible. The results will be the same float when converted back. So I think it makes sense that we don't need an approximate float here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also the float to string is fully matched to cpu but double to string is not. It is the reason to test float to string and double to string in two different ways.

Seems the compatibility doc for float to string is outdated, so I'm not sure if we are aware that double to string is not fully matched with cpu. Filed #10037

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven thirtiseven merged commit dd800a4 into NVIDIA:branch-24.02 Dec 14, 2023
36 of 37 checks passed
@thirtiseven thirtiseven deleted the float_to_string branch December 19, 2023 08:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

casting double to string does not match Spark
3 participants