Use float to string kernel #9470

thirtiseven · 2023-10-18T10:58:39Z

This PR switches float to string casting to a new kernel in JNI. The output won't match Spark's behavior exactly, but the results are closer than with the current version.

The JNI kernel is part of the format_number support, so I split it out as a subtask.

This PR uses Ryū ("Ryū: fast float-to-string conversion", PLDI'18) as the solution for casting float/double to string. The results differ from Spark's output in some cases: sometimes the output is shorter (which is arguably more accurate), and sometimes the exact digits differ (e.g., see ulfjack/ryu#83).

In most cases, the result will match Spark's results, and in the cases where it does not, the values will match when we cast them back to float.
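The round-trip property described above can be illustrated with a minimal Python sketch. It does not use Ryū or the JNI kernel: it relies on CPython's `repr`, which also emits the shortest string that parses back to the same double, and compares it with a 17-significant-digit rendering. The helper names `shortest_and_long` and `same_value` are hypothetical.

```python
def shortest_and_long(x):
    """Return (shortest round-trip string, 17-significant-digit string) for a double."""
    shortest = repr(x)              # CPython prints the shortest string that round-trips
    long_form = format(x, ".17g")   # 17 significant digits always round-trip a double
    return shortest, long_form


def same_value(a, b):
    """True when two decimal strings parse to the same double."""
    return float(a) == float(b)


if __name__ == "__main__":
    for value in (0.1, 1.0 / 3.0, 2.5e-7):
        s, l = shortest_and_long(value)
        # Both renderings must parse back to the original value.
        assert float(s) == value and float(l) == value
        print(f"{s:<24} {l:<26} same value: {same_value(s, l)}")
```

The two strings can differ in text while still denoting the same floating-point value, which is the sense in which a shorter output is not a loss of information.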

Depends on: NVIDIA/spark-rapids-jni#1508

There is some related discussion in format_number issue #9173.

Closes #4204

Performance test results

50,000,000 random numbers generated by BigDataGen:

Test code:

spark.time(df.selectExpr("COUNT(cast(a as string)) as a", "COUNT(cast(a as string)) as b", "COUNT(cast(a as string)) as c", "COUNT(cast(a as string)) as d", "COUNT(cast(a as string)) as e").show())

spark.time(df.selectExpr("COUNT(cast(b as string)) as a", "COUNT(cast(b as string)) as b", "COUNT(cast(b as string)) as c", "COUNT(cast(b as string)) as d", "COUNT(cast(b as string)) as e").show())

| Data Type | GPU Time (ms) | CPU Time (ms) | Speed up |
|---|---|---|---|
| float | 282 | 5,221 | 18.51x |
| double | 379 | 14,638 | 38.62x |
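The speed-up figures are the CPU time divided by the GPU time. A small Python sketch, using only the timings reported above (the function name `speedup` is hypothetical), reproduces them:

```python
def speedup(cpu_ms, gpu_ms):
    """Ratio of CPU time to GPU time; larger means the GPU is faster."""
    return cpu_ms / gpu_ms


# (GPU ms, CPU ms) taken from the results reported in this PR.
timings_ms = {"float": (282, 5221), "double": (379, 14638)}

if __name__ == "__main__":
    for dtype, (gpu, cpu) in timings_ms.items():
        print(f"{dtype}: {speedup(cpu, gpu):.2f}x")
```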

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven · 2023-12-12T05:42:11Z

build

revans2 · 2023-12-12T16:27:21Z

I filed a PR to fix the markdown failure #10029

revans2 · 2023-12-12T16:30:07Z

docs/compatibility.md

-The GPU will use different precision than Java's toString method when converting floating-point data
-types to strings. The GPU uses a lowercase `e` prefix for an exponent while Spark uses uppercase
-`E`. As a result the computed string can differ from the default behavior in Spark.
+The GPU use [ryu](https://github.com/ulfjack/ryu) as the solution when converting floating-point data


nit: The Rapids Accelerator for Apache Spark uses a method based on ryu when converting floating point data type to string. ...

revans2 · 2023-12-12T16:31:13Z

integration_tests/src/main/python/cast_test.py

@@ -304,7 +304,22 @@ def test_cast_array_to_string(data_gen, legacy):
    _assert_cast_to_string_equal(
        data_gen,
        {"spark.sql.legacy.castComplexTypesToString.enabled": legacy})
-
+
+def test_cast_float_to_string():


This does not need to be an approximate float? How many DATAGEN_SEEDS have you tested with this?

I tested 10 more times with length=200000 and they all passed.

The difference between ryu and the JDK's float to string is that the JDK sometimes emits more digits, while ryu keeps the output as short as possible. Both strings give the same float when converted back, so I think it makes sense that we don't need an approximate float here.

Also, float to string fully matches the CPU, but double to string does not. That is why float to string and double to string are tested in two different ways.

The compatibility doc for float to string seems to be outdated, so I'm not sure whether we were aware that double to string does not fully match the CPU. Filed #10037.
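One way to check the claim above in a test is to compare CPU and GPU strings by parsed value rather than by text. The following is a minimal Python sketch, not the comparison used in `cast_test.py`; the helper names and the sample strings are hypothetical and are not actual Spark or GPU output.

```python
import math


def parsed_equal(cpu_str, gpu_str):
    """True when both strings parse to the same double (NaN matches NaN)."""
    a, b = float(cpu_str), float(gpu_str)
    if math.isnan(a) or math.isnan(b):
        return math.isnan(a) and math.isnan(b)
    return a == b


def find_mismatches(cpu_strings, gpu_strings):
    """Return the (cpu, gpu) pairs whose parsed values differ."""
    assert len(cpu_strings) == len(gpu_strings), "row counts differ"
    return [(c, g) for c, g in zip(cpu_strings, gpu_strings)
            if not parsed_equal(c, g)]


if __name__ == "__main__":
    # Illustrative values only: the second pair differs in text, not in value.
    cpu = ["1.0E10", "0.10000000000000001", "NaN", "-Infinity"]
    gpu = ["1.0E10", "0.1", "NaN", "-Infinity"]
    print(find_mismatches(cpu, gpu))  # an empty list means every row round-trips
```

Note that this comparison treats `-0.0` and `0.0` as equal, so a test that cares about the sign of zero would need an extra check.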

thirtiseven · 2023-12-13T23:42:20Z

build

thirtiseven added 4 commits October 18, 2023 09:51

remove config and call jni

bbc183d

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

add integration test

389bf67

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

revert config removal

7c76b00

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

revert unnecessary copyright update

1df934a

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven mentioned this pull request Oct 18, 2023

Adding float to string kernel NVIDIA/spark-rapids-jni#1508

Merged

thirtiseven added 2 commits October 20, 2023 09:50

change IT to test

826070b

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

update integration tests

cf230e3

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven self-assigned this Nov 6, 2023

update doc

7dfae1f

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven marked this pull request as ready for review November 6, 2023 10:36

thirtiseven marked this pull request as draft November 6, 2023 10:37

Merge branch 'branch-23.12' into float_to_string

9e2429b

thirtiseven changed the title ~~[WIP] Use float to string kernel~~ Use float to string kernel Nov 6, 2023

thirtiseven added 2 commits November 7, 2023 18:16

update tests

9d4d3e4

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

clean up

8f8a9ad

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

thirtiseven marked this pull request as ready for review November 7, 2023 10:22

thirtiseven changed the base branch from branch-23.12 to branch-24.02 November 22, 2023 13:40

sameerz added the feature request New feature or request label Dec 5, 2023

thirtiseven added 2 commits December 8, 2023 13:13

Merge branch 'branch-24.02' into float_to_string

81571c1

Merge branch 'branch-24.02' into float_to_string

ca289aa

thirtiseven requested a review from revans2 December 12, 2023 09:04

revans2 reviewed Dec 12, 2023

compatibility doc update

e6a69ef

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

revans2 approved these changes Dec 13, 2023

thirtiseven merged commit dd800a4 into NVIDIA:branch-24.02 Dec 14, 2023

thirtiseven deleted the float_to_string branch December 19, 2023 08:45

NvTimLiu mentioned this pull request Apr 11, 2024

Merge branch-24.04 into main NvTimLiu/spark-rapids-jni#6

Merged

NvTimLiu mentioned this pull request Apr 11, 2024

Update latest changelog [skip ci] #10683

Merged
