-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use float to string kernel #9470
Use float to string kernel #9470
Conversation
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
build |
I filed a PR to fix the markdown failure #10029 |
docs/compatibility.md
Outdated
The GPU will use different precision than Java's toString method when converting floating-point data | ||
types to strings. The GPU uses a lowercase `e` prefix for an exponent while Spark uses uppercase | ||
`E`. As a result the computed string can differ from the default behavior in Spark. | ||
The GPU use [ryu](https://github.com/ulfjack/ryu) as the solution when converting floating-point data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: The Rapids Accelerator for Apache Spark uses a method based on ryu when converting floating point data type to string. ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@@ -304,7 +304,22 @@ def test_cast_array_to_string(data_gen, legacy): | |||
_assert_cast_to_string_equal( | |||
data_gen, | |||
{"spark.sql.legacy.castComplexTypesToString.enabled": legacy}) | |||
|
|||
|
|||
def test_cast_float_to_string(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not need to be an approximate float? How many DATAGEN_SEEDS have you tested with this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested 10 more times with length=200000 and they all passed.
The difference between ryu and jdk's float to string is that jdk sometimes keeps more precision, but ryu keeps it as short as possible. The results will be the same float when converted back. So I think it makes sense that we don't need an approximate float here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And also the float to string is fully matched to cpu but double to string is not. It is the reason to test float to string and double to string in two different ways.
Seems the compatibility doc for float to string is outdated, so I'm not sure if we are aware that double to string is not fully matched with cpu. Filed #10037
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
build |
This PR updates the float to string casting to a new kernel in JNI. It won't match Spark's behavior exactly, but the results are closer than the current version.
The JNI kernel is part of the format_number support, so I split it out as a subtask.
This PR uses Ryū: fast float-to-string conversion (PLDI'18) as the solution for casting float/double to string. The results differ from the output of Spark's in some cases: sometimes the output is shorter (which is arguably more accurate) and sometimes the output may differ in the precise digits output (e.g., see ulfjack/ryu#83).
In most cases, the result will match Spark's results, and in the cases where it does not, the values will match when we cast them back to float.
Depends on: NVIDIA/spark-rapids-jni#1508
There is some related discussion in format_number issue #9173.
Closes #4204
performance test results
50000000 random number generated by BigDataGen:
test code: