[IMPR] `fixed_point` + `cudf::binary_operation` API Changes #7442

Comments
I don't know enough to understand if you are saying this is a good or a bad thing. Can you elaborate?
Say you have … So giving the user this control is a good thing.
Thanks for this change and a great example clarifying `TRUE_DIV`. So `TRUE_DIV` will not take an output type parameter? Do we need two separate APIs, though? I'm not a C++ expert, but can we not honor the type/scale of the inputs if the output type isn't specified?
@razajafri are you talking about precision or scale for `TRUE_DIV`? @codereport, are you going to crash if we run into an overflow/underflow issue related to the perceived precision of the current values?
@revans2 I meant scale, not precision, and was referring to @codereport's comment about `TRUE_DIV`.
lhs scale = -3, rhs scale = -1, output_type scale = -2 // ok because -3 - (-1) = -2
lhs scale = -3, rhs scale = 0, output_type scale = -2 // not ok because -3 - 0 = -3 != -2
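For concreteness, a minimal host-side sketch of that check (plain C++; the function name is hypothetical, not the cuDF API): for `TRUE_DIV` the only accepted output scale is `lhs_scale - rhs_scale`.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical helper illustrating the proposed TRUE_DIV restriction:
// the requested output scale must equal lhs_scale - rhs_scale.
void check_true_div_scale(int lhs_scale, int rhs_scale, int output_scale)
{
  if (lhs_scale - rhs_scale != output_scale)
    throw std::logic_error("TRUE_DIV: inputs do not produce the requested scale");
}

int main()
{
  check_true_div_scale(-3, -1, -2);   // ok:     -3 - (-1) == -2
  bool threw = false;
  try {
    check_true_div_scale(-3, 0, -2);  // not ok: -3 - 0 == -3 != -2
  } catch (std::logic_error const&) {
    threw = true;
  }
  assert(threw);
  return 0;
}
```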
That will be the responsibility of the caller. If you know you want …
No, our current behaviour is not to crash; it is just undefined behaviour, consistent with how integers work in C++.
Hmm, I am trying to work this out with the ugliness that Spark is doing, and I think I can make this work. It is not the cleanest solution, but it should be doable. My main concern now is what kind of guarantees we have about overflow in `TRUE_DIV`. For the current code, Spark does some ugly things that mean we end up not doing the divide in a lot of cases when we probably could get away with it; I won't bug you with the details. Under this new system we might be able to do the divide in more cases, but I want to be sure about it.
I'm not sure what you mean when you say …
OK. Guarantees: I want to be able to fall back to the CPU in cases where we would get an overflow. I can try to figure it out myself if needed, but if I know the algorithm it is simpler to work with.
Not sure if this answers your question (let me know if not and I can try again): …
You need to make sure that the results of these computations don't overflow.
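The computation the comment refers to was lost in extraction, but a minimal host-side sketch of the kind of rescale-then-divide arithmetic under discussion (hypothetical names, `int64_t` reps for brevity) shows where the overflow risk lives, namely in the power-of-10 shift of the lhs:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of base-10 fixed_point division at a requested scale.
// Dividing the unscaled reps directly yields scale lhs_scale - rhs_scale;
// any other desired scale requires shifting the lhs rep by a power of 10
// first, and that multiplication is where integer overflow can happen.
int64_t pow10(int32_t n)
{
  int64_t r = 1;
  while (n-- > 0) r *= 10;
  return r;
}

bool shifted_div(int64_t lhs_rep, int32_t lhs_scale,
                 int64_t rhs_rep, int32_t rhs_scale,
                 int32_t desired_scale, int64_t* out)
{
  if (rhs_rep == 0) return false;
  int32_t shift = (lhs_scale - rhs_scale) - desired_scale;
  if (shift < 0) return false;  // would drop digits from lhs; not handled here
  int64_t factor = pow10(shift);
  if (lhs_rep > INT64_MAX / factor || lhs_rep < INT64_MIN / factor)
    return false;  // the shift itself would overflow: the caller must rescale
  *out = (lhs_rep * factor) / rhs_rep;
  return true;
}

int main()
{
  // 1.5 (rep 15, scale -1) / 0.7 (rep 7, scale -1) at output scale -2:
  // shift = (-1 - -1) - (-2) = 2, so (15 * 100) / 7 = 214, i.e. 2.14.
  int64_t out = 0;
  assert(shifted_div(15, -1, 7, -1, -2, &out) && out == 214);
  return 0;
}
```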
In the second case, will it throw an exception? Consider the following …
If the outputs of both the APIs are the same, then why do we need `TRUE_DIV`? @revans2 do we still need `TRUE_DIV`?
I agree with @razajafri that there really is no need for `TRUE_DIV`.
This resolves #7442

Recently, while working with @razajafri on `fixed_point` binary ops, it became clear that `cudf::binary_operation` is breaking the "easy to use, **hard to misuse**" #1 design guideline. I knew about this, but I slotted it as technical debt to be cleaned up later. Long story short, after discussions with both @razajafri and @jrhemstad and comments on #7442, we will implement the following:

* [x] For `fixed_point` + `cudf::binary_operation` + `DIV`, always **use** the `cudf::data_type output_type` parameter
* [x] ~~For `fixed_point` + `cudf::binary_operation` + `TRUE_DIV`, require that the columns/scalars provided as arguments (`lhs` and `rhs`) will result in the specified `data_type`/`scale`~~
* [x] Provide a convenience function (something like `binary_operation_fixed_point_scale()`) that will compute the "expected" scale given two input columns/scalars and a `binary_operator`
* [x] Remove `TRUE_DIV`
* [x] Add unit tests for different output data_types
* [x] Update Python/Cython

**This will be a breaking change for all `fixed_point` + `cudf::binary_operation`.**

Authors:
- Conor Hoekstra (@codereport)

Approvers:
- Keith Kraus (@kkraus14)
- Mike Wilson (@hyperbolic2346)

URL: #7435
Brief Intro
Recently, @razajafri ran into some issues with the `cudf::binary_operation` API for `fixed_point`. We discussed, and I said I would open a PR to clean up the API and add some errors; that initial PR is here: #7435. However, I talked with @jrhemstad and we decided on a slightly different direction, which is proposed below.
Current Behaviour
Currently, `cudf::binary_operation` takes `cudf::data_type output_type` as a parameter specifying the `data_type` of the output column. However, the exception is `fixed_point`, which currently ignores this parameter and internally computes the `data_type` (including the scale) for you. There is one `fixed_point` binary operation that requires a `cudf::data_type`, and that is `TRUE_DIV`, so that Spark and other bindings can get the required division (with a specified `data_type`/`scale`).
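To make the call shape concrete, here is a minimal sketch using the cuDF test column wrappers (the headers and wrapper names are an assumption based on the 0.19-era codebase and may differ across releases):

```cpp
#include <cudf/binaryop.hpp>
#include <cudf/types.hpp>
#include <cudf_test/column_wrapper.hpp>

int main()
{
  // Two DECIMAL32 columns: 1.1, 2.2 (scale -1) and 0.5, 0.5 (scale -1).
  // The wrapper takes the unscaled integer reps plus a scale.
  cudf::test::fixed_point_column_wrapper<int32_t> lhs{{11, 22}, numeric::scale_type{-1}};
  cudf::test::fixed_point_column_wrapper<int32_t> rhs{{5, 5}, numeric::scale_type{-1}};

  // Under the current behaviour this output_type is ignored for fixed_point
  // (except TRUE_DIV); under the proposal it is always honoured.
  auto result = cudf::binary_operation(lhs,
                                       rhs,
                                       cudf::binary_operator::DIV,
                                       cudf::data_type{cudf::type_id::DECIMAL32, 0});
  return 0;
}
```

Proposal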
* For `fixed_point` + `cudf::binary_operation` + `DIV`, always use the `cudf::data_type output_type` parameter
* For `fixed_point` + `cudf::binary_operation` + `TRUE_DIV`, require that the columns/scalars provided as arguments (`lhs` and `rhs`) will result in the specified `data_type`/`scale`
* Provide a convenience function (something like `binary_operation_fixed_point_scale()`) that will compute the "expected" scale given two input columns/scalars and a `binary_operator` (see the sketch after this list)
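A sketch of what such a convenience function might compute, assuming the conventional decimal scale rules (minimum scale for add/subtract, sum for multiply, difference for divide); the switch below is illustrative, not the shipped implementation:

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>

#include <cudf/binaryop.hpp>

// Illustrative only: the "expected" output scale for a fixed_point binary op.
int32_t expected_fixed_point_scale(cudf::binary_operator op,
                                   int32_t lhs_scale,
                                   int32_t rhs_scale)
{
  switch (op) {
    case cudf::binary_operator::ADD:
    case cudf::binary_operator::SUB:
      return std::min(lhs_scale, rhs_scale);  // align to the finer scale
    case cudf::binary_operator::MUL:
      return lhs_scale + rhs_scale;           // scales add under multiplication
    case cudf::binary_operator::DIV:
      return lhs_scale - rhs_scale;           // scales subtract under division
    default:
      throw std::logic_error("unsupported fixed_point binary operator");
  }
}
```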
This will be a breaking change for all `fixed_point` + `cudf::binary_operation`.

Why We Think This is Better:
However, it is an improvement in a number of ways:

* Currently, if you use `fixed_point` + `cudf::binary_operation` + any `binary_operator` other than `TRUE_DIV` and specify a scale, it will be ignored, which the user might not expect
* Currently, for `fixed_point` + `cudf::binary_operation` + `TRUE_DIV`, we arbitrarily shift the `lhs` in order to set the inputs up to result in the desired `fixed_point` `scale`. This can lead to integer overflow that is avoidable if the user instead does the rescaling with `cudf::cast`.
* `TRUE_DIV` will be able to share the exact same code path as `DIV`; it is basically just an API restriction (see the sketch after this list)
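To illustrate that last point, a hypothetical wrapper (the `true_div` name is made up) that enforces the restriction and then falls through to the ordinary `DIV` path:

```cpp
#include <memory>

#include <cudf/binaryop.hpp>
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/utilities/error.hpp>

// Illustrative wrapper: TRUE_DIV == "DIV, but only at the scale the inputs
// naturally produce". After the scale check it shares DIV's code path.
std::unique_ptr<cudf::column> true_div(cudf::column_view const& lhs,
                                       cudf::column_view const& rhs,
                                       cudf::data_type output_type)
{
  CUDF_EXPECTS(lhs.type().scale() - rhs.type().scale() == output_type.scale(),
               "TRUE_DIV: lhs/rhs scales do not produce the requested scale");
  return cudf::binary_operation(lhs, rhs, cudf::binary_operator::DIV, output_type);
}
```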