[IMPR] `fixed_point` + `cudf::binary_operation` API Changes #7442

Comments
I don't know enough to understand if you are saying this is a good or a bad thing. Can you elaborate?
Say you have … So giving the user this control is a good thing.
Thanks for this change and a great example clarifying `TRUE_DIV`. So `TRUE_DIV` will not take an output type parameter? Do we need two separate APIs, though? I'm not a C++ expert, but can we not honor the type/scale of the inputs if the output type isn't specified?
@razajafri are you talking about precision or scale for `TRUE_DIV`? @codereport, are you going to crash if we run into an overflow/underflow issue related to the perceived precision of the current values?
@revans2 I meant scale, not precision, and was referring to @codereport's comment about `TRUE_DIV`.
lhs scale = -3, rhs scale = -1, output_type scale = -2 // ok because -3 - (-1) = -2
lhs scale = -3, rhs scale = 0, output_type scale = -2 // not ok because -3 - 0 = -3 != -2
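For concreteness, a minimal host-side sketch of that check (plain C++; the function name is hypothetical, not the cuDF API): for `TRUE_DIV` the only accepted output scale is `lhs_scale - rhs_scale`.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical helper illustrating the proposed TRUE_DIV restriction:
// the requested output scale must equal lhs_scale - rhs_scale.
void check_true_div_scale(int lhs_scale, int rhs_scale, int output_scale)
{
  if (lhs_scale - rhs_scale != output_scale)
    throw std::logic_error("TRUE_DIV: inputs do not produce the requested scale");
}

int main()
{
  check_true_div_scale(-3, -1, -2);   // ok:     -3 - (-1) == -2
  bool threw = false;
  try {
    check_true_div_scale(-3, 0, -2);  // not ok: -3 - 0 == -3 != -2
  } catch (std::logic_error const&) {
    threw = true;
  }
  assert(threw);
  return 0;
}
```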
That will be the responsibility of the caller. If you know you want …
No, our current behaviour is not to crash; it is just undefined behaviour, consistent with how integers work in C++.
Hmm, I am trying to work this out with the ugliness that Spark is doing, and I think I can make this work. It is not the cleanest solution, but it should be doable. My main concern now is what kind of guarantees we have about overflow in `TRUE_DIV`. For the current code, Spark does some ugly things that mean we end up not doing the divide in a lot of cases when we probably could get away with it; I won't bug you with the details. Under this new system we might be able to do the divide in more cases, but I want to be sure about it.
I'm not sure what you mean when you say …
OK. Guarantees: I want to be able to fall back to the CPU in cases where we would get an overflow. I can try to figure it out myself if needed, but if I know the algorithm it is simpler to work with.
Not sure if this answers your question (let me know if not and I can try again): …
You need to make sure that the results of these computations don't overflow.
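The computation the comment refers to was lost in extraction, but a minimal host-side sketch of the kind of rescale-then-divide arithmetic under discussion (hypothetical names, `int64_t` reps for brevity) shows where the overflow risk lives, namely in the power-of-10 shift of the lhs:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of base-10 fixed_point division at a requested scale.
// Dividing the unscaled reps directly yields scale lhs_scale - rhs_scale;
// any other desired scale requires shifting the lhs rep by a power of 10
// first, and that multiplication is where integer overflow can happen.
int64_t pow10(int32_t n)
{
  int64_t r = 1;
  while (n-- > 0) r *= 10;
  return r;
}

bool shifted_div(int64_t lhs_rep, int32_t lhs_scale,
                 int64_t rhs_rep, int32_t rhs_scale,
                 int32_t desired_scale, int64_t* out)
{
  if (rhs_rep == 0) return false;
  int32_t shift = (lhs_scale - rhs_scale) - desired_scale;
  if (shift < 0) return false;  // would drop digits from lhs; not handled here
  int64_t factor = pow10(shift);
  if (lhs_rep > INT64_MAX / factor || lhs_rep < INT64_MIN / factor)
    return false;  // the shift itself would overflow: the caller must rescale
  *out = (lhs_rep * factor) / rhs_rep;
  return true;
}

int main()
{
  // 1.5 (rep 15, scale -1) / 0.7 (rep 7, scale -1) at output scale -2:
  // shift = (-1 - -1) - (-2) = 2, so (15 * 100) / 7 = 214, i.e. 2.14.
  int64_t out = 0;
  assert(shifted_div(15, -1, 7, -1, -2, &out) && out == 214);
  return 0;
}
```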
In the second case, will it throw an exception? Consider the following …
If the outputs of both the APIs are the same, then why do we need `TRUE_DIV`? @revans2 do we still need `TRUE_DIV`?
I agree with @razajafri that there really is no need for `TRUE_DIV`.
This resolves #7442

Recently, while working with @razajafri on `fixed_point` binary ops, it became clear that `cudf::binary_operation` is breaking the "easy to use, **hard to misuse**" #1 design guideline. I knew about this, but I slotted it as technical debt to be cleaned up later. Long story short, after discussions with both @razajafri and @jrhemstad and comments on #7442, we will implement the following:

* [x] For `fixed_point` + `cudf::binary_operation` + `DIV`, always **use** the `cudf::data_type output_type` parameter
* [x] ~~For `fixed_point` + `cudf::binary_operation` + `TRUE_DIV`, require that the columns/scalars provided as arguments (`lhs` and `rhs`) will result in the specified `data_type`/`scale`~~
* [x] Provide a convenience function (something like `binary_operation_fixed_point_scale()`) that will compute the "expected" scale given two input columns/scalars and a `binary_operator`
* [x] Remove `TRUE_DIV`
* [x] Add unit tests for different output data_types
* [x] Update Python/Cython

**This will be a breaking change for all `fixed_point` + `cudf::binary_operation`.**

Authors:
- Conor Hoekstra (@codereport)

Approvers:
- Keith Kraus (@kkraus14)
- Mike Wilson (@hyperbolic2346)

URL: #7435
Brief Intro
Recently, @razajafri ran into some issues with the `cudf::binary_operation` API for `fixed_point`. We discussed, and I said I would open a PR to clean up the API and add some errors; that initial PR is here: #7435. However, I talked with @jrhemstad and we decided on a slightly different direction, which is proposed below.
Current Behaviour
Currently, `cudf::binary_operation` takes `cudf::data_type output_type` as a parameter specifying the `data_type` of the output column. However, the exception is `fixed_point`, which currently ignores this parameter and internally computes the `data_type` (including the scale) for you. There is one `fixed_point` binary operation that requires a `cudf::data_type`, and that is `TRUE_DIV`, so that Spark and other bindings can get the required division (with a specified `data_type`/`scale`).
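To make the call shape concrete, here is a minimal sketch using the cuDF test column wrappers (the headers and wrapper names are an assumption based on the 0.19-era codebase and may differ across releases):

```cpp
#include <cudf/binaryop.hpp>
#include <cudf/types.hpp>
#include <cudf_test/column_wrapper.hpp>

int main()
{
  // Two DECIMAL32 columns: 1.1, 2.2 (scale -1) and 0.5, 0.5 (scale -1).
  // The wrapper takes the unscaled integer reps plus a scale.
  cudf::test::fixed_point_column_wrapper<int32_t> lhs{{11, 22}, numeric::scale_type{-1}};
  cudf::test::fixed_point_column_wrapper<int32_t> rhs{{5, 5}, numeric::scale_type{-1}};

  // Under the current behaviour this output_type is ignored for fixed_point
  // (except TRUE_DIV); under the proposal it is always honoured.
  auto result = cudf::binary_operation(lhs,
                                       rhs,
                                       cudf::binary_operator::DIV,
                                       cudf::data_type{cudf::type_id::DECIMAL32, 0});
  return 0;
}
```

Proposal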
* For `fixed_point` + `cudf::binary_operation` + `DIV`, always use the `cudf::data_type output_type` parameter
* For `fixed_point` + `cudf::binary_operation` + `TRUE_DIV`, require that the columns/scalars provided as arguments (`lhs` and `rhs`) will result in the specified `data_type`/`scale`
* Provide a convenience function (something like `binary_operation_fixed_point_scale()`) that will compute the "expected" scale given two input columns/scalars and a `binary_operator` (see the sketch after this list)
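A sketch of what such a convenience function might compute, assuming the conventional decimal scale rules (minimum scale for add/subtract, sum for multiply, difference for divide); the switch below is illustrative, not the shipped implementation:

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>

#include <cudf/binaryop.hpp>

// Illustrative only: the "expected" output scale for a fixed_point binary op.
int32_t expected_fixed_point_scale(cudf::binary_operator op,
                                   int32_t lhs_scale,
                                   int32_t rhs_scale)
{
  switch (op) {
    case cudf::binary_operator::ADD:
    case cudf::binary_operator::SUB:
      return std::min(lhs_scale, rhs_scale);  // align to the finer scale
    case cudf::binary_operator::MUL:
      return lhs_scale + rhs_scale;           // scales add under multiplication
    case cudf::binary_operator::DIV:
      return lhs_scale - rhs_scale;           // scales subtract under division
    default:
      throw std::logic_error("unsupported fixed_point binary operator");
  }
}
```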
This will be a breaking change for all `fixed_point` + `cudf::binary_operation`.

Why We Think This is Better:
However, it is an improvement in a number of ways:

* Currently, if you use `fixed_point` + `cudf::binary_operation` + any `binary_operator` other than `TRUE_DIV` and specify a scale, it will be ignored, which the user might not expect
* Currently, for `fixed_point` + `cudf::binary_operation` + `TRUE_DIV`, we arbitrarily shift the `lhs` in order to set the inputs up to result in the desired `fixed_point` `scale`. This can lead to integer overflow that is avoidable if the user instead does the rescaling with `cudf::cast`.
* `TRUE_DIV` will be able to share the exact same code path as `DIV`; it is basically just an API restriction (see the sketch after this list)
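To illustrate that last point, a hypothetical wrapper (the `true_div` name is made up) that enforces the restriction and then falls through to the ordinary `DIV` path:

```cpp
#include <memory>

#include <cudf/binaryop.hpp>
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/utilities/error.hpp>

// Illustrative wrapper: TRUE_DIV == "DIV, but only at the scale the inputs
// naturally produce". After the scale check it shares DIV's code path.
std::unique_ptr<cudf::column> true_div(cudf::column_view const& lhs,
                                       cudf::column_view const& rhs,
                                       cudf::data_type output_type)
{
  CUDF_EXPECTS(lhs.type().scale() - rhs.type().scale() == output_type.scale(),
               "TRUE_DIV: lhs/rhs scales do not produce the requested scale");
  return cudf::binary_operation(lhs, rhs, cudf::binary_operator::DIV, output_type);
}
```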