
[IMPR] fixed_point + cudf::binary_operation API Changes #7442

Closed
codereport opened this issue Feb 24, 2021 · 13 comments · Fixed by #7435
Labels: breaking (Breaking change) · improvement (Improvement / enhancement to an existing function) · libcudf (Affects libcudf (C++/CUDA) code)

codereport commented Feb 24, 2021

Brief Intro

Recently, @razajafri ran into some issues with the cudf::binary_operation API for fixed_point. We discussed it, and I said I would open a PR to clean up the API and add some errors; that initial PR is here: #7435

However, I talked with @jrhemstad and we decided on a slightly different direction, which is proposed below.

Current Behaviour

Currently, cudf::binary_operation takes a cudf::data_type output_type parameter specifying the data_type of the output column. The exception, however, is fixed_point, which currently ignores this parameter and internally computes the data_type (including the scale) for you. The one fixed_point binary operation that does require a cudf::data_type is TRUE_DIV, so that Spark and other bindings can get the division they require (with a specified data_type/scale).

Proposal

  1. For fixed_point + cudf::binary_operation + DIV, always use the cudf::data_type output_type parameter
  2. For fixed_point + cudf::binary_operation + TRUE_DIV, require that the columns/scalars provided as arguments (lhs and rhs) result in the specified data_type/scale
  3. Provide a convenience function (something like binary_operation_fixed_point_scale()) that will compute the "expected" scale given two input columns/scalars and a binary_operator

This will be a breaking change for all fixed_point + cudf::binary_operation.
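To make the proposed flow concrete, here is a minimal sketch of the intended usage. The helper's exact name and signature are assumptions taken from item 3 above ("something like"), not a finalized API:

```cpp
#include <cudf/binaryop.hpp>
#include <cudf/column/column_view.hpp>

#include <memory>

// Sketch only: assumes the convenience helper from item 3 ends up with a
// signature like
//   int32_t binary_operation_fixed_point_scale(cudf::binary_operator op,
//                                              int32_t left_scale,
//                                              int32_t right_scale);
std::unique_ptr<cudf::column> fixed_point_div(cudf::column_view const& lhs,
                                              cudf::column_view const& rhs)
{
  // Compute the "expected" result scale for DIV from the input scales...
  auto const scale = cudf::binary_operation_fixed_point_scale(
    cudf::binary_operator::DIV, lhs.type().scale(), rhs.type().scale());
  // ...and pass it explicitly; under the proposal the output_type parameter
  // is always honored for fixed_point.
  return cudf::binary_operation(
    lhs, rhs, cudf::binary_operator::DIV, cudf::data_type{lhs.type().id(), scale});
}
```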

Why We Think This is Better:

It is an improvement in a number of ways:

  1. The current API is very easy to use incorrectly. If you use fixed_point + cudf::binary_operation with any binary_operator other than TRUE_DIV and specify a scale, it will be ignored, which the user might not expect (see the sketch after this list)
  2. For fixed_point + cudf::binary_operation + TRUE_DIV, we arbitrarily shift the lhs in order to set the inputs up to produce the desired fixed_point scale. This can lead to integer overflow that would be avoidable if the user did the rescaling themselves with cudf::cast
  3. TRUE_DIV will be able to share the exact same code path as DIV; it is basically just an API restriction
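As a concrete illustration of item 1, here is a sketch of the misuse under the current behaviour (the internally computed scale mentioned in the comment is an example, not a guarantee):

```cpp
#include <cudf/binaryop.hpp>
#include <cudf/column/column_view.hpp>

#include <memory>

// Sketch of the current (pre-change) behaviour: for fixed_point operators
// other than TRUE_DIV, the requested output scale is silently ignored.
std::unique_ptr<cudf::column> add_ignores_requested_scale(
  cudf::column_view const& lhs, cudf::column_view const& rhs)
{
  auto sum = cudf::binary_operation(
    lhs, rhs, cudf::binary_operator::ADD,
    cudf::data_type{cudf::type_id::DECIMAL32, -4});  // scale -4 requested
  // sum->type().scale() is whatever libcudf computed internally
  // (e.g. based on the input scales), not necessarily -4.
  return sum;
}
```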
@codereport codereport added improvement Improvement / enhancement to an existing function breaking Breaking change labels Feb 24, 2021
@codereport codereport self-assigned this Feb 24, 2021
codereport commented:

@harrism @revans2 @sameerz

@jrhemstad jrhemstad added the libcudf Affects libcudf (C++/CUDA) code. label Feb 24, 2021

harrism commented Feb 25, 2021

This can lead to integer overflow that would be avoidable if the user did the rescaling themselves with cudf::cast.

I don't know enough to understand if you are saying this is a good or a bad thing. Can you elaborate?


codereport commented Feb 25, 2021

This can lead to integer overflow that would be avoidable if the user did the rescaling themselves with cudf::cast.

I don't know enough to understand if you are saying this is a good or a bad thing. Can you elaborate?

Say you have 100,000 / 200,000, where both numbers have scale = 0 and you want the TRUE_DIV result to have scale = -7. In our current implementation, we shift the lhs (aka 100,000) to scale = -7 so that when you do 100,000 / 200,000, the resulting scale will be -7 - 0 = -7. This involves multiplying 100,000 by 10^7, which will overflow int32_t. However, if as a user you know all of your rhs values are exactly representable at scale = 5, you could shift the lhs to scale = -2 and the rhs to scale = 5 and still get the resulting scale -2 - 5 = -7 without overflowing.

So giving the user this control is a good thing.
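A sketch of that user-controlled rescaling with cudf::cast under the proposed API, assuming DECIMAL32 columns holding the values from the example above:

```cpp
#include <cudf/binaryop.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/unary.hpp>

#include <memory>

// Divide lhs (100,000 at scale 0) by rhs (200,000 at scale 0) to get a
// quotient at scale -7 without the int32_t overflow described above.
std::unique_ptr<cudf::column> divide_to_scale_minus7(
  cudf::column_view const& lhs, cudf::column_view const& rhs)
{
  // 100,000 * 10^2 = 10,000,000 -> still fits in int32_t
  auto const lhs2 = cudf::cast(lhs, cudf::data_type{cudf::type_id::DECIMAL32, -2});
  // 200,000 / 10^5 = 2 -> exact, because the values are multiples of 10^5
  auto const rhs2 = cudf::cast(rhs, cudf::data_type{cudf::type_id::DECIMAL32, 5});
  // Resulting scale: -2 - 5 = -7, with no intermediate overflow
  return cudf::binary_operation(lhs2->view(), rhs2->view(),
                                cudf::binary_operator::DIV,
                                cudf::data_type{cudf::type_id::DECIMAL32, -7});
}
```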

razajafri commented:

Thanks for this change and a great example clarifying TRUE_DIV.

So TRUE_DIV will not take an output type parameter? Do we need two separate APIs, though? I'm not a C++ expert, but can we not honor the type/scale of the inputs if the output type isn't specified?


revans2 commented Feb 25, 2021

@razajafri are you talking about precision or scale for TRUE_DIV?

@codereport
In a divide, either TRUE_DIV or DIV, the output scale is partly determined by the precision of both inputs, but cudf has no knowledge of the true precision, just 32-bit vs 64-bit. How are you going to enforce the requirement for TRUE_DIV?

require that the columns/scalars provided as arguments (lhs and rhs) result in the specified data_type/scale

Are you going to crash if we run into an overflow/underflow issue related to the perceived precision of the current values?

razajafri commented:

@revans2 I meant scale, not precision, and was referring to @codereport's comment about TRUE_DIV.

codereport commented:

@razajafri

So TRUE_DIV will not take an output type parameter? Do we need two separate APIs, though? I'm not a C++ expert, but can we not honor the type/scale of the inputs if the output type isn't specified?

TRUE_DIV will take an output_type. We will just restrict the API such that you have to pass in column/scalar inputs that will yield that output_type. For example:

lhs scale = -3, rhs scale = -1, output_type scale = -2 // ok because -3 - (-1) = -2
lhs scale = -3, rhs scale =  0, output_type scale = -2 // not ok because -3 - 0 = -3 != -2
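Purely illustrative: one way such a restriction might be enforced inside libcudf (the function name and the choice to throw are assumptions, not the actual implementation):

```cpp
#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/error.hpp>

// Hypothetical enforcement of the TRUE_DIV restriction: for fixed_point
// division the natural result scale is lhs.scale - rhs.scale, so reject
// any output_type whose scale does not match it.
void validate_true_div(cudf::column_view const& lhs,
                       cudf::column_view const& rhs,
                       cudf::data_type output_type)
{
  auto const expected_scale = lhs.type().scale() - rhs.type().scale();
  CUDF_EXPECTS(expected_scale == output_type.scale(),
               "TRUE_DIV: input scales must yield the requested output scale");
}
```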

@revans2

In a divide, either TRUE_DIV or DIV, the output scale is partly determined by the precision of both inputs, but cudf has no knowledge of the true precision, just 32-bit vs 64-bit. How are you going to enforce the requirement for TRUE_DIV?

That will be the responsibility of the caller. If you know you want scale = -6, you can adjust your column/scalar scales using cudf::cast so that the resulting column will have scale = -6. Whether you have precision information or not shouldn't impact the caller's ability to make the necessary adjustments.

Are you going to crash if we run into an overflow/underflow issue related to the perceived precision of the current values?

No, our current behaviour is not to crash; it is just undefined behaviour, consistent with how integer overflow works in C++.


revans2 commented Feb 25, 2021

Hmm, I am trying to work this out with the ugliness that Spark is doing, and I think I can make it work. It is not the cleanest solution, but it should be doable. My main concern now is what kind of guarantees we have about overflow in TRUE_DIV. In the current code, Spark does some ugly things that mean we end up not doing the divide in a lot of cases when we probably could get away with it; I won't bug you with the details. Under this new system we might be able to do the divide in more cases, but I want to be sure about it.

codereport commented:

I'm not sure what you mean when you say

guarantees

int32_t / int64_t have their usual upper and lower bounds. We don't do any checks or give any warnings when there is overflow. However, you are definitely correct that you should be able to support more cases now that you have control of the rescaling yourself.


revans2 commented Feb 25, 2021

OK, guidelines then. I want to be able to fall back to the CPU in cases where we would get an overflow. I can try and figure it out myself if needed, but if I know the algorithm then it makes it simpler to work with.

codereport commented:

but if I know the algorithm then it makes it simpler to work with

Not sure if this answers your question (let me know if not and I can try again):
In libcudf terms (negative scale = decimal places, positive scale = multiples of 10^scale):

  • When rescaling to a scale greater than the current scale, you are dividing the underlying value by 10^(new scale - current scale)
  • When rescaling to a scale less than the current scale, you are multiplying the underlying value by 10^(current scale - new scale)

You need to make sure that the results of these computations don't overflow.
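A tiny self-contained illustration of those two rules, with plain int64_t standing in for fixed_point's underlying integer storage (not libcudf code):

```cpp
#include <cassert>
#include <cstdint>

// Rescale an unscaled integer value from one power-of-10 scale to another.
// A fixed_point value represents unscaled * 10^scale.
int64_t rescale(int64_t unscaled, int32_t from_scale, int32_t to_scale)
{
  auto pow10 = [](int32_t n) {
    int64_t p = 1;
    while (n-- > 0) p *= 10;
    return p;
  };
  // Rescaling to a greater scale divides (digits can be lost);
  // rescaling to a lesser scale multiplies (the result can overflow).
  return to_scale > from_scale ? unscaled / pow10(to_scale - from_scale)
                               : unscaled * pow10(from_scale - to_scale);
}

int main()
{
  assert(rescale(100000, 0, -2) == 10000000);  // 100,000 -> scale -2: multiply by 10^2
  assert(rescale(200000, 0, 5) == 2);          // 200,000 -> scale  5: divide by 10^5
}
```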

razajafri commented:

TRUE_DIV will take an output_type. We will just restrict the API such that you have to pass in column/scalar inputs that will yield that output_type. For example:

lhs scale = -3, rhs scale = -1, output_type scale = -2 // ok because -3 - (-1) = -2
lhs scale = -3, rhs scale =  0, output_type scale = -2 // not ok because -3 - 0 = -3 != -2

In the second case, will it throw an exception with DIV and TRUE_DIV?

Consider the following:

lhs scale = -3, rhs scale = -1, output_type scale = -2
lhs.div(rhs) = -2
lhs.true_div(rhs) = -2

If the outputs of both APIs are the same, then why do we need TRUE_DIV? The way I see it, it's just like any other DIV: if you provide an outputType, it will cast to that type regardless of overflow; if you don't provide an outputType, it will do a natural divide where the scales subtract, and return the answer regardless of overflow. Not sure if I'm oversimplifying it.

@revans2 do we still need TRUE_DIV now that DIV takes the outputType?

codereport commented:

I agree with @razajafri that there really is no need for TRUE_DIV, and it could just lead to confusion. I suggest removing it.

rapids-bot bot pushed a commit that referenced this issue Mar 12, 2021
This resolves #7442

Recently, while working with @razajafri on `fixed_point` binary ops, it became clear that `cudf::binary_operation` is breaking the "easy to use, **hard to misuse**" number-one design guideline. I knew about this but had slotted it as technical debt to be cleaned up later. Long story short, after discussions with @razajafri and @jrhemstad and comments on #7442, we will implement the following:

* [x] For `fixed_point` + `cudf::binary_operation` + `DIV` always **use** the `cudf::data_type output_type` parameter
* [x] ~~For `fixed_point` + `cudf::binary_operation` + `TRUE_DIV`, require that the columns/scalars provided as arguments (`lhs` and `rhs`) will result in the specified `data_type`/`scale`~~
* [x] Provide a convenience function (something like `binary_operation_fixed_point_scale()`) that will compute the "expected" scale given two input columns/scalars and a `binary_operator`
* [x] Remove `TRUE_DIV`
* [x] Add unit tests for different output data_types
* [x] Update Python/Cython 

**This will be a breaking change for all `fixed_point` + `cudf::binary_operation`.**

Authors:
  - Conor Hoekstra (@codereport)

Approvers:
  - Keith Kraus (@kkraus14)
  - Mike Wilson (@hyperbolic2346)

URL: #7435
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this issue Mar 25, 2021