Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] logical_cast from FLOAT32 to INT32 and FLOAT64 to INT64 or bit_cast for the same #6834

Closed
revans2 opened this issue Nov 23, 2020 · 16 comments · Fixed by #7373
Closed
Assignees
Labels
feature request New feature or request good first issue Good for newcomers Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Copy link
Contributor

revans2 commented Nov 23, 2020

Spark is really inconsistent in how it handles some values like -0.0 vs 0.0 and the various NaN values that are possible. I don't expect cuDF to be aware of any of this, but I would like the ability to work around it in some cases by treating the floating point value as if it were just a bunch of bits. To me logical_cast feels like the right place to do this, but floating point values are not allowed to be cast to ints.

* nor can an INT32 columm be logically cast to a FLOAT32 since what the bits represent differs.

If we could reduce that restriction that would be great. If you would rather call it something like bit_cast so it is more in line with what C++ does that is fine too. In the worst case I can just implement something myself in the JNI layer, but I thought it might be good to ask here just in case the python side felt that they wanted this too.

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify Java Affects Java cuDF API. Spark Functionality that helps Spark RAPIDS labels Nov 23, 2020
@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. good first issue Good for newcomers and removed Needs Triage Need team to review and classify labels Dec 3, 2020
@revans2 revans2 changed the title [FEA] logical_cast from FLOAT32 ot INT32 and FLOAT64 to INT64 or bit_cast for the same [FEA] logical_cast from FLOAT32 to INT32 and FLOAT64 to INT64 or bit_cast for the same Dec 16, 2020
@ttnghia ttnghia self-assigned this Feb 8, 2021
@nvdbaranec
Copy link
Contributor

nvdbaranec commented Feb 10, 2021

@jrhemstad Any objections to adding a bit_cast function to column_view.hpp? I figure it'd follow the same restrictions as
https://en.cppreference.com/w/cpp/numeric/bit_cast

@jrhemstad
Copy link
Contributor

jrhemstad commented Feb 10, 2021

logical_cast was intended to be a moral equivalent of bit_cast (I wouldn't even be opposed to renaming). I can't recall why we thought it was a good idea to disallow int32 to float32, but I think it's fine relaxing that requirement.

Looking back at the original issue description the prohibition against integral <-> floating point of the same size wasn't mentioned: #4794. So I'm guessing this was just an oversight.

@ttnghia
Copy link
Contributor

ttnghia commented Feb 10, 2021

Currently, logical_cast allows casting between timestamp types and other integer types. If we allow casting between integer and float/double, does that mean we also allow casting between timestamp types and float/double too?

From my point of view, just allow casting between any types that have the same data size.

@jrhemstad
Copy link
Contributor

I think as @nvdbaranec mentioned we can just adopt the rules exactly from bit_cast:

template <class To, class From>
typename std::enable_if_t<
    sizeof(To) == sizeof(From) &&
    std::is_trivially_copyable_v<From> &&
    std::is_trivially_copyable_v<To>,
    To>

@ttnghia
Copy link
Contributor

ttnghia commented Feb 11, 2021

@jrhemstad @nvdbaranec @codereport

In the case of numeric::fixed_point data types, which have two data members (_value and _scale), their sizes are twice the size of their data value. Take numeric::fixed_point<int> for example, its size is 8 bytes while int32 size is 4 bytes. So:

  • Should we allow to cast numeric::fixed_point<int> to int64 and NOT allow to cast to int32?
  • std::is_trivially_copyable_v return false for this type. Should we add support for it (like adding cudf::is_trivially_copyable which returns true for numeric::fixed_point types)?

@jrhemstad
Copy link
Contributor

  • Should we allow to cast numeric::fixed_point<int> to int64 and NOT allow to cast to int32

No, bit_cast should just take the value member.

std::is_trivially_copyable_v return false for this type. Should we add support for it (like adding cudf::is_trivially_copyable which returns true for numeric::fixed_point types)?

Providing a user-defined overload for is_trivially_copyable is UB. I think the fact that fixed_point isn't trivially copyable is an oversight that we'll need to resolve.

@codereport
Copy link
Contributor

@ttnghia I setup a unit test

EXPECT_TRUE(std::is_trivially_copyable<decimal32>::value);

and it passed. Do you mind sharing the code that returned false?

@ttnghia
Copy link
Contributor

ttnghia commented Feb 11, 2021

Ah, I'm sorry, I did something wrong with my test thus it returned false. Yeah, your test is correct, and the result should be true.

@jrhemstad I'm not very clear. If we allow casting the column_view of 8-bytes elements (fixed_point<int>) into a column_view of 4-bytes elements (int32) then how can we index into the elements correctly? You call data() and it returns an int* pointer. Indexing using that pointer will point to the wrong place.

I think @revans2 was right in the first place: we should not allow doing logical_cast between types having different sizes. So, numeric::fixed_point<int> should not be allowed to logical_cast into int. If you want to cast, then use other types of cast (such as static_cast).

@jrhemstad
Copy link
Contributor

@jrhemstad I'm not very clear. If we allow casting the column_view of 8-bytes elements (fixed_point<int>) into a column_view of 4-bytes elements (int32) then how can we index into the elements correctly? You call data() and it returns an int* pointer. Indexing using that pointer will point to the wrong place.

I think @revans2 was right in the first place: we should not allow doing logical_cast between types having different sizes. So, numeric::fixed_point<int> should not be allowed to logical_cast into int. If you want to cast, then use other types of cast (such as static_cast).

Your confusion is understandable. For the DECIMAL type columns we do something unique. Since the scale is common among all of the elements in the column, we don't actually store it with every element as it would be a waste of memory. We store it in the column's data_type metadata. So what is stored in each element of a decimal column is just it's integer representation, e.g., for decimal32 that is int32_t. So when you bit_cast a decimal column, you're really just bitcasting it's integer representation.

@revans2
Copy link
Contributor Author

revans2 commented Feb 11, 2021

@ttnghia

To be clear I don't want a logical_cast between different sized values. What I want is to be able to take a FLOAT32, and interpret that data as a INT32 so I can compare it to the bit representation of -0.0 as defined by the ieee specification 0x80000000. That way I can fix a long standing bug in the Spark plugin's sorting implementation because Java in some APIs for what ever stupid reason has -0.0 < 0.0 and this bleeds into Spark as well.

I would do the same thing for FLOAT64 separately doing a logical_cast to INT64 and using a different mask. There are also odd corner cases with NaNs that I would want to do similar things with. I don't want to put the ugliness of this directly into cudf, so all I really want is a relief valve that I can then use to work around it.

@nvdbaranec
Copy link
Contributor

So when you bit_cast a decimal column, you're really just bitcasting it's integer representation.

@jrhemstad Just to clarify: are we going with logical_cast for this or bit_cast?

@jrhemstad
Copy link
Contributor

So when you bit_cast a decimal column, you're really just bitcasting it's integer representation.

@jrhemstad Just to clarify: are we going with logical_cast for this or bit_cast?

I kind of like the idea of renaming logical_cast to bit_cast. It more closely captures the spirit of what it does.

@ttnghia
Copy link
Contributor

ttnghia commented Feb 11, 2021

Currently, I'm dealing with unit tests for this. Previously, we cast only between integer types thus we can run a unit test to compare the values of data elements before and after casting. However, now we can cast between integer and other types such as int32 and float32, or int8 and bool. So, I'm thinking of several ways to handle the unit tests:

  • Drop those tests completely.
  • Cast from column1 (type1) to column2 (type2), then for each element val2 of column2, call val1 = *reinterpret_cast<type1*>(&val2) and compare with the value from colum1.

How do you think? Since the underlying data remains unchanged, comparing the same value but viewed in different types does not make much sense.

@revans2
Copy link
Contributor Author

revans2 commented Feb 11, 2021

@ttnghia Why not use std::bit_cast in the test? That is the goal after all.

@ttnghia
Copy link
Contributor

ttnghia commented Feb 11, 2021

That's a good idea! However, bit_cast is from C++20 so we can't use it, unfortunately.

@jrhemstad
Copy link
Contributor

One easy way to test it might be just to test that you can round trip from one type, to another, and back to the original and assert that the original column and the roundtripped column are equal.

rapids-bot bot pushed a commit that referenced this issue Mar 3, 2021
…7373)

This MR resolves #6834. It renames `logical_cast` to `bit_cast` which performs casting following the rules of `std::bit_cast`: the two types (FromType and ToType) must have the same size and are trivially copyable.

Authors:
  - Nghia Truong (@ttnghia)

Approvers:
  - @nvdbaranec
  - David (@davidwendt)
  - Keith Kraus (@kkraus14)

URL: #7373
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request good first issue Good for newcomers Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants