[FEA] logical_cast from FLOAT32 to INT32 and FLOAT64 to INT64 or bit_cast for the same #6834

revans2 · 2020-11-23T17:05:36Z

Spark is really inconsistent in how it handles some values like -0.0 vs 0.0 and the various NaN values that are possible. I don't expect cuDF to be aware of any of this, but I would like the ability to work around it in some cases by treating the floating point value as if it were just a bunch of bits. To me logical_cast feels like the right place to do this, but floating point values are not allowed to be cast to ints.

cudf/cpp/include/cudf/column/column_view.hpp

Line 580 in dbeac89

    
            * nor can an INT32 columm be logically cast to a FLOAT32 since what the bits represent differs.

If we could reduce that restriction that would be great. If you would rather call it something like bit_cast so it is more in line with what C++ does that is fine too. In the worst case I can just implement something myself in the JNI layer, but I thought it might be good to ask here just in case the python side felt that they wanted this too.

The text was updated successfully, but these errors were encountered:

nvdbaranec · 2021-02-10T17:10:39Z

@jrhemstad Any objections to adding a bit_cast function to column_view.hpp? I figure it'd follow the same restrictions as
https://en.cppreference.com/w/cpp/numeric/bit_cast

jrhemstad · 2021-02-10T17:37:58Z

logical_cast was intended to be a moral equivalent of bit_cast (I wouldn't even be opposed to renaming). I can't recall why we thought it was a good idea to disallow int32 to float32, but I think it's fine relaxing that requirement.

Looking back at the original issue description the prohibition against integral <-> floating point of the same size wasn't mentioned: #4794. So I'm guessing this was just an oversight.

ttnghia · 2021-02-10T18:31:06Z

Currently, logical_cast allows casting between timestamp types and other integer types. If we allow casting between integer and float/double, does that mean we also allow casting between timestamp types and float/double too?

From my point of view, just allow casting between any types that have the same data size.

jrhemstad · 2021-02-10T19:15:15Z

I think as @nvdbaranec mentioned we can just adopt the rules exactly from bit_cast:

template <class To, class From>
typename std::enable_if_t<
    sizeof(To) == sizeof(From) &&
    std::is_trivially_copyable_v<From> &&
    std::is_trivially_copyable_v<To>,
    To>

ttnghia · 2021-02-11T00:10:21Z

@jrhemstad @nvdbaranec @codereport

In the case of numeric::fixed_point data types, which have two data members (_value and _scale), their sizes are twice the size of their data value. Take numeric::fixed_point<int> for example, its size is 8 bytes while int32 size is 4 bytes. So:

Should we allow to cast numeric::fixed_point<int> to int64 and NOT allow to cast to int32?
std::is_trivially_copyable_v return false for this type. Should we add support for it (like adding cudf::is_trivially_copyable which returns true for numeric::fixed_point types)?

jrhemstad · 2021-02-11T00:16:52Z

Should we allow to cast numeric::fixed_point<int> to int64 and NOT allow to cast to int32

No, bit_cast should just take the value member.

std::is_trivially_copyable_v return false for this type. Should we add support for it (like adding cudf::is_trivially_copyable which returns true for numeric::fixed_point types)?

Providing a user-defined overload for is_trivially_copyable is UB. I think the fact that fixed_point isn't trivially copyable is an oversight that we'll need to resolve.

codereport · 2021-02-11T03:36:51Z

@ttnghia I setup a unit test

EXPECT_TRUE(std::is_trivially_copyable<decimal32>::value);

and it passed. Do you mind sharing the code that returned false?

ttnghia · 2021-02-11T03:44:56Z

Ah, I'm sorry, I did something wrong with my test thus it returned false. Yeah, your test is correct, and the result should be true.

@jrhemstad I'm not very clear. If we allow casting the column_view of 8-bytes elements (fixed_point<int>) into a column_view of 4-bytes elements (int32) then how can we index into the elements correctly? You call data() and it returns an int* pointer. Indexing using that pointer will point to the wrong place.

I think @revans2 was right in the first place: we should not allow doing logical_cast between types having different sizes. So, numeric::fixed_point<int> should not be allowed to logical_cast into int. If you want to cast, then use other types of cast (such as static_cast).

jrhemstad · 2021-02-11T14:23:53Z

@jrhemstad I'm not very clear. If we allow casting the column_view of 8-bytes elements (fixed_point<int>) into a column_view of 4-bytes elements (int32) then how can we index into the elements correctly? You call data() and it returns an int* pointer. Indexing using that pointer will point to the wrong place.

I think @revans2 was right in the first place: we should not allow doing logical_cast between types having different sizes. So, numeric::fixed_point<int> should not be allowed to logical_cast into int. If you want to cast, then use other types of cast (such as static_cast).

Your confusion is understandable. For the DECIMAL type columns we do something unique. Since the scale is common among all of the elements in the column, we don't actually store it with every element as it would be a waste of memory. We store it in the column's data_type metadata. So what is stored in each element of a decimal column is just it's integer representation, e.g., for decimal32 that is int32_t. So when you bit_cast a decimal column, you're really just bitcasting it's integer representation.

revans2 · 2021-02-11T14:40:56Z

@ttnghia

To be clear I don't want a logical_cast between different sized values. What I want is to be able to take a FLOAT32, and interpret that data as a INT32 so I can compare it to the bit representation of -0.0 as defined by the ieee specification 0x80000000. That way I can fix a long standing bug in the Spark plugin's sorting implementation because Java in some APIs for what ever stupid reason has -0.0 < 0.0 and this bleeds into Spark as well.

I would do the same thing for FLOAT64 separately doing a logical_cast to INT64 and using a different mask. There are also odd corner cases with NaNs that I would want to do similar things with. I don't want to put the ugliness of this directly into cudf, so all I really want is a relief valve that I can then use to work around it.

nvdbaranec · 2021-02-11T15:54:04Z

So when you bit_cast a decimal column, you're really just bitcasting it's integer representation.

@jrhemstad Just to clarify: are we going with logical_cast for this or bit_cast?

jrhemstad · 2021-02-11T16:43:52Z

So when you bit_cast a decimal column, you're really just bitcasting it's integer representation.

@jrhemstad Just to clarify: are we going with logical_cast for this or bit_cast?

I kind of like the idea of renaming logical_cast to bit_cast. It more closely captures the spirit of what it does.

ttnghia · 2021-02-11T20:46:42Z

Currently, I'm dealing with unit tests for this. Previously, we cast only between integer types thus we can run a unit test to compare the values of data elements before and after casting. However, now we can cast between integer and other types such as int32 and float32, or int8 and bool. So, I'm thinking of several ways to handle the unit tests:

Drop those tests completely.
Cast from column1 (type1) to column2 (type2), then for each element val2 of column2, call val1 = *reinterpret_cast<type1*>(&val2) and compare with the value from colum1.

How do you think? Since the underlying data remains unchanged, comparing the same value but viewed in different types does not make much sense.

revans2 · 2021-02-11T21:01:59Z

@ttnghia Why not use std::bit_cast in the test? That is the goal after all.

ttnghia · 2021-02-11T21:04:46Z

That's a good idea! However, bit_cast is from C++20 so we can't use it, unfortunately.

jrhemstad · 2021-02-11T21:37:26Z

One easy way to test it might be just to test that you can round trip from one type, to another, and back to the original and assert that the original column and the roundtripped column are equal.

@ttnghia

…7373) This MR resolves #6834. It renames `logical_cast` to `bit_cast` which performs casting following the rules of `std::bit_cast`: the two types (FromType and ToType) must have the same size and are trivially copyable. Authors: - Nghia Truong (@ttnghia) Approvers: - @nvdbaranec - David (@davidwendt) - Keith Kraus (@kkraus14) URL: #7373

revans2 added feature request New feature or request Needs Triage Need team to review and classify Java Affects Java cuDF API. Spark Functionality that helps Spark RAPIDS labels Nov 23, 2020

This was referenced Nov 23, 2020

[BUG] Distinct count of floating point values differs with regular spark NVIDIA/spark-rapids#837

Closed

[BUG] -0.0 vs 0.0 is a hot mess NVIDIA/spark-rapids#294

Open

kkraus14 added libcudf Affects libcudf (C++/CUDA) code. good first issue Good for newcomers and removed Needs Triage Need team to review and classify labels Dec 3, 2020

revans2 changed the title ~~[FEA] logical_cast from FLOAT32 ot INT32 and FLOAT64 to INT64 or bit_cast for the same~~ [FEA] logical_cast from FLOAT32 to INT32 and FLOAT64 to INT64 or bit_cast for the same Dec 16, 2020

ttnghia self-assigned this Feb 8, 2021

ttnghia mentioned this issue Feb 11, 2021

Rename logical_cast to bit_cast and allow additional conversions #7373

Merged

rapids-bot bot closed this as completed in #7373 Mar 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] logical_cast from FLOAT32 to INT32 and FLOAT64 to INT64 or bit_cast for the same #6834

[FEA] logical_cast from FLOAT32 to INT32 and FLOAT64 to INT64 or bit_cast for the same #6834

revans2 commented Nov 23, 2020

nvdbaranec commented Feb 10, 2021 •

edited

Loading

jrhemstad commented Feb 10, 2021 •

edited

Loading

ttnghia commented Feb 10, 2021 •

edited

Loading

jrhemstad commented Feb 10, 2021

ttnghia commented Feb 11, 2021 •

edited

Loading

jrhemstad commented Feb 11, 2021

codereport commented Feb 11, 2021

ttnghia commented Feb 11, 2021 •

edited

Loading

jrhemstad commented Feb 11, 2021

revans2 commented Feb 11, 2021

nvdbaranec commented Feb 11, 2021

jrhemstad commented Feb 11, 2021

ttnghia commented Feb 11, 2021 •

edited

Loading

revans2 commented Feb 11, 2021

ttnghia commented Feb 11, 2021 •

edited

Loading

jrhemstad commented Feb 11, 2021

[FEA] logical_cast from FLOAT32 to INT32 and FLOAT64 to INT64 or bit_cast for the same #6834

[FEA] logical_cast from FLOAT32 to INT32 and FLOAT64 to INT64 or bit_cast for the same #6834

Comments

revans2 commented Nov 23, 2020

nvdbaranec commented Feb 10, 2021 • edited Loading

jrhemstad commented Feb 10, 2021 • edited Loading

ttnghia commented Feb 10, 2021 • edited Loading

jrhemstad commented Feb 10, 2021

ttnghia commented Feb 11, 2021 • edited Loading

jrhemstad commented Feb 11, 2021

codereport commented Feb 11, 2021

ttnghia commented Feb 11, 2021 • edited Loading

jrhemstad commented Feb 11, 2021

revans2 commented Feb 11, 2021

nvdbaranec commented Feb 11, 2021

jrhemstad commented Feb 11, 2021

ttnghia commented Feb 11, 2021 • edited Loading

revans2 commented Feb 11, 2021

ttnghia commented Feb 11, 2021 • edited Loading

jrhemstad commented Feb 11, 2021

nvdbaranec commented Feb 10, 2021 •

edited

Loading

jrhemstad commented Feb 10, 2021 •

edited

Loading

ttnghia commented Feb 10, 2021 •

edited

Loading

ttnghia commented Feb 11, 2021 •

edited

Loading

ttnghia commented Feb 11, 2021 •

edited

Loading

ttnghia commented Feb 11, 2021 •

edited

Loading

ttnghia commented Feb 11, 2021 •

edited

Loading