[BUG] csv reader of float32 shows extra digits #1340

daxiongshu · 2019-04-03T00:01:03Z

Describe the bug
For example, when a float32 column is read with the csv reader, the output sometimes shows 6 decimal digits even though the column in the csv file has only 4. The extra digits are mostly 00, 01 or 99 at the end.

Steps/Code to reproduce bug
When reading a sample csv file as follows:

ID_code,target,var_0
train_0,0,8.9255
train_1,0,11.5006
train_2,0,8.6093

The code is as follows:

import cudf as gd
import pandas as pd

col_names = ['ID_code','target','var_0']
dtypes = ['str','int32','float32']
gdf = gd.read_csv('sample.csv',names=col_names,dtype=dtypes,skiprows=1)
print('cudf read_csv')
print(gdf.to_pandas().head())

print()
print('pandas read_csv')
pdf = pd.read_csv('sample.csv')
print(pdf.head())

The output is as follows and the mismatch is in the last row of var_0.

cudf read_csv
   ID_code  target      var_0
0  train_0       0   8.925500
1  train_1       0  11.500600
2  train_2       0   8.609301

pandas read_csv
   ID_code  target    var_0
0  train_0       0   8.9255
1  train_1       0  11.5006
2  train_2       0   8.6093

Expected behavior
The same output as pandas and as the input csv file.

Environment details (please complete the following information):

Environment location: Bare-metal
Method of cuDF install: conda 0.6 install, version 0.6.1+0.gbeb4ef3.dirty.

Additional context
For sensitive data like finance data, the dataframe has to be exactly the same as the csv file.


harrism · 2019-04-03T00:13:29Z

"For sensitive data like finance data, the dataframe has to be exactly the same as the csv file."

It is not actually possible to store exact decimal values in floating point. Binary numbers (like float32) cannot exactly represent all decimal numbers. For example, try entering 8.6093 into the Decimal Representation box on this site: https://www.h-schmidt.net/FloatConverter/IEEE754.html

You will see that the number actually stored in float32 is 8.60929965972900390625, not 8.6093, so printing more decimal places than the csv file contains can reveal digits that were never in the file.
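
The stored value can also be checked without the website. Below is a minimal sketch using only the Python standard library. The helper names `to_float32` and `next_float32_up` are hypothetical, introduced here for illustration, and the sketch says nothing about how cudf's parser arrives at its result:

```python
import struct
from decimal import Decimal


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32, returned as a float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def next_float32_up(x: float) -> float:
    """Return the next representable float32 above a positive float32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


stored = to_float32(8.6093)
neighbour = next_float32_up(stored)

print(Decimal(stored))     # exact decimal expansion of the stored binary value
print(f"{stored:.4f}")     # four decimal places
print(f"{stored:.6f}")     # six decimal places
print(f"{neighbour:.6f}")  # six decimal places, one float32 step higher
```

Run as written, the nearest float32 expands to 8.60929965972900390625 and prints as 8.6093 at four places and 8.609300 at six. The float32 one step above it is the value that prints as 8.609301 at six places.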

The issue here is just a matter of formatting the output -- Pandas formats to 4 digits while cudf is formatting to 6 digits. Perhaps we should change the default formatting to match.

In any case, this has nothing to do with read_csv and is just an output formatting issue. @kkraus14 may have an opinion on what action should be taken.
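
If exact decimal values really are required, one option outside cudf is to keep the fields as decimals rather than binary floats. Below is a minimal sketch with the standard-library `csv` and `decimal` modules, with the sample rows from the report inlined as a string. The variable names are illustrative, and this runs on the CPU, not the GPU:

```python
import csv
import io
from decimal import Decimal

# Sample rows from the report, inlined so the sketch is self-contained.
sample = (
    "ID_code,target,var_0\n"
    "train_0,0,8.9255\n"
    "train_1,0,11.5006\n"
    "train_2,0,8.6093\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))

# Decimal parses the text digit for digit, so no binary rounding takes place.
values = [Decimal(row["var_0"]) for row in rows]

for row, value in zip(rows, values):
    print(row["ID_code"], value)

# Arithmetic on Decimal values stays exact for sums like this one.
total = sum(values)
print(total)
```

The trade-off is speed and memory: Decimal arithmetic is done in software, so this suits cases where exactness matters more than throughput.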

daxiongshu · 2019-04-03T00:46:38Z

I see. Thank you. Yeah, I tried a different csv file and confirmed that it is just a formatting issue and that the behavior is consistent. In my opinion, maybe no action is needed.

daxiongshu added the "Needs Triage" and "bug" labels Apr 3, 2019

harrism added the "Python" label and removed the "Needs Triage" label Apr 3, 2019

kkraus14 closed this as completed Jul 3, 2019

harrism mentioned this issue Feb 5, 2024

[BUG] FLOAT32 rounding more inaccurate than necessary #14528 (Open)
