[BUG] csv reader of float32 shows extra digits #1340

Closed
daxiongshu opened this issue Apr 3, 2019 · 2 comments
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@daxiongshu

Describe the bug
For example, the CSV reader sometimes shows 6 decimal digits for a float32 column when the column in the CSV file has only 4. The extra digits at the end are mostly 00, 01, or 99.

Steps/Code to reproduce bug
When reading a sample csv file as follows:

ID_code,target,var_0
train_0,0,8.9255
train_1,0,11.5006
train_2,0,8.6093

The code is as follows:

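import cudf as gd
import pandas as pd

# Read with cuDF, giving explicit column names and dtypes; var_0 is parsed as float32.
col_names = ['ID_code','target','var_0']
dtypes = ['str','int32','float32']
gdf = gd.read_csv('sample.csv',names=col_names,dtype=dtypes,skiprows=1)  # skiprows=1 skips the header row
print('cudf read_csv')
print(gdf.to_pandas().head())

print()
# Read the same file with pandas; with no dtype given, var_0 is parsed as float64.
print('pandas read_csv')
pdf = pd.read_csv('sample.csv')
print(pdf.head())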

The output is as follows and the mismatch is in the last row of var_0.

cudf read_csv
   ID_code  target      var_0
0  train_0       0   8.925500
1  train_1       0  11.500600
2  train_2       0   8.609301

pandas read_csv
   ID_code  target    var_0
0  train_0       0   8.9255
1  train_1       0  11.5006
2  train_2       0   8.6093

Expected behavior
The same output as pandas and as the input csv file.

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Method of cuDF install: conda 0.6 install, version 0.6.1+0.gbeb4ef3.dirty.

Additional context
For sensitive data like finance data, the dataframe has to be exactly the same as the csv file.
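
If exact agreement with the CSV text is a hard requirement, one option is to keep that column out of binary floating point altogether. The following is a minimal sketch using only Python's standard library (`csv` and `decimal`). It is not a cuDF feature, and the inline `SAMPLE` string and the `read_exact` helper are assumptions made for illustration, standing in for `sample.csv`.

```python
import csv
import io
from decimal import Decimal

# Stand-in for sample.csv from the report above (an assumption for illustration).
SAMPLE = """ID_code,target,var_0
train_0,0,8.9255
train_1,0,11.5006
train_2,0,8.6093
"""

def read_exact(text):
    """Parse var_0 as Decimal so each value keeps the digits written in the file."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append((row["ID_code"], int(row["target"]), Decimal(row["var_0"])))
    return rows

rows = read_exact(SAMPLE)
for id_code, target, var_0 in rows:
    print(id_code, target, var_0)
```

Decimal arithmetic runs on the CPU and is much slower than float32, so this trades speed for exactness.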

@daxiongshu daxiongshu added Needs Triage Need team to review and classify bug Something isn't working labels Apr 3, 2019
@harrism harrism added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Apr 3, 2019
@harrism
Member

harrism commented Apr 3, 2019

"For sensitive data like finance data, the dataframe has to be exactly the same as the csv file."

It is not actually possible to store exact decimal values in floating point. Binary numbers (like float32) cannot exactly represent all decimal numbers. For example, try entering 8.6093 into the Decimal Representation box on this site: https://www.h-schmidt.net/FloatConverter/IEEE754.html

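You will see that the number actually stored in float32 is 8.60929965972900390625, not 8.6093. At 6 decimal places this value prints as 8.609300, and the next representable float32 above it (about 8.6093006) prints as 8.609301, so digits beyond the fourth decimal place are not meaningful here.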
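
The stored value can also be checked locally without the website. This is a rough sketch using only the standard library: `struct` rounds a Python float to float32, and `Decimal` prints the exact binary value. The helper names `to_float32` and `next_float32_up` are made up for this example.

```python
import struct
from decimal import Decimal

def to_float32(x):
    """Round a Python float (binary64) to the nearest float32, returned as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def next_float32_up(x):
    """Return the next representable float32 above a positive finite float32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]

stored = to_float32(8.6093)
above = next_float32_up(stored)

print(Decimal(stored))                   # exact decimal expansion of the stored float32
print(f"{stored:.6f}")                   # nearest float32 at 6 decimal places
print(f"{above:.6f}")                    # adjacent float32 at 6 decimal places
print(f"{stored:.4f}", f"{above:.4f}")   # both agree at 4 decimal places
```

Neighbouring float32 values in this range are about 0.00000095 apart, so the sixth decimal place can differ between adjacent values while the first four stay the same.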

The issue here is just a matter of formatting the output -- Pandas formats to 4 digits while cudf is formatting to 6 digits. Perhaps we should change the default formatting to match.

In any case, this has nothing to do with read_csv and is just an output formatting issue. @kkraus14 may have an opinion on what action should be taken.
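
To illustrate the formatting point, here is a small sketch that rounds each sample value to float32 and prints it at two precisions. It uses only the standard library and does not try to reproduce how cuDF or pandas choose their display precision; `to_float32` is a helper invented for this example.

```python
import struct

def to_float32(x):
    """Round a Python float to the nearest float32, returned as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# The var_0 values exactly as written in the sample CSV.
csv_text = ["8.9255", "11.5006", "8.6093"]

for text in csv_text:
    value = to_float32(float(text))
    four = f"{value:.4f}"   # same number of decimals as the file
    six = f"{value:.6f}"    # two more decimals than the file
    print(text, four, six, four == text)
```

Formatting to the number of decimals present in the file recovers the original text for these values, which is why the choice of display precision decides whether extra digits appear.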

@daxiongshu
Author

I see. Thank you. I tried a different CSV file and confirmed that it is just a formatting difference, and that it is consistent. In my opinion, maybe no action is needed.
