
Column is not converting to numeric when errors=coerce #17125

Closed
Sarickshah opened this issue Jul 31, 2017 · 13 comments
Labels
IO CSV (read_csv, to_csv); Numeric (Arithmetic, Comparison, and Logical operations)
Milestone

Comments

@Sarickshah

Sarickshah commented Jul 31, 2017

I read in my dataframe with

df = pd.read_csv('df.csv')

And then I run the code:

df['a'] = pd.to_numeric(df['a'], errors='coerce')

but the column does not get converted. When I use errors='raise' it shows me the values that are not convertible, but with errors='coerce' those should be replaced with NaN. This was working perfectly in pandas 0.19, and I updated to 0.20.3. Did the behavior of to_numeric change between the two versions?
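For reference, a minimal sketch (hypothetical data, not the reporter's CSV) of what errors='coerce' is expected to do with ordinary in-range values:

```python
import pandas as pd

# Hypothetical in-range data: errors='coerce' should replace any
# unparseable string with NaN and return a float64 Series.
s = pd.Series(["1.5", "7", "not_a_number"])
out = pd.to_numeric(s, errors="coerce")
print(out.dtype)  # float64; the bad entry becomes NaN
```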

@jreback
Contributor

jreback commented Jul 31, 2017

this would need a reproducible example

@gfyoung added the IO CSV and Numeric labels Jul 31, 2017
@mficek

mficek commented Jul 31, 2017

I think it's a duplicate of #17007

@gfyoung
Member

gfyoung commented Jul 31, 2017

@mficek : Potentially, but I can't confirm yet

@jorisvandenbossche
Member

It's certainly not an exact duplicate: the example shown in #17007 also fails on 0.19, while the example here is reported to have worked on 0.19.

@Sarickshah
Author

Sarickshah commented Aug 1, 2017

import pandas as pd

df = pd.DataFrame({'number':["00000_81234523499", "81654839"], 'date':['2017-07-28', '2017-07-29']})

pd.to_numeric(df.number, errors='coerce')

And the numbers stay as strings

@gfyoung
Member

gfyoung commented Aug 1, 2017

@Sarickshah : Thanks for this! Could you do us a favor and move your example into the issue description above? Also, if you could provide the output you're seeing as well as the expected output, that would be great.

@jorisvandenbossche
Member

On 0.19 the first value is coerced (which seems expected: it fails to parse, and errors='coerce' turns parsing failures into NaN):

In [7]: pd.to_numeric(df.number, errors='coerce')
Out[7]: 
0           NaN
1    81654839.0
Name: number, dtype: float64

@jorisvandenbossche added this to the 0.21.0 milestone Aug 1, 2017
@jorisvandenbossche
Member

Actually, this seems to work as well on 0.20.3:

In [1]: df = pd.DataFrame({'number':["00000_81234523499", "81654839"], 'date':['2017-07-28', '2017-07-29']})
   ...: pd.to_numeric(df.number, errors='coerce')
Out[1]: 
0           NaN
1    81654839.0
Name: number, dtype: float64

In [2]: pd.__version__
Out[2]: '0.20.3'

@Sarickshah Can you show the exact output of what you get?

@jreback
Contributor

jreback commented Sep 23, 2017

looks fixed in 0.20.3.

@jreback closed this as completed Sep 23, 2017
@blakebjorn

blakebjorn commented Sep 25, 2017

Doesn't seem to be fixed. Could it be something to do with the Python binaries, if it isn't reproducible? (Windows 7 x64 here)

import pandas as pd
df = pd.DataFrame([{"UPC":"12345678901234567890"},{"UPC":"1234567890"},{"UPC":"ITEM"}])
print(pd.to_numeric(df['UPC'],errors='coerce'))
print(pd.__version__)

0    12345678901234567890
1              1234567890
2                    ITEM
Name: UPC, dtype: object
0.20.3

I think it has something to do with the long (>20 character) number strings. This is taken from a sheet of ~6 million rows. If I do something like:

import numpy as np

def fix_number(e):
    try:
        return float(e)
    except (TypeError, ValueError):
        return np.nan

df['UPC'] = df['UPC'].apply(fix_number)

I get 5.2 million duplicate values - it seems like the function works until it encounters a problematic value 0.8 million rows in, and then assigns the last valid return value to the remaining 5.2 million rows

Edit - This works:

print(pd.to_numeric(df2['UPC'].apply(lambda x: x[:19] if len(x)>19 else x),errors='coerce'))

but this doesn't:

print(pd.to_numeric(df2['UPC'].apply(lambda x: x[:20] if len(x)>20 else x),errors='coerce'))

So it looks like any numeric string 20 or more characters long will break to_numeric
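A possible workaround at this point (an element-wise fallback, not the fix the thread eventually converges on) is to convert each value individually, since Python's own float() parses digit strings of any length:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"UPC": ["12345678901234567890", "1234567890", "ITEM"]})

def to_float_or_nan(value):
    # float() handles arbitrarily long digit strings, so the
    # 20-character values that trip up to_numeric are fine here.
    try:
        return float(value)
    except (TypeError, ValueError):
        return np.nan

converted = df["UPC"].apply(to_float_or_nan)
print(converted.dtype)  # float64
```

This trades the speed of the vectorized path for correctness on long values.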

@jorisvandenbossche
Member

Indeed, that example is not working correctly (both on master and in 0.20.3). The other example does work, so the difference indeed seems to be the large number.

So it seems that when the value would be converted to uint64 (instead of int64), the errors='coerce' is not working.

@jorisvandenbossche
Member

In [89]: s = pd.Series(["12345678901234567890", "1234567890", "ITEM"])

In [90]: pd.to_numeric(s, errors='coerce')
Out[90]: 
0    12345678901234567890
1              1234567890
2                    ITEM
dtype: object

In [91]: s = pd.Series(["12345678901234567890", "1234567890"])

In [92]: pd.to_numeric(s, errors='coerce')
Out[92]: 
0    12345678901234567890
1              1234567890
dtype: uint64

So you can see that the parsing of the big (20-character) value itself is working, as the return value is uint64. When a NaN has to be introduced, the result should simply be converted to float64, as happens with int64.
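The analogous int64 case can be checked directly: values that fit in int64, mixed with an unparseable entry, do come back as float64 with NaN (a small sketch of the behavior described above):

```python
import pandas as pd

# int64-range values plus one bad entry: coercion works, and the
# introduced NaN forces a float64 result.
s = pd.Series(["9223372036854775807", "1234567890", "ITEM"])  # first value is int64 max
out = pd.to_numeric(s, errors="coerce")
print(out.dtype)  # float64
```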

@jreback modified the milestones: 0.21.0, Next Major Release Sep 25, 2017
blakebjorn referenced this issue in blakebjorn/pandas Sep 26, 2017
@blakebjorn

I think the biggest point of confusion is that no exception is raised when errors="coerce" fails to coerce anything. As this is more a limitation of the underlying NumPy dtypes, I don't think there is a real fix here.

Something simple like this would solve the point of confusion, and users could figure out how best to handle it from there, whether by dropping large numbers from the dataframe or by leaving them as objects and manually pruning errors.

I don't think coercing uint64 to float64 is the best way to handle it, and I would go as far as to suggest a warning for the int64 -> float64 conversion, because anything above 2**53 will create unforeseen problems for people unaware of float64's limitations. For example:

print("%f" % np.float64(9007199254740992))
print("%f" % np.float64(9007199254740993))

9007199254740992.000000
9007199254740992.000000
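The 2**53 cutoff mentioned above can be verified with a small round-trip check (a hypothetical helper, just to illustrate the precision limit):

```python
def float64_is_exact(n: int) -> bool:
    # float64 has a 53-bit significand, so integers up to 2**53
    # round-trip through float exactly; above that, some do not.
    return int(float(n)) == n

print(float64_is_exact(2**53))      # True
print(float64_is_exact(2**53 + 1))  # False
```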

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 9, 2017
@jorisvandenbossche modified the milestones: Next Major Release, 0.21.1 Oct 9, 2017