
Column is not converting to numeric when errors=coerce #17125

Closed
Sarickshah opened this issue Jul 31, 2017 · 13 comments
Labels
IO CSV (read_csv, to_csv); Numeric (Arithmetic, Comparison, and Logical operations)
Milestone

Comments

@Sarickshah

Sarickshah commented Jul 31, 2017

I read in my dataframe with

df = pd.read_csv('df.csv')

And then I run the code:

df['a'] = pd.to_numeric(df['a'], errors='coerce')

but the column does not get converted. When I use errors='raise' it shows me the values that are not convertible, but with errors='coerce' those should be replaced with NaN. This was working perfectly in pandas 0.19, and I updated to 0.20.3. Did the behavior of to_numeric change between the two versions?
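For reference, a minimal sketch (hypothetical data, not the reporter's CSV) of what errors='coerce' is expected to do with ordinary in-range values:

```python
import pandas as pd

# Hypothetical in-range data: errors='coerce' should replace any
# unparseable string with NaN and return a float64 Series.
s = pd.Series(["1.5", "7", "not_a_number"])
out = pd.to_numeric(s, errors="coerce")
print(out.dtype)  # float64; the bad entry becomes NaN
```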

@jreback
Contributor

jreback commented Jul 31, 2017

this would need a reproducible example

@gfyoung added the IO CSV and Numeric labels Jul 31, 2017
@mficek

mficek commented Jul 31, 2017

I think it's a duplicate of #17007

@gfyoung
Member

gfyoung commented Jul 31, 2017

@mficek : Potentially, but I can't confirm yet

@jorisvandenbossche
Member

It's certainly not an exact duplicate: the example shown in #17007 also fails on 0.19, while the example here is reported to have worked on 0.19.

@Sarickshah
Author

Sarickshah commented Aug 1, 2017

import pandas as pd

df = pd.DataFrame({'number':["00000_81234523499", "81654839"], 'date':['2017-07-28', '2017-07-29']})

pd.to_numeric(df.number, errors='coerce')

And the numbers stay as strings

@gfyoung
Member

gfyoung commented Aug 1, 2017

@Sarickshah : Thanks for this! Could you do us a favor and move your example into the issue description above? Also, if you could provide the output you're seeing as well as the expected output, that would be great.

@jorisvandenbossche
Member

On 0.19 the first value is coerced (which seems expected: it fails to parse, and errors='coerce' turns parsing failures into NaN):

In [7]: pd.to_numeric(df.number, errors='coerce')
Out[7]: 
0           NaN
1    81654839.0
Name: number, dtype: float64

@jorisvandenbossche added this to the 0.21.0 milestone Aug 1, 2017
@jorisvandenbossche
Member

Actually, this seems to work as well on 0.20.3:

In [1]: df = pd.DataFrame({'number':["00000_81234523499", "81654839"], 'date':['2017-07-28', '2017-07-29']})
   ...: pd.to_numeric(df.number, errors='coerce')
Out[1]: 
0           NaN
1    81654839.0
Name: number, dtype: float64

In [2]: pd.__version__
Out[2]: '0.20.3'

@Sarickshah Can you show the exact output of what you get?

@jreback
Contributor

jreback commented Sep 23, 2017

looks fixed in 0.20.3.

@jreback closed this as completed Sep 23, 2017
@blakebjorn

blakebjorn commented Sep 25, 2017

Doesn't seem to be fixed. Could it be something to do with the Python binaries, if it isn't reproducible? (Windows 7 x64 here)

import pandas as pd
df = pd.DataFrame([{"UPC":"12345678901234567890"},{"UPC":"1234567890"},{"UPC":"ITEM"}])
print(pd.to_numeric(df['UPC'],errors='coerce'))
print(pd.__version__)

0    12345678901234567890
1              1234567890
2                    ITEM
Name: UPC, dtype: object
0.20.3

I think it has something to do with the long (>20 character) number strings. This is taken from a sheet of ~6 million rows. If I do something like:

import numpy as np

def fix_number(e):
    try:
        return float(e)
    except (TypeError, ValueError):
        return np.nan

df['UPC'] = df['UPC'].apply(fix_number)

I get 5.2 million duplicate values - it seems like the function works until it encounters a problematic value 0.8 million rows in, and then assigns the last valid return value to the remaining 5.2 million rows

Edit - This works:

print(pd.to_numeric(df2['UPC'].apply(lambda x: x[:19] if len(x)>19 else x),errors='coerce'))

but this doesn't:

print(pd.to_numeric(df2['UPC'].apply(lambda x: x[:20] if len(x)>20 else x),errors='coerce'))

So it looks like any numeric string 20 or more characters long will break to_numeric
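A possible workaround at this point (an element-wise fallback, not the fix the thread eventually converges on) is to convert each value individually, since Python's own float() parses digit strings of any length:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"UPC": ["12345678901234567890", "1234567890", "ITEM"]})

def to_float_or_nan(value):
    # float() handles arbitrarily long digit strings, so the
    # 20-character values that trip up to_numeric are fine here.
    try:
        return float(value)
    except (TypeError, ValueError):
        return np.nan

converted = df["UPC"].apply(to_float_or_nan)
print(converted.dtype)  # float64
```

This trades the speed of the vectorized path for correctness on long values.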

@jorisvandenbossche
Member

Indeed, that example is not working correctly (both on master and in 0.20.3). The other example does work, so the difference indeed seems to be the large number.

So it seems that when the value would be converted to uint64 (instead of int64), the errors='coerce' is not working.

@jorisvandenbossche
Member

In [89]: s = pd.Series(["12345678901234567890", "1234567890", "ITEM"])

In [90]: pd.to_numeric(s, errors='coerce')
Out[90]: 
0    12345678901234567890
1              1234567890
2                    ITEM
dtype: object

In [91]: s = pd.Series(["12345678901234567890", "1234567890"])

In [92]: pd.to_numeric(s, errors='coerce')
Out[92]: 
0    12345678901234567890
1              1234567890
dtype: uint64

So you can see that the parsing of the big (20-character) value itself is working, as the return value is uint64. When a NaN has to be introduced, the result should simply be converted to float64, as happens with int64.
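The analogous int64 case can be checked directly: values that fit in int64, mixed with an unparseable entry, do come back as float64 with NaN (a small sketch of the behavior described above):

```python
import pandas as pd

# int64-range values plus one bad entry: coercion works, and the
# introduced NaN forces a float64 result.
s = pd.Series(["9223372036854775807", "1234567890", "ITEM"])  # first value is int64 max
out = pd.to_numeric(s, errors="coerce")
print(out.dtype)  # float64
```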

@jreback modified the milestones: 0.21.0, Next Major Release Sep 25, 2017
blakebjorn referenced this issue in blakebjorn/pandas Sep 26, 2017
@blakebjorn

I think the biggest point of confusion is that no exception is raised when errors="coerce" fails to coerce anything. As this is more a limitation of the underlying NumPy dtypes, I don't think there is a real fix here.

Something simple like this would solve the point of confusion, and users could figure out how best to handle it from there, whether by dropping large numbers from the dataframe or by leaving them as objects and manually pruning errors.

I don't think coercing uint64 to float64 is the best way to handle it, and I would go as far as to suggest a warning for the int64 -> float64 conversion, because anything above 2**53 will create unforeseen problems for people unaware of float64's limitations. For example:

print("%f" % np.float64(9007199254740992))
print("%f" % np.float64(9007199254740993))

9007199254740992.000000
9007199254740992.000000
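The 2**53 cutoff mentioned above can be verified with a small round-trip check (a hypothetical helper, just to illustrate the precision limit):

```python
def float64_is_exact(n: int) -> bool:
    # float64 has a 53-bit significand, so integers up to 2**53
    # round-trip through float exactly; above that, some do not.
    return int(float(n)) == n

print(float64_is_exact(2**53))      # True
print(float64_is_exact(2**53 + 1))  # False
```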

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 9, 2017
@jorisvandenbossche modified the milestones: Next Major Release, 0.21.1 Oct 9, 2017