
fread - if there is one string in a numeric column, guess it represents NA #2100

Open
mattdowle opened this issue Apr 4, 2017 · 4 comments


@mattdowle (Member) commented Apr 4, 2017

Instead of bumping a numeric column to character on seeing the first character value, fread could wait and see whether that was the only character value present in the whole column. If so, it could assume it is an NA value, keep the column as numeric, and warn. This saves the user having to rerun with na.strings.
(Suggested by Pasha, not me -- a great idea.)
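A minimal Python sketch of the idea (not data.table's C implementation; `scan_numeric_column` is a hypothetical helper): parse the column, collect the distinct tokens that fail numeric parsing, and only if there is exactly one such token treat it as the NA string and keep the column numeric.

```python
def scan_numeric_column(tokens):
    """If exactly one distinct non-numeric string appears in an otherwise
    numeric column, assume it encodes NA and keep the column numeric.
    Otherwise fall back to a character column."""
    non_numeric = set()   # distinct tokens that fail to parse as numbers
    values = []
    for tok in tokens:
        try:
            values.append(float(tok))
        except ValueError:
            non_numeric.add(tok)
            values.append(None)   # placeholder for NA
    if len(non_numeric) == 1:
        # one lone string: keep numeric, report the inferred NA string
        return values, non_numeric.pop()
    # zero strings (plain numeric) or several (genuine character column)
    return tokens if non_numeric else values, None
```

For example, `scan_numeric_column(["1.5", "2", "#N/A", "4"])` keeps the column numeric and reports `"#N/A"` as the inferred NA string, whereas two or more distinct strings would bump the column to character as fread does today.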

@mattdowle (Member Author)

One recurring large negative value could optionally be converted to NA as well. This would save providing a mechanism to define different numeric NA values for different columns; e.g. -999 in one column but -9999.9 in another column where 999 is a valid observation value. Especially if the large negative outlier is the only negative value in the column.
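A sketch of the "only negative" special case above (a hypothetical helper, not data.table code): if every negative entry in a column is the same repeated value while everything else is non-negative, that value is a plausible per-column NA code.

```python
def single_negative_sentinel(values):
    """Return the candidate NA code if all negative entries in the column
    are one repeated value (e.g. -999) and all other entries are
    non-negative; otherwise return None."""
    negatives = {v for v in values if v < 0}   # distinct negative values
    if len(negatives) == 1:
        return negatives.pop()   # e.g. -999.0 repeated among positives
    return None                  # no negatives, or several distinct ones
```

So `single_negative_sentinel([3.0, 7.0, -999.0, 12.0, -999.0])` flags `-999.0`, while a column with several distinct negatives (plausibly real data) is left alone.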

@mattdowle (Member Author)

As described above, one string value such as "#N/A" is relatively straightforward to detect automatically: the column would be full of numeric values apart from that one value, which isn't a valid numeric.

How to detect the "obvious" outlier when it's numeric, though; e.g. -9999, or even 99 in a column of numbers in the range [1,40]? Currently the user can pass such numeric values manually in na.strings and, if they occur in any field in any column, they'll be interpreted as NA. However, some files use different NA values in different columns; e.g. 99 in one column, but -9999 in another.

Was just speaking to Leland Wilkinson in our office about this.
How best to detect a single wild outlier, most efficiently?

Proposal: fread could use its large sample to calculate the min (min1) and max (max1) trivially, but also the 2nd smallest (min2) and 2nd largest value (max2). Simple and efficient without a 2nd pass.
Then we could do a bunch of things; e.g.,
Let diffMin = min2-min1
and diffMax = max1-max2
and range = max2-min2 (range excluding any potential single outlier)

If diffMin/range > limit or diffMax/range > limit, we could then test min1 or max1 to see if it is all one digit, such as 999 or -9999.0. If so, by default, treat it as NA automatically with a warning. This feature could be controlled (including turned off) via a new fread parameter.
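The proposal above can be sketched in Python (the names min1/min2/max1/max2 follow the text; `limit=10.0` is an arbitrary illustrative threshold, not one the issue specifies): one pass tracks the two smallest and two largest values, then the diff ratios and the all-one-digit check decide whether an extreme value looks like a sentinel.

```python
def detect_sentinel(values, limit=10.0):
    """Single pass: track smallest two (min1 <= min2) and largest two
    (max2 <= max1), then flag min1 or max1 as a candidate NA code if it
    sits far outside the range of the rest and is a repeated single
    digit like 999 or -9999."""
    min1 = min2 = float("inf")
    max1 = max2 = float("-inf")
    for v in values:
        if v < min1:
            min2, min1 = min1, v
        elif v < min2:
            min2 = v
        if v > max1:
            max2, max1 = max1, v
        elif v > max2:
            max2 = v

    rng = max2 - min2          # range excluding any single outlier
    if rng <= 0:
        return None

    def one_digit(x):
        # whole number whose digits are all the same, e.g. 999, -9999
        if float(int(x)) != x:
            return False
        return len(set(str(abs(int(x))))) == 1

    if (max1 - max2) / rng > limit and one_digit(max1):
        return max1            # e.g. 9999 among [1, 40]
    if (min2 - min1) / rng > limit and one_digit(min1):
        return min1            # e.g. -9999 among positives
    return None
```

For example, 9999 in a column otherwise spanning [1, 40] is flagged, while a clean column returns None; whatever thresholds fread chose, the two-extremes bookkeeping stays O(n) with no second pass over the sample.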

Views?

@st-pasha (Contributor)

There are naturally occurring quantities that are distributed on log-scale. These could be: GDPs of countries, individual incomes, sizes of files on disk, masses of stars, energies of particles in cosmic rays (see Oh-My-God particle), etc.

For quantities like these it is expected that the largest value would look like an outlier. Nonetheless, it would be a valid value that should not be altered or removed in any way.

Even though I agree with Leland that there are systems that encode NAs as 999s (shame on them!), and that this has led real people to make errors in their data analysis -- nevertheless, I think it is ultimately a judgement call whether a particular outlier is an NA or anything else, and fread is not in a position to make that call.

@MichaelChirico (Member) commented Sep 21, 2017

I'm with @st-pasha here (w.r.t. automatically detecting "numeric" NA). This is in keeping with the philosophy of type conversions found elsewhere in fread -- it's up to the user to know their own data set. Perhaps the most it makes sense to do in this regard would be to include an alert in verbose = TRUE output like "Value XXXX detected as anomalous in column CCCC; perhaps it represents NA in this data set?"

As an aside, I also think there are valid (ish) cases for using "numeric" NA -- the example that comes to mind is the Common Core of Data. See here -- -1, -2, -9, M, and N can all be used for "missing" data, but with different interpretations:

  • M: when alphanumeric data are missing; that is, a value is expected but none was measured.
  • -1: when numeric data are missing; that is, a value is expected but none was measured.
  • N: when alphanumeric data are not applicable; that is, a value is neither expected nor measured.
  • -2: when numeric data are not applicable; that is, a value is neither expected nor measured.
  • -9: when the submitted data item does not meet NCES data quality standards; the value is suppressed.

Basically, NA may be too catch-all for a variety of reasons why a data value is missing; in certain situations, an analyst may want to incorporate this.
