
fread - if there is one string in a numeric column, guess it represents NA #2100

Open
mattdowle opened this issue Apr 4, 2017 · 4 comments


@mattdowle (Member) commented Apr 4, 2017

Instead of bumping a numeric column to character on seeing the first character value, fread could wait and see whether that was the only character value present in the whole column. If so, it could assume it is an NA value, keep the column as numeric, and warn. This saves the user having to rerun with na.strings.
(Suggested by Pasha, not me -- a great idea.)
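A minimal Python sketch of the idea (not data.table's C implementation; `scan_numeric_column` is a hypothetical helper): parse the column, collect the distinct tokens that fail numeric parsing, and only if there is exactly one such token treat it as the NA string and keep the column numeric.

```python
def scan_numeric_column(tokens):
    """If exactly one distinct non-numeric string appears in an otherwise
    numeric column, assume it encodes NA and keep the column numeric.
    Otherwise fall back to a character column."""
    non_numeric = set()   # distinct tokens that fail to parse as numbers
    values = []
    for tok in tokens:
        try:
            values.append(float(tok))
        except ValueError:
            non_numeric.add(tok)
            values.append(None)   # placeholder for NA
    if len(non_numeric) == 1:
        # one lone string: keep numeric, report the inferred NA string
        return values, non_numeric.pop()
    # zero strings (plain numeric) or several (genuine character column)
    return tokens if non_numeric else values, None
```

For example, `scan_numeric_column(["1.5", "2", "#N/A", "4"])` keeps the column numeric and reports `"#N/A"` as the inferred NA string, whereas two or more distinct strings would bump the column to character as fread does today.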

@mattdowle (Member Author)

One recurring large negative value could optionally be converted to NA as well. This would save providing a mechanism to define different numeric NA values for different columns; e.g. -999 in one column but -9999.9 in another column where 999 is a valid observation value. Especially if the large negative outlier is the only negative value in the column.
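A sketch of the "only negative" special case above (a hypothetical helper, not data.table code): if every negative entry in a column is the same repeated value while everything else is non-negative, that value is a plausible per-column NA code.

```python
def single_negative_sentinel(values):
    """Return the candidate NA code if all negative entries in the column
    are one repeated value (e.g. -999) and all other entries are
    non-negative; otherwise return None."""
    negatives = {v for v in values if v < 0}   # distinct negative values
    if len(negatives) == 1:
        return negatives.pop()   # e.g. -999.0 repeated among positives
    return None                  # no negatives, or several distinct ones
```

So `single_negative_sentinel([3.0, 7.0, -999.0, 12.0, -999.0])` flags `-999.0`, while a column with several distinct negatives (plausibly real data) is left alone.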

@mattdowle (Member Author)

As described above, one string value such as "#N/A" is relatively straightforward to detect automatically: the column would be full of numeric values apart from that one value, which isn't a valid numeric.

How to detect the "obvious" outlier when it's numeric, though; e.g. -9999, or even 99 in a column of numbers in the range [1,40]? Currently the user can pass such numeric values manually in na.strings and, if they occur in any field in any column, they'll be interpreted as NA. However, some files use different NA values in different columns; e.g. 99 in one column, but -9999 in another.

Was just speaking to Leland Wilkinson in our office about this.
How best to detect a single wild outlier, most efficiently?

Proposal: fread could use its large sample to calculate the min (min1) and max (max1) trivially, but also the 2nd smallest (min2) and 2nd largest value (max2). Simple and efficient without a 2nd pass.
Then we could do a bunch of things; e.g.,
Let diffMin = min2-min1
and diffMax = max1-max2
and range = max2-min2 (range excluding any potential single outlier)

If diffMin/range > limit or diffMax/range > limit, we could then test min1 or max1 to see if it is all one digit, such as 999 or -9999.0. If so, by default, treat it as NA automatically with a warning. This feature could be controlled (including turned off) via a new fread parameter.
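The proposal above can be sketched in Python (the names min1/min2/max1/max2 follow the text; `limit=10.0` is an arbitrary illustrative threshold, not one the issue specifies): one pass tracks the two smallest and two largest values, then the diff ratios and the all-one-digit check decide whether an extreme value looks like a sentinel.

```python
def detect_sentinel(values, limit=10.0):
    """Single pass: track smallest two (min1 <= min2) and largest two
    (max2 <= max1), then flag min1 or max1 as a candidate NA code if it
    sits far outside the range of the rest and is a repeated single
    digit like 999 or -9999."""
    min1 = min2 = float("inf")
    max1 = max2 = float("-inf")
    for v in values:
        if v < min1:
            min2, min1 = min1, v
        elif v < min2:
            min2 = v
        if v > max1:
            max2, max1 = max1, v
        elif v > max2:
            max2 = v

    rng = max2 - min2          # range excluding any single outlier
    if rng <= 0:
        return None

    def one_digit(x):
        # whole number whose digits are all the same, e.g. 999, -9999
        if float(int(x)) != x:
            return False
        return len(set(str(abs(int(x))))) == 1

    if (max1 - max2) / rng > limit and one_digit(max1):
        return max1            # e.g. 9999 among [1, 40]
    if (min2 - min1) / rng > limit and one_digit(min1):
        return min1            # e.g. -9999 among positives
    return None
```

For example, 9999 in a column otherwise spanning [1, 40] is flagged, while a clean column returns None; whatever thresholds fread chose, the two-extremes bookkeeping stays O(n) with no second pass over the sample.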

Views?

@st-pasha (Contributor)

There are naturally occurring quantities that are distributed on log-scale. These could be: GDPs of countries, individual incomes, sizes of files on disk, masses of stars, energies of particles in cosmic rays (see Oh-My-God particle), etc.

For quantities like these it is expected that the largest value would look like an outlier. Nonetheless, it would be a valid value that should not be altered or removed in any way.

Even though I agree with Leland that there are systems that encode NAs as 999s (shame on them!), and that this has led real people to make errors in their data analysis -- nevertheless, I think it is ultimately a judgement call whether a particular outlier is an NA or anything else, and fread is not in a position to make that call.

@MichaelChirico (Member) commented Sep 21, 2017

I'm with @st-pasha here (w.r.t. automatically detecting "numeric" NA). This is in keeping with the philosophy of type conversions found elsewhere in fread -- it's up to the user to know their own data set. Perhaps the most it makes sense to do in this regard would be to include an alert in verbose = TRUE output like "Value XXXX detected as anomalous in column CCCC; perhaps it represents NA in this data set?"

As an aside, I also think there are valid (ish) cases for using "numeric" NA -- the example that comes to mind is the Common Core of Data. See here -- -1, -2, -9, M, and N can all be used for "missing" data, but with different interpretations:

  • M: when alphanumeric data are missing; that is, a value is expected but none was measured.
  • -1: when numeric data are missing; that is, a value is expected but none was measured.
  • N: when alphanumeric data are not applicable; that is, a value is neither expected nor measured.
  • -2: when numeric data are not applicable; that is, a value is neither expected nor measured.
  • -9: when the submitted data item does not meet NCES data quality standards; the value is suppressed.

Basically, NA may be too catch-all for a variety of reasons why a data value is missing; in certain situations, an analyst may want to incorporate this.
