Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

na.strings getOption() added so ,, can be read as NA by default in future #2652

Merged
merged 19 commits into from
Mar 20, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 28 additions & 7 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,26 @@

### Changes in v1.10.5 ( in development )

#### NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES

1. `fread()`'s `na.strings=` argument :
```
"NA" # old default
getOption("datatable.na.strings", "NA") # this release; i.e. the same; no change yet
getOption("datatable.na.strings", "") # future release
```
This option controls how `,,` is read in character columns. It does not affect numeric columns which read `,,` as `NA` regardless. We would like `,,`=>`NA` for consistency with numeric types, and `,"",`=>empty string to be the standard default for `fwrite/fread` character columns so that `fread(fwrite(DT))==DT` without needing any change to any parameters. `fwrite` has never written `NA` as `"NA"` in case `"NA"` is a valid string in the data; e.g., 2 character id columns sometimes do. Instead, `fwrite` has always written `,,` by default for an `<NA>` in a character columns. The use of R's `getOption()` allows users to move forward now, using `options(datatable.fread.na.strings="")`, or restore old behaviour when the default's default is changed in future, using `options(datatable.fread.na.strings="NA")`.

2. `fread()` and `fwrite()`'s `logical01=` argument :
```
logical01 = FALSE # old default
getOption("datatable.logical01", FALSE) # this release; i.e. the same; no change yet
getOption("datatable.logical01", TRUE) # future release
```
This option controls whether a column of all 0's and 1's is read as `integer`, or `logical` directly to avoid needing to change the type afterwards to `logical` or use `colClasses`. `0/1` is smaller and faster than `"TRUE"/"FALSE"`, which can make a significant difference to space and time the more `logical` columns there are. When the default's default changes to `TRUE` for `fread` we do not expect much impact since all arithmetic operators that are currently receiving 0's and 1's as type `integer` (think `sum()`) but instead could receive `logical`, would return exactly the same result on the 0's and 1's as `logical` type. However, code that is manipulating column types using `is.integer` or `is.logical` on `fread`'s result, could require change. It could be painful if `DT[(logical_column)]` (i.e. `DT[logical_column==TRUE]`) changed behaviour due to `logical_column` no longer being type `logical` but `integer`. But that is not the change proposed. The change is the other way around; i.e., a previously `integer` column holding only 0's and 1's would now be type `logical`. Since it's that way around, we believe the scope for breakage is limited. We think a lot of code is converting 0/1 integer columns to logical anyway, either using `colClasses=` or afterwards with an assign. For `fwrite`, the level of breakage depends on the consumer of the output file. We believe `0/1` is a better more standard default choice to move to. See notes below about improvements to `fread`'s sampling for type guessing, and automatic rereading in the rare cases of out-of-sample type surprises.

These options are meant for temporary use to aid your migration, [#2652](https://github.com/Rdatatable/data.table/pull/2652). You are not meant to set them to the old default and then not migrate your code that is dependent on the default. Either set the argument explicitly so your code is not dependent on the default, or change the code to cope with the new default. Over the next few years we will slowly start to remove these options, warning you if you are using them, and return to a simple default. See the history of NEWS and NEWS.0 for past migrations that have, generally speaking, been successfully managed in this way. For example, at the end of NOTES for this version (below in this file) is a note about the usage of `datatable.old.unique.by.key` now warning, as you were warned it would do over a year ago. When that change was introduced, the default was changed and that option provided an option to restore the old behaviour. These `fread`/`fwrite` changes are even more cautious and not even changing the default's default yet. Giving you extra warning by way of this notice to move forward. And giving you a chance to object.

#### NEW FEATURES

1. `fread()`:
Expand All @@ -17,25 +37,24 @@
* `dec=','` is now implemented directly so there is no dependency on locale. The options `datatable.fread.dec.experiment` and `datatable.fread.dec.locale` have been removed.
* `\\r\\r\\n` line endings are now handled such as produced by `base::download.file()` when it doubles up `\\r`. Other rare line endings (`\\r` and `\\n\\r`) are now more robust.
* Mixed line endings are now handled; e.g. a file formed by concatenating a Unix file and a Windows file so that some lines end with `\\n` while others end with `\\r\\n`.
* Improved automatic detection of whether the first row is column names by comparing the types of the fields on the first and second row.
* Improved automatic detection of whether the first row is column names by comparing the types of the fields on the first row against the column types ascertained by the 10,000 rows sample (or `colClasses` if provided). If a numeric column has a string value at the top, then column names are deemed present.
* Detects GB-18030 and UTF-16 encodings and in verbose mode prints a message about BOM detection.
* Detects and ignores trailing ^Z end-of-file control character sometimes created on MS DOS/Windows, [#1612](https://github.com/Rdatatable/data.table/issues/1612). Thanks to Gergely Daróczi for reporting and providing a file.
* Added option `logical01` to read a column of only `0`s and `1`s as `logical`, default `TRUE` for convenience in most cases. The large sample of rows throughout the file means that `fread` will be confident that the column really does just contain `0`s and `1`s, enabling and encouraging this convenient and efficient choice to save needing conversion afterwards or setting `colClasses` manually. In R, `logical` is `integer` anyway and can be treated as such in calculations. Further, it is no longer allowed to have mixed-case literals within a single column; i.e., a column of `TRUE/FALSE`s is ok, as well as `True/False`s and `true/false`s, but mixing different styles together is not.
* Added ability to recognize and parse hexadecimal floating point numbers, as used for example in Java. Thanks for @scottstanfield [#2316](https://github.com/Rdatatable/data.table/issues/2316) for the report.
* Now handles floating-point NaN values in a wide variety of formats, including `NaN`, `sNaN`, `1.#QNAN`, `NaN1234`, `#NUM!` and others, [#1800](https://github.com/Rdatatable/data.table/issues/1800). Thanks to Jori Liesenborgs for highlighting and the PR.
* If negative numbers are passed to `select=` the out-of-range error now suggests `drop=` instead, [#2423](https://github.com/Rdatatable/data.table/issues/2423). Thanks to Michael Chirico for the suggestion.
* `sep=NULL` or `sep=""` (i.e., no column separator) can now be used to specify single column input reliably like `base::readLines`, [#1616](https://github.com/Rdatatable/data.table/issues/1616). `sep='\\n'` still works (even on Windows where line ending is actually `\\r\\n`) but `NULL` or `""` are now documented and recommended. Thanks to Dmitriy Selivanov for the pull request and many others for comments. As before, `sep=NA` is not valid; use the default `"auto"` for automatic separator detection. `sep='\\n'` may be deprecated in future.
* Single-column input with blank lines is now valid and the blank lines are significant (meaning an NA in the single column). The blank lines are significant even at the very end, which may be surprising on first glance. The change is so that `fread(fwrite(DT))==DT` for single-column inputs containing NA which are written as blank. There is no change when `ncol>1` (i.e., input stops with detailed warning at the first blank line) because a blank line when `ncol>1` is invalid input due to no separators present instead of `ncol-1` separators.
* `sep=NULL` or `sep=""` (i.e., no column separator) can now be used to specify single column input reliably like `base::readLines`, [#1616](https://github.com/Rdatatable/data.table/issues/1616). `sep='\\n'` still works (even on Windows where line ending is actually `\\r\\n`) but `NULL` or `""` are now documented and recommended. Thanks to Dmitriy Selivanov for the pull request and many others for comments. As before, `sep=NA` is not valid; use the default `"auto"` for automatic separator detection. `sep='\\n'` is now deprecated and in future will start to warn when used.
* Single-column input with blank lines is now valid and the blank lines are significant (representing `NA`). The blank lines are significant even at the very end, which may be surprising on first glance. The change is so that `fread(fwrite(DT))==DT` for single-column inputs containing `NA` which are written as blank. There is no change when `ncol>1`; i.e., input stops with detailed warning at the first blank line, because a blank line when `ncol>1` is invalid input due to no separators being present. Thanks to @skanskan, Michael Chirico, @franknarf1 and Pasha for the testing and discussions, [#2106](https://github.com/Rdatatable/data.table/issues/2106).
* Too few column names are now auto filled with default column names, with warning, [#1625](https://github.com/Rdatatable/data.table/issues/1625). If there is just one missing column name it is guessed to be for the first column (row names or an index), otherwise the column names are filled at the end. Similarly, too many column names now automatically sets `fill=TRUE`, with warning.
* `skip=` and `nrow=` are more reliable and no longer affected by invalid lines outside the range specified. Thanks to Ziyad Saeed and Kyle Chung for reporting, [#1267](https://github.com/Rdatatable/data.table/issues/1267). Tests added.
* `skip=` and `nrow=` are more reliable and are no longer affected by invalid lines outside the range specified. Thanks to Ziyad Saeed and Kyle Chung for reporting, [#1267](https://github.com/Rdatatable/data.table/issues/1267).
* Ram disk (`/dev/shm`) is no longer used for the output of system command input. Although faster when it worked, it was causing too many device full errors; e.g., [#1139](https://github.com/Rdatatable/data.table/issues/1139) and [zUMIs/19](https://github.com/sdparekh/zUMIs/issues/19). Thanks to Kyle Chung for reporting. Standard `tempdir()` is now used. If you wish to use ram disk, set TEMPDIR to `/dev/shm`; see `?tempdir`.
* Detecting whether a very long input string is a file name or data is now much faster, [#2531](https://github.com/Rdatatable/data.table/issues/2531). Many thanks to @javrucebo for the detailed report, benchmarks and suggestions.
* A column of `TRUE/FALSE`s is ok, as well as `True/False`s and `true/false`s, but mixing styles (e.g. `TRUE/false`) is not and will be read as type `character`.
* Many thanks to @yaakovfeldman, Guillermo Ponce, Arun Srinivasan, Hugh Parsonage, Mark Klik, Pasha Stetsenko, Mahyar K, Tom Crockett, @cnoelke, @qinjs, @etienne-s, Mark Danese, Avraham Adler, @franknarf1, @MichaelChirico, @tdhock, Luke Tierney for testing dev and reporting these regressions before release to CRAN: #2070, #2073, #2087, #2091, #2107, #2118, #2092, #1888, #2123, #2167, #2194, #2238, #2228, #1464, #2201, #2287, #2299, #2285, #2251, #2347, #2222, #2352, #2246, #2370, #2371, #2404, #2196, #2322, #2453, #2446, #2464, #2457, #1895, #2481, #2499, #2516, #2520, #2512, #2523, #2542, #2526, #2518, #2515, #1671, #2267, #2561, #2625, #2265, #2548, #2535

2. `fwrite()`:
* empty strings are now always quoted (`,"",`) to distinguish them from `NA` which by default is still empty (`,,`) but can be changed using `na=` as before. If `na=` is provided and `quote=` is the default `'auto'` then `quote=` is set to `TRUE` so that if the `na=` value occurs in the data, it can be distinguished from `NA`. Thanks to Ethan Welty for the request [#2214](https://github.com/Rdatatable/data.table/issues/2214) and Pasha for the code change and tests, [#2215](https://github.com/Rdatatable/data.table/issues/2215).
* `logicalAsInt` has been renamed `logical01` and the default changed from `FALSE` to `TRUE`, both changes for consistency with `fread` (see item above). The old name `logicalAsInt` continues to work but is now deprecated. The previous default can easily be restored without any code changes by setting `options("datatable.logical01" = FALSE)`.
* When `DT` is a single column, `na=` is now set to `"NA"` to avoid blank lines in the output, [#2106](https://github.com/Rdatatable/data.table/issues/2106). Thanks to @skanskan, Michael Chirico and @franknarf1 for the testing and ideas.
* `logical01` has been added and the old name `logicalAsInt` retained. Pease move to the new name when convenient for you. The old argument name (`logicalAsInt`) will slowly be deprecated over the next few years. The default is unchanged: `FALSE`, so `logical` is still written as `"TRUE"`/`"FALSE"` in full by default. We intend to change the default's default in future to `TRUE`; see the notice at the top of these release notes.

3. Added helpful message when subsetting by a logical column without wrapping it in parentheses, [#1844](https://github.com/Rdatatable/data.table/issues/1844). Thanks @dracodoc for the suggestion and @MichaelChirico for the PR.

Expand Down Expand Up @@ -158,6 +177,8 @@ Thanks to @sritchie73 for reporting and fixing [PR#2631](https://github.com/Rdat

35. `CJ()` now fails with proper error message when results would exceed max integer, [#2636](https://github.com/Rdatatable/data.table/issues/2636).

36. `NA` in character columns now display as `<NA>` just like base R to distinguish from `""` and `"NA"`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is nice, no need for quote = TRUE argument by default 👍


#### NOTES

0. The license has been changed from GPL to MPL (Mozilla Public License). All contributors were consulted and approved. [PR#2456](https://github.com/Rdatatable/data.table/pull/2456) details the reasons for the change.
Expand Down
2 changes: 1 addition & 1 deletion R/fread.R
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@

fread <- function(input="",file,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings="NA",stringsAsFactors=FALSE,verbose=getOption("datatable.verbose"),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, showProgress=interactive(),data.table=getOption("datatable.fread.datatable"),nThread=getDTthreads(),logical01=TRUE,autostart=NA)
fread <- function(input="",file,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings=getOption("datatable.na.strings","NA"),stringsAsFactors=FALSE,verbose=getOption("datatable.verbose",FALSE),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64","integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, showProgress=interactive(), data.table=getOption("datatable.fread.datatable",TRUE), nThread=getDTthreads(), logical01=getOption("datatable.logical01", FALSE), autostart=NA)
{
if (is.null(sep)) sep="\n" # C level knows that \n means \r\n on Windows, for example
else {
Expand Down
4 changes: 2 additions & 2 deletions R/fwrite.R
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
fwrite <- function(x, file="", append=FALSE, quote="auto",
sep=",", sep2=c("","|",""), eol=if (.Platform$OS.type=="windows") "\r\n" else "\n",
na=if (length(x)>1L) "" else "NA", dec=".", row.names=FALSE, col.names=TRUE,
na="", dec=".", row.names=FALSE, col.names=TRUE,
qmethod=c("double","escape"),
logical01=getOption("datatable.logical01", TRUE),
logical01=getOption("datatable.logical01", FALSE), # due to change to TRUE; see NEWS
logicalAsInt=logical01,
dateTimeAs = c("ISO","squash","epoch","write.csv"),
buffMB=8, nThread=getDTthreads(),
Expand Down
9 changes: 4 additions & 5 deletions R/onLoad.R
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,9 @@
lockBinding("rbind.data.frame",baseenv())
}
# Set options for the speed boost in v1.8.0 by avoiding 'default' arg of getOption(,default=)
# TODO: submit improvement to .Internal(getOption(x)) in base::getOption to return NULL when option not set, to avoid (relatively slow) 'x %in% names(options())' there.
# In fread and fwrite we have moved back to using getOption's default argument since it is unlikely fread and fread will be called in a loop many times, plus they
# are relatively heavy functions where the overhead in getOption() would not be noticed. It's only really [.data.table where getOption default bit.
# Improvement to base::getOption() now submitted (100x; 5s down to 0.05s): https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17394
opts = c("datatable.verbose"="FALSE", # datatable.<argument name>
"datatable.nomatch"="NA_integer_", # datatable.<argument name>
"datatable.optimize"="Inf", # datatable.<argument name>
Expand All @@ -43,13 +45,10 @@
"datatable.dfdispatchwarn"="TRUE", # not a function argument
"datatable.warnredundantby"="TRUE", # not a function argument
"datatable.alloccol"="1024L", # argument 'n' of alloc.col. Over-allocate 1024 spare column slots
"datatable.integer64"="'integer64'", # datatable.<argument name> integer64|double|character
"datatable.auto.index"="TRUE", # DT[col=="val"] to auto add index so 2nd time faster
"datatable.use.index"="TRUE", # global switch to address #1422
"datatable.fread.datatable"="TRUE",
"datatable.prettyprint.char" = NULL, # FR #1091
"datatable.old.unique.by.key" = "FALSE", # TODO: change warnings in duplicated.R to error on or after Jan 2019 then remove in Jan 2020.
"datatable.logical01" = "TRUE" # fwrite/fread to revert to FALSE. TODO: warn in next release and remove after 1 year
"datatable.old.unique.by.key" = "FALSE" # TODO: change warnings in duplicated.R to error on or after Jan 2019 then remove in Jan 2020.
)
for (i in setdiff(names(opts),names(options()))) {
eval(parse(text=paste("options(",i,"=",opts[i],")",sep="")))
Expand Down
2 changes: 1 addition & 1 deletion R/print.data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,7 @@ print.data.table <- function(x, topn=getOption("datatable.print.topn"),
rn = seq_len(nrow(x))
printdots = FALSE
}
toprint=format.data.table(toprint, ...)
toprint=format.data.table(toprint, na.encode=FALSE, ...) # na.encode=FALSE so that NA in character cols print as <NA>

if ((!"bit64" %chin% loadedNamespaces()) && any(sapply(x,inherits,"integer64"))) require_bit64()
# When we depend on R 3.2.0 (Apr 2015) we can use isNamespaceLoaded() added then, instead of %chin% above
Expand Down
2 changes: 2 additions & 0 deletions R/utils.R
Original file line number Diff line number Diff line change
Expand Up @@ -69,3 +69,5 @@ vapply_1i <- function (x, fun, ..., use.names = TRUE) {
vapply(X = x, FUN = fun, ..., FUN.VALUE = NA_integer_, USE.NAMES = use.names)
}

more = function(f) system(paste("more",f)) # nocov (just a dev helper)

Loading