diff --git a/NEWS.md b/NEWS.md
index 7b282db3d..96bc1137b 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -3,11 +3,11 @@
 #### POTENTIALLY BREAKING CHANGES
- 1. By default all columns are now used by `unique()`, `duplicated()` and `uniqueN()` data.table methods, [#1284](https://github.com/Rdatatable/data.table/issues/1284) and [#1841](https://github.com/Rdatatable/data.table/issues/1841). To restore old behaviour: `options(datatable.old.unique.by.key=TRUE)`. In 1 year this option to restore the old default will be deprecated with warning. In 2 years the option will be removed. Please explicity pass `by=key(DT)` for clarity. Only those relying on the default are affected. 262 CRAN and Bioconductor packages using data.table were checked before release. 9 needed to change and were notified. Any lines of code without test coverage will have been missed by these checks. Any packages not on CRAN or Bioconductor were not checked.
+ 1. By default all columns are now used by `unique()`, `duplicated()` and `uniqueN()` data.table methods, [#1284](https://github.com/Rdatatable/data.table/issues/1284) and [#1841](https://github.com/Rdatatable/data.table/issues/1841). To restore old behaviour: `options(datatable.old.unique.by.key=TRUE)`. In 1 year this option to restore the old default will be deprecated with warning. In 2 years the option will be removed. Please explicitly pass `by=key(DT)` for clarity. Only code that relies on the default is affected. 266 CRAN and Bioconductor packages using data.table were checked before release. 9 needed to change and were notified. Any lines of code without test coverage will have been missed by these checks. Any packages not on CRAN or Bioconductor were not checked.
- 2. Added `setDTthreads()` and `getDTthreads()` to control the threads used in data.table functions that are now parallelized with OpenMP (subsetting, `fwrite()` and `fsort()`) on all architectures including Windows. When data.table is used from the parallel package (e.g.
`mclapply` as done by 3 CRAN and Bioconductor packages) data.table automatically switches down to one thread to avoid a [deadlock/hang](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58378) when OpenMP is used with fork(); [#1745](https://github.com/Rdatatable/data.table/issues/1745), [#1727](https://github.com/Rdatatable/data.table/issues/1727) thanks to Kontstantinos Tsardounis, Ramon Diaz-Uriarte and Jan Gorecki for testing before release and providing reproducible examples. After `parallel::mclapply` has finished, data.table reverts to the prior `getDTthreads()` state. Tests added and therefore will run every day thanks to CRAN.
+ 2. A new column is guaranteed with `:=` even when there are no matches or when its RHS is length 0 (e.g. `integer()`, `numeric()`) but not `NULL`. The NA column is created with the same type as the empty RHS. This is for consistency so that whether a new column is added or not does not depend on whether `i` matched to 1 or more rows or not. See [#759](https://github.com/Rdatatable/data.table/issues/759) for further details and examples.
- 3. When `j` contains no unquoted variable names (whether column names or not), `with=` is now automatically set to `FALSE`. Thus, `DT[,1]`, `DT[,"someCol"]`, `DT[,c("colA","colB")]` and `DT[,100:109]` now work as we all expect them to; i.e., returning columns, [#1188](https://github.com/Rdatatable/data.table/issues/1188), [#1149](https://github.com/Rdatatable/data.table/issues/1149). Since there are no variable names there is no ambiguity as to what was intended. `DT[,colName1:colName2]` no longer needs `with=FALSE` either since that is also unambiguous; it's a single call to the `:` function so `with=TRUE` could make no sense, despite the presence of unquoted variable names. These changes can be made since nobody can be using the existing behaviour of returning back the literal `j` value since that can never be useful. This provides a new ability and should not break any existing code.
Selecting a single column still returns a 1-column data.table (not a vector, unlike `data.frame` by default) for type consistency for code (e.g. within `DT[...][...]` chains) that can sometimes select several columns and sometime one, as has always been the case in data.table and we have no intention to bring back `drop`. In future, `DT[,myCols]` (i.e. a single variable name) will look for `myCols` in calling scope without needing to set `with=FALSE` too, just as a single symbol appearing in `i` does already. The new behaviour can be turned on now by setting the option: `options(datatable.WhenJisSymbolThenCallingScope=TRUE)`. The default is currently `FALSE` to give you time to change your code. In this future state, one way (i.e. `DT[,theColName]`) to select the column as a vector rather than a 1-column data.table will no longer work leaving the two other ways that have always worked remaining (since data.table is still just a `list` after all): `DT[["someCol"]]` and `DT$someCol`. Those base R methods are faster too (when iterated many times) by avoiding the small argument checking overhead inside the more flexible `DT[...]` syntax as has been highlighted in `example(data.table)` for many years. In the next release, `DT[,someCol]` will continue with old current behaviour but start to warn if the new option is not set. Then the default will change to TRUE to nudge you to move forward whilst still retaining a way for you to restore old behaviour for this feature only, whilst still allowing you to benefit from other new features of the latest release without changing your code. Then finally after an estimated 2 years from now, the option will be removed.
+ 3. When `j` contains no unquoted variable names (whether column names or not), `with=` is now automatically set to `FALSE`.
Thus, `DT[,1]`, `DT[,"someCol"]`, `DT[,c("colA","colB")]` and `DT[,100:109]` now work as we all expect them to; i.e., returning columns, [#1188](https://github.com/Rdatatable/data.table/issues/1188), [#1149](https://github.com/Rdatatable/data.table/issues/1149). Since there are no variable names there is no ambiguity as to what was intended. `DT[,colName1:colName2]` no longer needs `with=FALSE` either since that is also unambiguous; it's a single call to the `:` function so `with=TRUE` could make no sense, despite the presence of unquoted variable names. These changes can be made since nobody can be using the existing behaviour of returning back the literal `j` value since that can never be useful. This provides a new ability and should not break any existing code. Selecting a single column still returns a 1-column data.table (not a vector, unlike `data.frame` by default) for type consistency for code (e.g. within `DT[...][...]` chains) that can sometimes select several columns and sometimes one, as has always been the case in data.table and we have no intention to bring back `drop`. In future, `DT[,myCols]` (i.e. a single variable name) will look for `myCols` in calling scope without needing to set `with=FALSE` too, just as a single symbol appearing in `i` does already. The new behaviour can be turned on now by setting the tersely named option: `options(datatable.WhenJisSymbolThenCallingScope=TRUE)`. The default is currently `FALSE` to give you time to change your code. In this future state, one way (i.e. `DT[,theColName]`) to select the column as a vector rather than a 1-column data.table will no longer work leaving the two other ways that have always worked remaining (since data.table is still just a `list` after all): `DT[["someCol"]]` and `DT$someCol`.
Those base R methods are faster too (when iterated many times) by avoiding the small argument checking overhead inside the more flexible `DT[...]` syntax as has been highlighted in `example(data.table)` for many years. In the next release, `DT[,someCol]` will continue with old current behaviour but start to warn if the new option is not set. Then the default will change to TRUE to nudge you to move forward whilst still retaining a way for you to restore old behaviour for this feature only, whilst still allowing you to benefit from other new features of the latest release without changing your code. Then finally after an estimated 2 years from now, the option will be removed.
 #### NEW FEATURES
@@ -37,9 +37,7 @@
 5. Fast set operations `fsetdiff`, `fintersect`, `funion` and `fsetequal` for data.tables are now implemented, [#547](https://github.com/Rdatatable/data.table/issues/547).
- 6. Updating by reference using `:=`:
- * New column (of all-NA) is guaranteed even when there are no matches, [#759](https://github.com/Rdatatable/data.table/issues/759). Thanks @mchen402.
- * `:=` also adds new columns when its RHS is length `0`. An all-NA column is created of the same type as the empty RHS.
+ 6. Added `setDTthreads()` and `getDTthreads()` to control the threads used in data.table functions that are now parallelized with OpenMP on all architectures including Windows (`fwrite()`, `fsort()` and subsetting). Extra code was required internally to ensure these control data.table only and not other packages using OpenMP. When data.table is used from the parallel package (e.g. `mclapply` as done by 3 CRAN and Bioconductor packages) data.table automatically switches down to one thread to avoid a [deadlock/hang](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58378) when OpenMP is used with fork(); [#1745](https://github.com/Rdatatable/data.table/issues/1745) and [#1727](https://github.com/Rdatatable/data.table/issues/1727).
Thanks to Kontstantinos Tsardounis, Ramon Diaz-Uriarte and Jan Gorecki for testing before release and providing reproducible examples. After `parallel::mclapply` has finished, data.table reverts to the prior `getDTthreads()` state. Tests added which will therefore run every day thanks to CRAN (limited to 2 threads on CRAN which is enough to test).
 7. `GForce` (See ?\`datatable-optimize\` for more):
 * `dt[, .N, by=cols]` is optimised internally as well, [#1251](https://github.com/Rdatatable/data.table/issues/1251).