Merge branch 'release'

fstpackage · Dec 14, 2018 · b80b522
2 parents 5b83d5e + f6ef649
commit b80b522

Showing 75 changed files with 6,702 additions and 3,168 deletions.
diff --git a/.gitignore b/.gitignore
@@ -17,3 +17,4 @@
 *.zip
 .Rproj.user
 *.TMP
+/revdep/*
diff --git a/.travis.yml b/.travis.yml
@@ -11,6 +11,18 @@ os:
   - linux
   - osx
 
+matrix:
+  exclude:
+  - r: devel
+    os: osx
+
+before_install:
+  - if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew install llvm &&
+    export PATH="/usr/local/opt/llvm/bin:$PATH" &&
+    export LDFLAGS="-L/usr/local/opt/llvm/lib" &&
+    export CPPFLAGS="-I/usr/local/opt/llvm/include" &&
+    export PKG_CXXFLAGS="-O3 -Wall -pedantic"; fi
+
 r_packages:
   - covr
   - lintr
@@ -19,13 +31,6 @@ r_packages:
   - testthat
   - data.table
 
-matrix:
-  exclude:
-  - r: release
-    os: osx
-  - r: devel
-    os: osx
-
 addons:
   apt:
     update: true
@@ -35,4 +40,4 @@ after_success:
 
 env:
   global:
-    - PKG_CFLAGS="-O3 -Wall -pedantic"
+    - PKG_CXXFLAGS="-O3 -Wall -pedantic"
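
The osx-only branch of the new `before_install` step above amounts to a small environment-setup routine. A minimal sketch, assuming a hypothetical helper named `configure_llvm_env` (not part of this commit) and leaving out the `brew install llvm` call so that it runs on any machine:

```shell
# Hypothetical helper, not part of the commit: mirrors the osx-only exports
# from before_install, without the `brew install llvm` step.
configure_llvm_env() {
  # $1 is the OS name, standing in for $TRAVIS_OS_NAME
  llvm_prefix="/usr/local/opt/llvm"   # Homebrew prefix taken from the diff
  if [ "$1" = "osx" ]; then
    export PATH="$llvm_prefix/bin:$PATH"
    export LDFLAGS="-L$llvm_prefix/lib"
    export CPPFLAGS="-I$llvm_prefix/include"
    export PKG_CXXFLAGS="-O3 -Wall -pedantic"
  fi
}

# Fall back to "linux" when run outside Travis (an assumption of this sketch)
configure_llvm_env "${TRAVIS_OS_NAME:-linux}"
```

On Linux the function leaves the environment untouched; there the compiler flags come from the `env: global` entry, which this commit switches from `PKG_CFLAGS` to `PKG_CXXFLAGS`.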
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -5,8 +5,8 @@ Description: Multithreaded serialization of compressed data frames using the
     'fst' format. The 'fst' format allows for random access of stored data and
     compression with the LZ4 and ZSTD compressors created by Yann Collet. The ZSTD
     compression library is owned by Facebook Inc.
-Version: 0.8.8
-Date: 2018-06-06
+Version: 0.8.10
+Date: 2018-12-13
 Authors@R: c(
     person("Mark", "Klik", email = "markklik@gmail.com", role = c("aut", "cre", "cph")),
     person("Yann", "Collet", role = c("ctb", "cph"),
@@ -19,7 +19,7 @@ Imports:
     Rcpp
 LinkingTo: Rcpp
 SystemRequirements: little-endian platform
-RoxygenNote: 6.0.1
+RoxygenNote: 6.1.1
 Suggests:
     testthat,
     bit64,

diff --git a/NEWS.md b/NEWS.md
@@ -1,4 +1,30 @@
 
+# fst 0.8.9 (in development)
+
+Version 0.8.10 of the `fst` package is an intermediate release designed to update the incorporated C++ libraries
+to their latest versions and to fix reported issues. Also, per request of CRAN maintainers, the OpenMP build option was moved to the correct flag in the Makevars file, resolving a warning in the package check.
+
+## Library updates
+
+* Library `fstlib` updated to version 0.1.0
+
+* Library `ZSTD` updated to version 1.3.7
+
+* Library `LZ4` updated to version 1.8.3
+
+## Bugs solved
+
+* Method `compress_fst()` can handle vectors with sizes larger than 4 GB (issue #176, thanks @bwlewis for reporting)
+
+* A _fst_ file is correctly read from a subfolder on a network drive where the user does not have access to the top-level folder (issues #136 and #175, thanks @xiaodaigh for reporting).
+
+* The suggested data.table dependency is now properly escaped (issue #181, thanks @jangorecki for the pull request)
+
+## Documentation
+
+* Documentation updates (issue #158, thanks @HughParsonage for submitting)
+
+
 # fst 0.8.8 (June 6, 2018)
 
 Version 0.8.8 of the `fst` package is an intermediate release designed to fix valgrind warnings reported on CRAN builds (per request of CRAN maintainers). These warnings were due to `fst` writing uninitialized data buffers to file, which was done to maximize speed. To fix these warnings (and for safety), all memory blocks are now initialized to zero before being written to disk.
@@ -25,6 +51,7 @@ Version 0.8.6 of the `fst` package brings clearer printing of `fst_table` object
 
 * Improved documentation on background threads during `write_fst()` and `read_fst()` (issue #121, thanks @krlmlr for suggestions and discussion)
 
+
 # fst 0.8.4
 
 The v0.8.4 release brings a `data.frame` interface to the `fst` package. Column and row selection can now be done directly from the `[` operator. In addition, it fixes some issues and prepares the package for the next build toolchain of CRAN.

diff --git a/R/RcppExports.R b/R/RcppExports.R
@@ -1,6 +1,10 @@
 # Generated by using Rcpp::compileAttributes() -> do not edit by hand
 # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
 
+fstlib_version <- function() {
+    .Call(`_fst_fstlib_version`)
+}
+
 fststore <- function(fileName, table, compression, uniformEncoding) {
     .Call(`_fst_fststore`, fileName, table, compression, uniformEncoding)
 }

diff --git a/R/fst.R b/R/fst.R
@@ -41,8 +41,8 @@
 #' If `uniform.encoding` is set to `FALSE`, no such assumption will be made and all elements will be converted
 #' to the same encoding. The latter is a relatively expensive operation and will reduce write performance for
 #' character columns.
-#' @return `read_fst` returns a data frame with the selected columns and rows. `read_fst`
-#' invisibly returns `x` (so you can use this function in a pipeline).
+#' @return `read_fst` returns a data frame with the selected columns and rows. `write_fst`
+#' writes `x` to a `fst` file and invisibly returns `x` (so you can use this function in a pipeline).
 #' @examples
 #' # Sample dataset
 #' x <- data.frame(A = 1:10000, B = sample(c(TRUE, FALSE, NA), 10000, replace = TRUE))
@@ -101,7 +101,7 @@ metadata_fst <- function(path, old_format = FALSE) {
     stop("A logical value is expected for parameter 'old_format'.")
   }
 
-  full_path <- normalizePath(path, mustWork = TRUE)
+  full_path <- normalizePath(path, mustWork = FALSE)
 
   metadata <- fstmetadata(full_path, old_format)
 
@@ -154,7 +154,8 @@ print.fstmetadata <- function(x, ...) {
 #' @param from Read data starting from this row number.
 #' @param to Read data up until this row number. The default is to read to the last row of the stored dataset.
 #' @param as.data.table If TRUE, the result will be returned as a \code{data.table} object. Any keys set on
-#' dataset \code{x} before writing will be retained. This allows for storage of sorted datasets.
+#' dataset \code{x} before writing will be retained. This allows for storage of sorted datasets. This option
+#' requires \code{data.table} package to be installed.
 #' @param old_format use TRUE to read fst files generated with a fst package version lower than v0.8.0
 #'
 #' @export
@@ -193,29 +194,26 @@ read_fst <- function(path, columns = NULL, from = 1, to = NULL, as.data.table =
 
 
   if (as.data.table) {
-    if (!requireNamespace("data.table")) {
+    if (!requireNamespace("data.table", quietly = TRUE)) {
       stop("Please install package data.table when using as.data.table = TRUE")
     }
 
     keyNames <- res$keyNames
     res <- data.table::setDT(res$resTable)  # nolint
-    if (length(keyNames) > 0 ) data.table::setattr(res, "sorted", keyNames)
+    if (length(keyNames) > 0) data.table::setattr(res, "sorted", keyNames)
     return(res)
   }
 
-  # use setters from data.table to improve performance
-  if (requireNamespace("data.table")) {
-
-    data.table::setattr(res$resTable, "class", "data.frame")
-    data.table::setattr(res$resTable, "row.names", 1:length(res$resTable[[1]]))
-
-    return(res$resTable)
-  }
-
   res_table <- res$resTable
 
-  class(res_table) <- "data.frame"
-  attr(res_table, "row.names") <- 1:length(res$resTable[[1]])
+  # use setters from data.table to improve performance
+  if (requireNamespace("data.table", quietly = TRUE)) {
+    data.table::setattr(res_table, "class", "data.frame")
+    data.table::setattr(res_table, "row.names", 1:length(res_table[[1L]]))
+  } else {
+    class(res_table) <- "data.frame"
+    attr(res_table, "row.names") <- 1:length(res_table[[1L]])
+  }
 
   res_table
 }

diff --git a/README-multi-threading-1.png b/README-multi-threading-1.png
diff --git a/README-speed-bench-1.png b/README-speed-bench-1.png
diff --git a/README.Rmd b/README.Rmd
@@ -138,7 +138,7 @@ The _fst_ file format provides full random access to stored datasets. You can re
   df_subset <- read.fst("dataset.fst", c("Logical", "Factor"), from = 2000, to = 5000)
 ```
 
-This reads rows 1000 to 5000 from columns _Logical_ and _Factor_ without actually touching any other data in the stored file. That means that a subset can be read from file **without reading the complete file first**. This is different from, say, _readRDS_ or _read\_feather_ where you have to read the complete file or column before you can make a subset.
+This reads rows 2000 to 5000 from columns _Logical_ and _Factor_ without actually touching any other data in the stored file. That means that a subset can be read from file **without reading the complete file first**. This is different from, say, _readRDS_ or _read\_feather_ where you have to read the complete file or column before you can make a subset.
 
 ## Compression
 

diff --git a/README.md b/README.md
@@ -1,18 +1,33 @@
 
 <!-- README.md is generated from README.Rmd. Please edit that file -->
-<img src="logo.png" align="right" />
-
-[![Linux/OSX Build Status](https://travis-ci.org/fstpackage/fst.svg?branch=develop)](https://travis-ci.org/fstpackage/fst) [![WIndows Build status](https://ci.appveyor.com/api/projects/status/6g6kp8onpb26jhnm/branch/develop?svg=true)](https://ci.appveyor.com/project/fstpackage/fst/branch/develop) [![License: AGPL v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0) [![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/fst)](https://cran.r-project.org/package=fst) [![codecov](https://codecov.io/gh/fstpackage/fst/branch/develop/graph/badge.svg)](https://codecov.io/gh/fstpackage/fst) [![downloads](http://cranlogs.r-pkg.org/badges/fst)](http://cran.rstudio.com/web/packages/fst/index.html)
-
-Overview
---------
 
-The [*fst* package](https://github.com/fstpackage/fst) for R provides a fast, easy and flexible way to serialize data frames. With access speeds of multiple GB/s, *fst* is specifically designed to unlock the potential of high speed solid state disks that can be found in most modern computers. Data frames stored in the *fst* format have full random access, both in column and rows.
+<img src="logo.png" align="right" />
 
-The figure below compares the read and write performance of the *fst* package to various alternatives.
+[![Linux/OSX Build
+Status](https://travis-ci.org/fstpackage/fst.svg?branch=develop)](https://travis-ci.org/fstpackage/fst)
+[![WIndows Build
+status](https://ci.appveyor.com/api/projects/status/6g6kp8onpb26jhnm/branch/develop?svg=true)](https://ci.appveyor.com/project/fstpackage/fst/branch/develop)
+[![License: AGPL
+v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)
+[![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/fst)](https://cran.r-project.org/package=fst)
+[![codecov](https://codecov.io/gh/fstpackage/fst/branch/develop/graph/badge.svg)](https://codecov.io/gh/fstpackage/fst)
+[![downloads](http://cranlogs.r-pkg.org/badges/fst)](http://cran.rstudio.com/web/packages/fst/index.html)
+
+## Overview
+
+The [*fst* package](https://github.com/fstpackage/fst) for R provides a
+fast, easy and flexible way to serialize data frames. With access speeds
+of multiple GB/s, *fst* is specifically designed to unlock the potential
+of high speed solid state disks that can be found in most modern
+computers. Data frames stored in the *fst* format have full random
+access, both in column and rows.
+
+The figure below compares the read and write performance of the *fst*
+package to various
+alternatives.
 
 | Method         | Format  | Time (ms) | Size (MB) | Speed (MB/s) | N       |
-|:---------------|:--------|:----------|:----------|:-------------|:--------|
+| :------------- | :------ | :-------- | :-------- | :----------- | :------ |
 | readRDS        | bin     | 1577      | 1000      | 633          | 112     |
 | saveRDS        | bin     | 2042      | 1000      | 489          | 112     |
 | fread          | csv     | 2925      | 1038      | 410          | 232     |
@@ -22,21 +37,34 @@ The figure below compares the read and write performance of the *fst* package to
 | **read\_fst**  | **bin** | **457**   | **303**   | **2184**     | **282** |
 | **write\_fst** | **bin** | **314**   | **303**   | **3180**     | **291** |
 
-These benchmarks were performed on a laptop (i7 4710HQ @2.5 GHz) with a reasonably fast SSD (M.2 Samsung SM951) using the dataset defined below. Parameter *Speed* was calculated by dividing the in-memory size of the data frame by the measured time. These results are also visualized in the following graph:
+These benchmarks were performed on a laptop (i7 4710HQ @2.5 GHz) with a
+reasonably fast SSD (M.2 Samsung SM951) using the dataset defined below.
+Parameter *Speed* was calculated by dividing the in-memory size of the
+data frame by the measured time. These results are also visualized in
+the following graph:
 
-![](README-speed-bench-1.png)
+![](README-speed-bench-1.png)<!-- -->
 
-As can be seen from the figure, the measured speeds for the *fst* package are very high and even top the maximum drive speed of the SSD used. The package accomplishes this by an effective combination of multi-threading and compression. The on-disk file sizes of *fst* files are also much smaller than that of the other formats tested. This is an added benefit of *fst*'s use of type-specific compressors on each stored column.
+As can be seen from the figure, the measured speeds for the *fst*
+package are very high and even top the maximum drive speed of the SSD
+used. The package accomplishes this by an effective combination of
+multi-threading and compression. The on-disk file sizes of *fst* files
+are also much smaller than that of the other formats tested. This is an
+added benefit of *fst*’s use of type-specific compressors on each stored
+column.
 
-In addition to methods for data frame serialization, *fst* also provides methods for multi-threaded in-memory compression with the popular LZ4 and ZSTD compressors and an extremely fast multi-threaded hasher.
+In addition to methods for data frame serialization, *fst* also provides
+methods for multi-threaded in-memory compression with the popular LZ4
+and ZSTD compressors and an extremely fast multi-threaded hasher.
 
-Multi-threading
----------------
+## Multi-threading
 
-The *fst* package relies heavily on multi-threading to boost the read- and write speed of data frames. To maximize throughput, *fst* compresses and decompresses data *in the background* and tries to keep the disk busy writing and reading data at the same time.
+The *fst* package relies heavily on multi-threading to boost the read-
+and write speed of data frames. To maximize throughput, *fst* compresses
+and decompresses data *in the background* and tries to keep the disk
+busy writing and reading data at the same time.
 
-Installation
-------------
+## Installation
 
 The easiest way to install the package is from CRAN:
 
@@ -51,10 +79,11 @@ You can also use the development version from GitHub:
 devtools::install_github("fstPackage/fst", ref = "develop")
 ```
 
-Basic usage
------------
+## Basic usage
 
-Using *fst* is simple. Data can be stored and retrieved using methods *write\_fst* and *read\_fst*:
+Using *fst* is simple. Data can be stored and retrieved using methods
+*write\_fst* and
+*read\_fst*:
 
 ``` r
 # Generate some random data frame with 10 million rows and various column types
@@ -74,37 +103,66 @@ df <- data.frame(
   df <- read.fst("dataset.fst")
 ```
 
-*Note: the dataset defined in this example code was also used to obtain the benchmark results shown in the introduction.*
+*Note: the dataset defined in this example code was also used to obtain
+the benchmark results shown in the introduction.*
 
-Random access
--------------
+## Random access
 
-The *fst* file format provides full random access to stored datasets. You can retrieve a selection of columns and rows with:
+The *fst* file format provides full random access to stored datasets.
+You can retrieve a selection of columns and rows
+with:
 
 ``` r
   df_subset <- read.fst("dataset.fst", c("Logical", "Factor"), from = 2000, to = 5000)
 ```
 
-This reads rows 1000 to 5000 from columns *Logical* and *Factor* without actually touching any other data in the stored file. That means that a subset can be read from file **without reading the complete file first**. This is different from, say, *readRDS* or *read\_feather* where you have to read the complete file or column before you can make a subset.
+This reads rows 2000 to 5000 from columns *Logical* and *Factor* without
+actually touching any other data in the stored file. That means that a
+subset can be read from file **without reading the complete file
+first**. This is different from, say, *readRDS* or *read\_feather* where
+you have to read the complete file or column before you can make a
+subset.
 
-Compression
------------
+## Compression
 
-For compression the excellent and speedy [LZ4](https://github.com/lz4/lz4) and [ZSTD](https://github.com/facebook/zstd) compression algorithms are used. These compressors (in combination with type-specific bit filters), enable *fst* to achieve high compression speeds at reasonable compression factors. The compression factor can be tuned from 0 (minimum) to 100 (maximum):
+For compression the excellent and speedy
+[LZ4](https://github.com/lz4/lz4) and
+[ZSTD](https://github.com/facebook/zstd) compression algorithms are
+used. These compressors (in combination with type-specific bit filters),
+enable *fst* to achieve high compression speeds at reasonable
+compression factors. The compression factor can be tuned from 0
+(minimum) to 100 (maximum):
 
 ``` r
 write.fst(df, "dataset.fst", 100)  # use maximum compression
 ```
 
-Compression reduces the size of the *fst* file that holds your data. But because the (de-)compression is done *on background threads*, it can increase the total read- and write speed as well. The graph below shows how the use of multiple threads enhances the read and write speed of our sample dataset.
-
-![](README-multi-threading-1.png)
-
-The *csv* format used by the *fread* and *fwrite* methods of package *data.table* is actually a human-readable text format and not a binary format. Normally, binary formats would be much faster than the *csv* format, because *csv* takes more space on disk, is row based, uncompressed and needs to be parsed into a computer-native format to have any meaning. So any serializer that's working on *csv* has an enormous disadvantage as compared to binary formats. Yet, the results show that *data.table* is on par with binary formats and when more threads are used, it can even be faster. Because of this impressive performance, it was included in the graph for comparison.
-
-Bindings in other languages
----------------------------
-
-**Julia**: [**`fstformat.jl`**](https://github.com/xiaodaigh/fstformat.jl) A naive Julia binding using RCall.jl
-
-> **Note to users**: From CRAN release v0.8.0, the *fst* format is stable and backwards compatible. That means that all *fst* files generated with package v0.8.0 or later can be read by future versions of the package.
+Compression reduces the size of the *fst* file that holds your data. But
+because the (de-)compression is done *on background threads*, it can
+increase the total read- and write speed as well. The graph below shows
+how the use of multiple threads enhances the read and write speed of our
+sample dataset.
+
+![](README-multi-threading-1.png)<!-- -->
+
+The *csv* format used by the *fread* and *fwrite* methods of package
+*data.table* is actually a human-readable text format and not a binary
+format. Normally, binary formats would be much faster than the *csv*
+format, because *csv* takes more space on disk, is row based,
+uncompressed and needs to be parsed into a computer-native format to
+have any meaning. So any serializer that’s working on *csv* has an
+enormous disadvantage as compared to binary formats. Yet, the results
+show that *data.table* is on par with binary formats and when more
+threads are used, it can even be faster. Because of this impressive
+performance, it was included in the graph for comparison.
+
+## Bindings in other languages
+
+**Julia**:
+[**`fstformat.jl`**](https://github.com/xiaodaigh/fstformat.jl) A naive
+Julia binding using RCall.jl
+
+> **Note to users**: From CRAN release v0.8.0, the *fst* format is
+> stable and backwards compatible. That means that all *fst* files
+> generated with package v0.8.0 or later can be read by future versions
+> of the package.