Commit

Merge branch 'release'
MarcusKlik committed Dec 14, 2018
2 parents 5b83d5e + f6ef649 commit b80b522
Showing 75 changed files with 6,702 additions and 3,168 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -17,3 +17,4 @@
*.zip
.Rproj.user
*.TMP
/revdep/*
21 changes: 13 additions & 8 deletions .travis.yml
@@ -11,6 +11,18 @@ os:
- linux
- osx

matrix:
exclude:
- r: devel
os: osx

before_install:
- if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew install llvm &&
export PATH="/usr/local/opt/llvm/bin:$PATH" &&
export LDFLAGS="-L/usr/local/opt/llvm/lib" &&
export CPPFLAGS="-I/usr/local/opt/llvm/include" &&
export PKG_CXXFLAGS="-O3 -Wall -pedantic"; fi

r_packages:
- covr
- lintr
@@ -19,13 +31,6 @@ r_packages:
- testthat
- data.table

matrix:
exclude:
- r: release
os: osx
- r: devel
os: osx

addons:
apt:
update: true
@@ -35,4 +40,4 @@ after_success:

env:
global:
- PKG_CFLAGS="-O3 -Wall -pedantic"
- PKG_CXXFLAGS="-O3 -Wall -pedantic"
6 changes: 3 additions & 3 deletions DESCRIPTION
@@ -5,8 +5,8 @@ Description: Multithreaded serialization of compressed data frames using the
'fst' format. The 'fst' format allows for random access of stored data and
compression with the LZ4 and ZSTD compressors created by Yann Collet. The ZSTD
compression library is owned by Facebook Inc.
Version: 0.8.8
Date: 2018-06-06
Version: 0.8.10
Date: 2018-12-13
Authors@R: c(
person("Mark", "Klik", email = "markklik@gmail.com", role = c("aut", "cre", "cph")),
person("Yann", "Collet", role = c("ctb", "cph"),
@@ -19,7 +19,7 @@ Imports:
Rcpp
LinkingTo: Rcpp
SystemRequirements: little-endian platform
RoxygenNote: 6.0.1
RoxygenNote: 6.1.1
Suggests:
testthat,
bit64,
27 changes: 27 additions & 0 deletions NEWS.md
@@ -1,4 +1,30 @@

# fst 0.8.9 (in development)

Version 0.8.10 of the `fst` package is an intermediate release designed to update the incorporated C++ libraries
to their latest versions and to fix reported issues. Also, per request of CRAN maintainers, the OpenMP build option was moved to the correct flag in the Makevars file, resolving a warning in the package check.

## Library updates

* Library `fstlib` updated to version 0.1.0

* Library `ZSTD` updated to version 1.3.7

* Library `LZ4` updated to version 1.8.3

## Bugs solved

* Method `compress_fst()` can handle vectors with sizes larger than 4 GB (issue #176, thanks @bwlewis for reporting)

* A _fst_ file is correctly read from a subfolder on a network drive where the user does not have access to the top-level folder (issues #136 and #175, thanks @xiaodaigh for reporting).

* The suggested data.table dependency is now properly escaped (issue #181, thanks @jangorecki for the pull request)

## Documentation

* Documentation updates (issue #158, thanks @HughParsonage for submitting)


# fst 0.8.8 (June 6, 2018)

Version 0.8.8 of the `fst` package is an intermediate release designed to fix valgrind warnings reported on CRAN builds (per request of CRAN maintainers). These warnings were due to `fst` writing uninitialized data buffers to file, which was done to maximize speed. To fix these warnings (and for safety), all memory blocks are now initialized to zero before being written to disk.
@@ -25,6 +51,7 @@ Version 0.8.6 of the `fst` package brings clearer printing of `fst_table` object

* Improved documentation on background threads during `write_fst()` and `read_fst()` (issue #121, thanks @krlmlr for suggestions and discussion)


# fst 0.8.4

The v0.8.4 release brings a `data.frame` interface to the `fst` package. Column and row selection can now be done directly from the `[` operator. In addition, it fixes some issues and prepares the package for the next build toolchain of CRAN.
4 changes: 4 additions & 0 deletions R/RcppExports.R
@@ -1,6 +1,10 @@
# Generated by using Rcpp::compileAttributes() -> do not edit by hand
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

fstlib_version <- function() {
.Call(`_fst_fstlib_version`)
}

fststore <- function(fileName, table, compression, uniformEncoding) {
.Call(`_fst_fststore`, fileName, table, compression, uniformEncoding)
}
32 changes: 15 additions & 17 deletions R/fst.R
@@ -41,8 +41,8 @@
#' If `uniform.encoding` is set to `FALSE`, no such assumption will be made and all elements will be converted
#' to the same encoding. The latter is a relatively expensive operation and will reduce write performance for
#' character columns.
#' @return `read_fst` returns a data frame with the selected columns and rows. `read_fst`
#' invisibly returns `x` (so you can use this function in a pipeline).
#' @return `read_fst` returns a data frame with the selected columns and rows. `write_fst`
#' writes `x` to a `fst` file and invisibly returns `x` (so you can use this function in a pipeline).
#' @examples
#' # Sample dataset
#' x <- data.frame(A = 1:10000, B = sample(c(TRUE, FALSE, NA), 10000, replace = TRUE))
@@ -101,7 +101,7 @@ metadata_fst <- function(path, old_format = FALSE) {
stop("A logical value is expected for parameter 'old_format'.")
}

full_path <- normalizePath(path, mustWork = TRUE)
full_path <- normalizePath(path, mustWork = FALSE)

metadata <- fstmetadata(full_path, old_format)
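The effect of switching `mustWork` can be seen in isolation with a small base R sketch. This is only an illustration of `normalizePath()` behaviour; that the change is the fix for the network-drive issues listed in NEWS.md is an assumption, and the variable names below are made up for the example.

```r
# A path that does not exist: tempfile() only generates a name
missing_path <- tempfile(fileext = ".fst")

# mustWork = TRUE: normalizePath() itself raises an error
strict <- tryCatch(
  normalizePath(missing_path, mustWork = TRUE),
  error = function(e) "error"
)

# mustWork = FALSE: a best-effort path is returned without error or warning,
# so the code that opens the file decides how to report a failure
lenient <- normalizePath(missing_path, mustWork = FALSE)
```

With `mustWork = FALSE`, a path that R cannot fully resolve is passed on unchanged rather than rejected up front.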

@@ -154,7 +154,8 @@ print.fstmetadata <- function(x, ...) {
#' @param from Read data starting from this row number.
#' @param to Read data up until this row number. The default is to read to the last row of the stored dataset.
#' @param as.data.table If TRUE, the result will be returned as a \code{data.table} object. Any keys set on
#' dataset \code{x} before writing will be retained. This allows for storage of sorted datasets.
#' dataset \code{x} before writing will be retained. This allows for storage of sorted datasets. This option
#' requires \code{data.table} package to be installed.
#' @param old_format use TRUE to read fst files generated with a fst package version lower than v0.8.0
#'
#' @export
@@ -193,29 +194,26 @@ read_fst <- function(path, columns = NULL, from = 1, to = NULL, as.data.table =


if (as.data.table) {
if (!requireNamespace("data.table")) {
if (!requireNamespace("data.table", quietly = TRUE)) {
stop("Please install package data.table when using as.data.table = TRUE")
}

keyNames <- res$keyNames
res <- data.table::setDT(res$resTable) # nolint
if (length(keyNames) > 0 ) data.table::setattr(res, "sorted", keyNames)
if (length(keyNames) > 0) data.table::setattr(res, "sorted", keyNames)
return(res)
}

# use setters from data.table to improve performance
if (requireNamespace("data.table")) {

data.table::setattr(res$resTable, "class", "data.frame")
data.table::setattr(res$resTable, "row.names", 1:length(res$resTable[[1]]))

return(res$resTable)
}

res_table <- res$resTable

class(res_table) <- "data.frame"
attr(res_table, "row.names") <- 1:length(res$resTable[[1]])
# use setters from data.table to improve performance
if (requireNamespace("data.table", quietly = TRUE)) {
data.table::setattr(res_table, "class", "data.frame")
data.table::setattr(res_table, "row.names", 1:length(res_table[[1L]]))
} else {
class(res_table) <- "data.frame"
attr(res_table, "row.names") <- 1:length(res_table[[1L]])
}

res_table
}
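The optional-dependency pattern in this hunk can be sketched on its own. The following is a minimal illustration, not the package's code: `as_data_frame_sketch` is a hypothetical helper, and its input is assumed to be a plain named list of equal-length columns, similar to `res$resTable` above.

```r
# Hypothetical helper illustrating the pattern; not part of the fst package.
as_data_frame_sketch <- function(cols) {
  n_rows <- length(cols[[1L]])

  if (requireNamespace("data.table", quietly = TRUE)) {
    # setattr() sets attributes by reference, avoiding a copy of the columns
    data.table::setattr(cols, "class", "data.frame")
    data.table::setattr(cols, "row.names", seq_len(n_rows))
  } else {
    # base R fallback: same result, but the list may be copied
    class(cols) <- "data.frame"
    attr(cols, "row.names") <- seq_len(n_rows)
  }

  cols
}

cols <- list(A = 1:3, B = c("x", "y", "z"))
df <- as_data_frame_sketch(cols)
```

Because `setattr()` works by reference, the list passed in is itself modified when data.table is installed. Passing `quietly = TRUE` to `requireNamespace()` suppresses the loading messages that would otherwise appear on every call.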
Binary file modified README-multi-threading-1.png
Binary file modified README-speed-bench-1.png
2 changes: 1 addition & 1 deletion README.Rmd
@@ -138,7 +138,7 @@ The _fst_ file format provides full random access to stored datasets. You can re
df_subset <- read.fst("dataset.fst", c("Logical", "Factor"), from = 2000, to = 5000)
```

This reads rows 1000 to 5000 from columns _Logical_ and _Factor_ without actually touching any other data in the stored file. That means that a subset can be read from file **without reading the complete file first**. This is different from, say, _readRDS_ or _read\_feather_ where you have to read the complete file or column before you can make a subset.
This reads rows 2000 to 5000 from columns _Logical_ and _Factor_ without actually touching any other data in the stored file. That means that a subset can be read from file **without reading the complete file first**. This is different from, say, _readRDS_ or _read\_feather_ where you have to read the complete file or column before you can make a subset.

## Compression

140 changes: 99 additions & 41 deletions README.md
@@ -1,18 +1,33 @@

<!-- README.md is generated from README.Rmd. Please edit that file -->
<img src="logo.png" align="right" />

[![Linux/OSX Build Status](https://travis-ci.org/fstpackage/fst.svg?branch=develop)](https://travis-ci.org/fstpackage/fst) [![WIndows Build status](https://ci.appveyor.com/api/projects/status/6g6kp8onpb26jhnm/branch/develop?svg=true)](https://ci.appveyor.com/project/fstpackage/fst/branch/develop) [![License: AGPL v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0) [![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/fst)](https://cran.r-project.org/package=fst) [![codecov](https://codecov.io/gh/fstpackage/fst/branch/develop/graph/badge.svg)](https://codecov.io/gh/fstpackage/fst) [![downloads](http://cranlogs.r-pkg.org/badges/fst)](http://cran.rstudio.com/web/packages/fst/index.html)

Overview
--------

The [*fst* package](https://github.com/fstpackage/fst) for R provides a fast, easy and flexible way to serialize data frames. With access speeds of multiple GB/s, *fst* is specifically designed to unlock the potential of high speed solid state disks that can be found in most modern computers. Data frames stored in the *fst* format have full random access, both in column and rows.
<img src="logo.png" align="right" />

The figure below compares the read and write performance of the *fst* package to various alternatives.
[![Linux/OSX Build
Status](https://travis-ci.org/fstpackage/fst.svg?branch=develop)](https://travis-ci.org/fstpackage/fst)
[![WIndows Build
status](https://ci.appveyor.com/api/projects/status/6g6kp8onpb26jhnm/branch/develop?svg=true)](https://ci.appveyor.com/project/fstpackage/fst/branch/develop)
[![License: AGPL
v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)
[![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/fst)](https://cran.r-project.org/package=fst)
[![codecov](https://codecov.io/gh/fstpackage/fst/branch/develop/graph/badge.svg)](https://codecov.io/gh/fstpackage/fst)
[![downloads](http://cranlogs.r-pkg.org/badges/fst)](http://cran.rstudio.com/web/packages/fst/index.html)

## Overview

The [*fst* package](https://github.com/fstpackage/fst) for R provides a
fast, easy and flexible way to serialize data frames. With access speeds
of multiple GB/s, *fst* is specifically designed to unlock the potential
of high speed solid state disks that can be found in most modern
computers. Data frames stored in the *fst* format have full random
access, both in column and rows.

The figure below compares the read and write performance of the *fst*
package to various
alternatives.

| Method | Format | Time (ms) | Size (MB) | Speed (MB/s) | N |
|:---------------|:--------|:----------|:----------|:-------------|:--------|
| :------------- | :------ | :-------- | :-------- | :----------- | :------ |
| readRDS | bin | 1577 | 1000 | 633 | 112 |
| saveRDS | bin | 2042 | 1000 | 489 | 112 |
| fread | csv | 2925 | 1038 | 410 | 232 |
@@ -22,21 +37,34 @@ The figure below compares the read and write performance of the *fst* package to
| **read\_fst** | **bin** | **457** | **303** | **2184** | **282** |
| **write\_fst** | **bin** | **314** | **303** | **3180** | **291** |

These benchmarks were performed on a laptop (i7 4710HQ @2.5 GHz) with a reasonably fast SSD (M.2 Samsung SM951) using the dataset defined below. Parameter *Speed* was calculated by dividing the in-memory size of the data frame by the measured time. These results are also visualized in the following graph:
These benchmarks were performed on a laptop (i7 4710HQ @2.5 GHz) with a
reasonably fast SSD (M.2 Samsung SM951) using the dataset defined below.
Parameter *Speed* was calculated by dividing the in-memory size of the
data frame by the measured time. These results are also visualized in
the following graph:

![](README-speed-bench-1.png)
![](README-speed-bench-1.png)<!-- -->

As can be seen from the figure, the measured speeds for the *fst* package are very high and even top the maximum drive speed of the SSD used. The package accomplishes this by an effective combination of multi-threading and compression. The on-disk file sizes of *fst* files are also much smaller than that of the other formats tested. This is an added benefit of *fst*'s use of type-specific compressors on each stored column.
As can be seen from the figure, the measured speeds for the *fst*
package are very high and even top the maximum drive speed of the SSD
used. The package accomplishes this by an effective combination of
multi-threading and compression. The on-disk file sizes of *fst* files
are also much smaller than that of the other formats tested. This is an
added benefit of *fst*’s use of type-specific compressors on each stored
column.

In addition to methods for data frame serialization, *fst* also provides methods for multi-threaded in-memory compression with the popular LZ4 and ZSTD compressors and an extremely fast multi-threaded hasher.
In addition to methods for data frame serialization, *fst* also provides
methods for multi-threaded in-memory compression with the popular LZ4
and ZSTD compressors and an extremely fast multi-threaded hasher.

Multi-threading
---------------
## Multi-threading

The *fst* package relies heavily on multi-threading to boost the read- and write speed of data frames. To maximize throughput, *fst* compresses and decompresses data *in the background* and tries to keep the disk busy writing and reading data at the same time.
The *fst* package relies heavily on multi-threading to boost the read-
and write speed of data frames. To maximize throughput, *fst* compresses
and decompresses data *in the background* and tries to keep the disk
busy writing and reading data at the same time.

Installation
------------
## Installation

The easiest way to install the package is from CRAN:

@@ -51,10 +79,11 @@ You can also use the development version from GitHub:
devtools::install_github("fstPackage/fst", ref = "develop")
```

Basic usage
-----------
## Basic usage

Using *fst* is simple. Data can be stored and retrieved using methods *write\_fst* and *read\_fst*:
Using *fst* is simple. Data can be stored and retrieved using methods
*write\_fst* and
*read\_fst*:

``` r
# Generate some random data frame with 10 million rows and various column types
@@ -74,37 +103,66 @@ df <- data.frame(
df <- read.fst("dataset.fst")
```

*Note: the dataset defined in this example code was also used to obtain the benchmark results shown in the introduction.*
*Note: the dataset defined in this example code was also used to obtain
the benchmark results shown in the introduction.*

Random access
-------------
## Random access

The *fst* file format provides full random access to stored datasets. You can retrieve a selection of columns and rows with:
The *fst* file format provides full random access to stored datasets.
You can retrieve a selection of columns and rows
with:

``` r
df_subset <- read.fst("dataset.fst", c("Logical", "Factor"), from = 2000, to = 5000)
```

This reads rows 1000 to 5000 from columns *Logical* and *Factor* without actually touching any other data in the stored file. That means that a subset can be read from file **without reading the complete file first**. This is different from, say, *readRDS* or *read\_feather* where you have to read the complete file or column before you can make a subset.
This reads rows 2000 to 5000 from columns *Logical* and *Factor* without
actually touching any other data in the stored file. That means that a
subset can be read from file **without reading the complete file
first**. This is different from, say, *readRDS* or *read\_feather* where
you have to read the complete file or column before you can make a
subset.

Compression
-----------
## Compression

For compression the excellent and speedy [LZ4](https://github.com/lz4/lz4) and [ZSTD](https://github.com/facebook/zstd) compression algorithms are used. These compressors (in combination with type-specific bit filters), enable *fst* to achieve high compression speeds at reasonable compression factors. The compression factor can be tuned from 0 (minimum) to 100 (maximum):
For compression the excellent and speedy
[LZ4](https://github.com/lz4/lz4) and
[ZSTD](https://github.com/facebook/zstd) compression algorithms are
used. These compressors (in combination with type-specific bit filters),
enable *fst* to achieve high compression speeds at reasonable
compression factors. The compression factor can be tuned from 0
(minimum) to 100 (maximum):

``` r
write.fst(df, "dataset.fst", 100) # use maximum compression
```

Compression reduces the size of the *fst* file that holds your data. But because the (de-)compression is done *on background threads*, it can increase the total read- and write speed as well. The graph below shows how the use of multiple threads enhances the read and write speed of our sample dataset.

![](README-multi-threading-1.png)

The *csv* format used by the *fread* and *fwrite* methods of package *data.table* is actually a human-readable text format and not a binary format. Normally, binary formats would be much faster than the *csv* format, because *csv* takes more space on disk, is row based, uncompressed and needs to be parsed into a computer-native format to have any meaning. So any serializer that's working on *csv* has an enormous disadvantage as compared to binary formats. Yet, the results show that *data.table* is on par with binary formats and when more threads are used, it can even be faster. Because of this impressive performance, it was included in the graph for comparison.

Bindings in other languages
---------------------------

**Julia**: [**`fstformat.jl`**](https://github.com/xiaodaigh/fstformat.jl) A naive Julia binding using RCall.jl

> **Note to users**: From CRAN release v0.8.0, the *fst* format is stable and backwards compatible. That means that all *fst* files generated with package v0.8.0 or later can be read by future versions of the package.
Compression reduces the size of the *fst* file that holds your data. But
because the (de-)compression is done *on background threads*, it can
increase the total read- and write speed as well. The graph below shows
how the use of multiple threads enhances the read and write speed of our
sample dataset.

![](README-multi-threading-1.png)<!-- -->

The *csv* format used by the *fread* and *fwrite* methods of package
*data.table* is actually a human-readable text format and not a binary
format. Normally, binary formats would be much faster than the *csv*
format, because *csv* takes more space on disk, is row based,
uncompressed and needs to be parsed into a computer-native format to
have any meaning. So any serializer that’s working on *csv* has an
enormous disadvantage as compared to binary formats. Yet, the results
show that *data.table* is on par with binary formats and when more
threads are used, it can even be faster. Because of this impressive
performance, it was included in the graph for comparison.

## Bindings in other languages

**Julia**:
[**`fstformat.jl`**](https://github.com/xiaodaigh/fstformat.jl) A naive
Julia binding using RCall.jl

> **Note to users**: From CRAN release v0.8.0, the *fst* format is
> stable and backwards compatible. That means that all *fst* files
> generated with package v0.8.0 or later can be read by future versions
> of the package.