---
title: "Efficient Data Input/Output (I/O)"
---
Before we can work with data within R, we first have to be able to read it in. Conversely, once we've finished processing or analysing our data, we might need to write out final or intermediate results.
Many factors will go into deciding which format and which read & write functions we might choose for our data. For example:
- File size
- Portability
- Interoperability
- Human readability
In this section we'll cover a number of the most common file formats for data (primarily tabular) and summarise their characteristics.
We'll also compare and benchmark functions and packages available in R for reading and writing them.
## File formats
### Flat files
Some of the most common file formats we might be working with when dealing with tabular data are flat delimited text files. Such files store two-dimensional arrays of data by separating the values in each row with specific delimiter characters. A couple of well known examples are:
- Comma-Separated Values files (CSVs): use a comma to separate values.
- Tab-Separated Values files (TSVs): use a tab to separate values.
They are ubiquitous and human readable but, as you will see, they take up comparatively more disk space and can be slow to read and write when dealing with large files.
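To make the delimiter idea concrete, here's a minimal sketch (base R only) that writes a tiny data frame to a temporary CSV and inspects the raw text:
```{r}
# Write a two-row data frame to a temporary CSV and look at the raw lines.
df <- data.frame(id = 1:2, name = c("Ada", "Grace"))
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)
readLines(csv_file)
#> [1] "\"id\",\"name\"" "1,\"Ada\""       "2,\"Grace\""
```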
#### Packages/functions that can read/write delimited text files
##### Relevant functions
###### Read
- `read.csv()` / `read.delim()`
- `readr::read_csv()`
- `data.table::fread()`
- `arrow::read_csv_arrow()`
###### Write
- `write.csv()` / `write.table()`
- `readr::write_csv()`
- `data.table::fwrite()`
- `arrow::write_csv_arrow()`
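As a quick sketch of how these are used (we'll benchmark them properly below), here's the same temporary file read with base R and `data.table`, assuming `data.table` is installed:
```{r}
# Round-trip a built-in dataset through a temporary CSV.
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp, row.names = FALSE)

base_df <- read.csv(tmp)     # returns a data.frame
dt <- data.table::fread(tmp) # returns a data.table
identical(dim(base_df), dim(dt))
```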
### Binary files
If you look at [Wikipedia](https://en.wikipedia.org/wiki/Binary_file) for a definition of Binary files, you get:
> A **binary file** is a [computer file](https://en.wikipedia.org/wiki/Computer_file "Computer file") that is not a [text file](https://en.wikipedia.org/wiki/Text_file "Text file") 😜
You'll also learn that binary files are usually thought of as being a sequence of [bytes](https://en.wikipedia.org/wiki/Byte "Byte"), and that some binary files contain headers, blocks of metadata used by a computer program to interpret the data in the file. Because they are stored in bytes, they are not human readable unless viewed through specialised viewers.
The process of writing out data to a binary format is called binary serialisation, and different formats can use different serialisation methods.
Let's look at some binary formats you might consider as an R user.
#### `RData/RDS` formats
`.RData` and `.rds` files are binary formats specific to R that can store complete R objects, so they are not restricted to tabular data. They can therefore be good options for storing more complicated objects like models. `.RData` files can store multiple objects while `.rds` files are designed to contain a single object. Pertinent characteristics of such files:
- Can be faster to restore the data to R (but not necessarily as fast to write).
- Can preserve R specific information encoded in the data (e.g., attributes, variable types, etc).
- Are R specific so not interoperable outside of R environments.
- In R 3.6, the default serialisation version used to write `.RData` and `.rds` binary files changed from 2 to 3. This means that files serialised with version 3 cannot be read by others running R \< 3.5.0, which limits interoperability even between R users.
Overall, while good for writing R objects, I would reserve such files for ephemeral intermediate results or for more complex objects where other formats are not appropriate. Be mindful of the serialisation version you use if you want users running R \< 3.5.0 to be able to read them.
##### Relevant functions
###### Write
- `save()`: for writing `.RData` files.
- `saveRDS()`: for writing `.rds` files.
###### Read
- `load()`: for reading `.RData` files.
- `readRDS()`: for reading `.rds` files.
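A minimal sketch of both approaches; note the `version` argument, which controls the serialisation version discussed above:
```{r}
# .rds: one object per file; the reader assigns the result itself.
rds_file <- tempfile(fileext = ".rds")
saveRDS(mtcars, rds_file, version = 2) # version 2 is readable by R < 3.5.0
same_data <- readRDS(rds_file)

# .RData: can bundle several objects; load() restores them under their
# original names into the current environment.
rdata_file <- tempfile(fileext = ".RData")
dat <- mtcars
x <- 1:10
save(dat, x, file = rdata_file, version = 2)
load(rdata_file) # recreates `dat` and `x`
```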
#### Apache Parquet/Arrow
While they are different file formats, I've bundled these two together because both are Apache Foundation data formats, and we use the same R package (`arrow`) to read and write them.
- [**Apache Parquet**](https://parquet.apache.org/) is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
- [**Apache Arrow**](https://arrow.apache.org/docs/format/Columnar.html#format-columnar) defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs.
Parquet is a storage format designed for maximum space efficiency, whereas Arrow is an in-memory format intended for operation by vectorized computational kernels.
The formats, as well as the `arrow` R package to interact with them, are part of the Apache Arrow software development platform for building high performance applications that process and transport large data sets.
::: callout-note
*You may have noticed the files I shared in `data/` as part of the course materials were all parquet files. That's because the compression of parquet files meant I could write a 10,000,000-row table of data to a \~67 MB file (compared to over 1 GB in CSV format!), which allowed me to share it through GitHub (and you to download it in a more acceptable time frame!).*
:::
##### Relevant functions
###### Write
- `arrow::write_parquet()`: for writing Apache parquet files.
- `arrow::write_feather()`: for writing Arrow IPC format files (Arrow IPC represents version 2 of the Feather format, hence the potentially confusing function name).
###### Read
- `arrow::read_parquet()`: for reading Apache parquet files.
- `arrow::read_feather()`: for reading Arrow IPC format files.
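A minimal sketch round-tripping a data frame through both formats:
```{r}
pq_file <- tempfile(fileext = ".parquet")
ipc_file <- tempfile(fileext = ".arrow")

# Write the same data frame in both Apache formats.
arrow::write_parquet(mtcars, sink = pq_file)
arrow::write_feather(mtcars, sink = ipc_file)

# Both read back in as data frames.
pq <- arrow::read_parquet(pq_file)
ipc <- arrow::read_feather(ipc_file)
```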
#### `fst`
The [*fst* package](https://github.com/fstpackage/fst) for R is based on a number of C++ libraries and provides a fast, easy and flexible way to serialize data frames into the `fst` binary format. With access speeds of multiple GB/s, *fst* is specifically designed to unlock the potential of high speed solid state disks that can be found in most modern computers.
The *fst* file format provides full random access to stored datasets allowing retrieval of subsets of both columns and rows from a file. Files are also compressed.
##### Relevant functions
###### Write
- `fst::write_fst()`: for writing `fst` files.
###### Read
- `fst::read_fst()`: for reading `fst` files.
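The random access described above is exposed through the `columns`, `from` and `to` arguments of `fst::read_fst()`; a quick sketch:
```{r}
fst_file <- tempfile(fileext = ".fst")
fst::write_fst(mtcars, path = fst_file)

# Retrieve just two columns and rows 5 to 10, without reading the
# rest of the file.
fst::read_fst(fst_file, columns = c("mpg", "cyl"), from = 5, to = 10)
```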
#### `qs`
Package [`qs`](https://github.com/traversc/qs) provides an interface for quickly saving and reading objects to and from disk. The goal of this package is to provide a lightning-fast and complete replacement for the `saveRDS` and `readRDS` functions in R.
`saveRDS()` and `readRDS()` are the standard for serialisation of R data, but these functions are not optimised for speed. On the other hand, `fst` is extremely fast, but only works on `data.frame`s and certain column types.
`qs` is both extremely fast and general: it can serialize any R object like `saveRDS` and is just as fast and sometimes faster than `fst`.
##### Relevant functions
###### Write
- `qs::qsave()`: for serialising R objects to `qs` files.
###### Read
- `qs::qread()`: for reading `qs` files.
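A minimal sketch showing that, unlike `fst`, `qs` handles arbitrary R objects such as a fitted model:
```{r}
qs_file <- tempfile(fileext = ".qs")
model <- lm(mpg ~ wt, data = mtcars)

# Serialise the whole model object, not just tabular data.
qs::qsave(model, file = qs_file)
restored <- qs::qread(qs_file)
```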
## Benchmarks
Now that we've discussed a bunch of relevant file formats and the packages used to read and write them, let's go ahead and test out the comparative performance of reading and writing them, as well as the file sizes of different formats.
### Writing data
Let's start by comparing write efficiency.
Before we start, we'll need some data to write, so let's load one of the parquet files from the course materials. Let's go for the file with 1,000,000 rows. If you want to speed up the testing, you can use the file with 100,000 rows by changing the value of `n_rows`.
```{r}
n_rows <- 1000000L
data <- arrow::read_parquet(here::here("data", paste0("synthpop_", n_rows, ".parquet")))
```
Let's also load `dplyr` for the pipe and other helpers:
```{r}
#| output: false
library(dplyr)
```
Let's now create a directory to write our data to:
```{r}
out_dir <- here::here("data", "write")
fs::dir_create(out_dir)
```
To compare each file format and function combination (where appropriate), I've written a function that uses the value of the `format` argument and `switch()` to dispatch a different write function/format combination for writing out the data.
```{r}
write_dataset <- function(data,
                          format = c("csv", "csv_readr", "csv_dt", "csv_arrow",
                                     "parquet", "arrow", "rdata", "rds",
                                     "fst", "qs"),
                          out_dir,
                          file_name = paste0("synthpop_", n_rows, "_")) {
  switch(format,
    ## FLAT FILES ##
    # write csv using base
    csv = write.csv(data,
                    file = fs::path(out_dir, paste0(file_name, format),
                                    ext = "csv"),
                    row.names = FALSE),
    # write csv using readr
    csv_readr = readr::write_csv(data,
                                 file = fs::path(out_dir,
                                                 paste0(file_name, format),
                                                 ext = "csv")),
    # write csv using data.table
    csv_dt = data.table::fwrite(data,
                                file = fs::path(out_dir,
                                                paste0(file_name, format),
                                                ext = "csv")),
    # write csv using arrow
    csv_arrow = arrow::write_csv_arrow(data,
                                       file = fs::path(out_dir,
                                                       paste0(file_name, format),
                                                       ext = "csv")),
    ## BINARY FILES ##
    # write parquet using arrow
    parquet = arrow::write_parquet(data,
                                   sink = fs::path(out_dir,
                                                   paste0(file_name, format),
                                                   ext = "parquet")),
    # write arrow IPC using arrow
    arrow = arrow::write_feather(data,
                                 sink = fs::path(out_dir,
                                                 paste0(file_name, format),
                                                 ext = "arrow")),
    # write RData using base
    rdata = save(data,
                 file = fs::path(out_dir, paste0(file_name, format),
                                 ext = "RData"),
                 version = 2),
    # write rds using base
    rds = saveRDS(data,
                  file = fs::path(out_dir, paste0(file_name, format),
                                  ext = "rds"),
                  version = 2),
    # write fst using fst
    fst = fst::write_fst(data,
                         path = fs::path(out_dir, paste0(file_name, format),
                                         ext = "fst")),
    # write qs using qs
    qs = qs::qsave(data,
                   file = fs::path(out_dir, paste0(file_name, format),
                                   ext = "qs"))
  )
}
```
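Before benchmarking, a quick sanity check that the helper writes where we expect:
```{r}
write_dataset(data, format = "parquet", out_dir = out_dir)
fs::file_exists(fs::path(out_dir, paste0("synthpop_", n_rows, "_parquet"),
                         ext = "parquet"))
```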
I've also written a function to process the `bench::mark()` output, removing unnecessary information, arranging the results in ascending order of median time (fastest first) and printing the result as a `gt()` table.
```{r}
print_bm <- function(benchmark) {
benchmark[, c("expression", "min", "result", "memory", "time", "gc")] <- NULL
benchmark %>%
arrange(median) %>%
gt::gt()
}
```
We're now ready to run our benchmarks. I've set them up as a `bench::press()` so we can run the same function every time but vary the `format` argument for each test:
```{r}
#| message: false
#| warning: false
bench::press(
format = c("csv", "csv_readr", "csv_dt", "csv_arrow",
"parquet", "arrow", "rdata", "rds", "fst", "qs"),
{
bench::mark(write_dataset(data, format = format, out_dir = out_dir))
}
) %>%
print_bm()
```
We see that:
- The fastest write format by quite some margin is the arrow format, via `arrow::write_feather()`.
- All `arrow` package functions are quite efficient, featuring in the top five for speed regardless of format.
- For CSVs, however, there is a clear winner: `data.table::fwrite()`.
- Both `qs` and `fst` are, as advertised, quite fast, and `qs` in particular should definitely be considered when you need to store more complex R objects.
- Base functions `write.csv()`, `save()` and `saveRDS()` are often orders of magnitude slower.
#### Size on disk
Let's also check how much space each file format takes up on disk:
```{r}
tibble::tibble(file = basename(fs::dir_ls(out_dir)),
size = file.size(fs::dir_ls(out_dir))) |>
arrange(size) |>
mutate(size = gdata::humanReadable(size,
standard="SI",
digits=1)) |>
gt::gt()
```
It's clear that binary formats take up a lot less space on disk than CSV text files. At the extremes, the parquet file takes up over 17 times less space than a CSV file written out with `write.csv()` or `arrow::write_csv_arrow()`.
### Reading data
Let's now use the files we created to test how efficient the different formats and functions are at reading data back in.
Just like `write_dataset()` before, I've written a function that reads each file with the matching read function according to the value of the `format` argument:
```{r}
read_dataset <- function(format = c("csv", "csv_readr", "csv_dt", "csv_arrow",
                                    "parquet", "arrow", "rdata", "rds",
                                    "fst", "qs"),
                         out_dir,
                         file_name = paste0("synthpop_", n_rows, "_")) {
  switch(format,
    ## FLAT FILES ##
    # read csv using base
    csv = read.csv(file = fs::path(out_dir, paste0(file_name, format),
                                   ext = "csv")),
    # read csv using readr
    csv_readr = readr::read_csv(file = fs::path(out_dir,
                                                paste0(file_name, format),
                                                ext = "csv")),
    # read csv using data.table
    csv_dt = data.table::fread(file = fs::path(out_dir,
                                               paste0(file_name, format),
                                               ext = "csv")),
    # read csv using arrow
    csv_arrow = arrow::read_csv_arrow(file = fs::path(out_dir,
                                                      paste0(file_name, format),
                                                      ext = "csv")),
    ## BINARY FILES ##
    # read parquet using arrow
    parquet = arrow::read_parquet(file = fs::path(out_dir,
                                                  paste0(file_name, format),
                                                  ext = "parquet")),
    # read arrow IPC using arrow
    arrow = arrow::read_feather(file = fs::path(out_dir,
                                                paste0(file_name, format),
                                                ext = "arrow")),
    # read RData using base
    rdata = load(file = fs::path(out_dir, paste0(file_name, format),
                                 ext = "RData")),
    # read rds using base
    rds = readRDS(file = fs::path(out_dir, paste0(file_name, format),
                                  ext = "rds")),
    # read fst using fst
    fst = fst::read_fst(path = fs::path(out_dir, paste0(file_name, format),
                                        ext = "fst")),
    # read qs using qs
    qs = qs::qread(file = fs::path(out_dir, paste0(file_name, format),
                                   ext = "qs"))
  )
}
```
Again, I've set up the benchmarks as a `bench::press()` so we can run the same function every time while varying the `format` argument. Let's see how fast our format/function combos are at reading!
```{r}
#| message: false
#| warning: false
bench::press(
format = c("csv", "csv_readr", "csv_dt", "csv_arrow",
"parquet", "arrow", "rdata", "rds", "fst", "qs"),
{
bench::mark(read_dataset(format = format, out_dir = out_dir))
}
) %>%
print_bm()
```
Results of our experiments show that:
- The arrow format, read with `arrow::read_feather()`, is again the fastest.
- All `arrow` functions are again among the fastest for reading, regardless of format, occupying the top three.
- `data.table::fread()` is again very competitive for reading CSVs.
- `qs` is also highly performant, and a good package to know given it can be used for more complex objects.
- Base functions for reading files, whether binary or CSV, are again the slowest by quite some margin.
- It should be noted that both `readr::read_csv()` and `read.csv()` can be made much faster by pre-specifying the data type of each column when reading, as sketched below.
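For example, here's a minimal sketch with hypothetical `age` and `income` columns, declaring the type of each column up front so the readers can skip type inference:
```{r}
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(age = c(21L, 34L), income = c(30000, 52000)),
          tmp, row.names = FALSE)

# Base R: declare types via colClasses.
base_df <- read.csv(tmp, colClasses = c(age = "integer", income = "numeric"))

# readr: declare types via a cols() specification.
readr_df <- readr::read_csv(tmp, col_types = readr::cols(
  age = readr::col_integer(),
  income = readr::col_double()
))
```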
::: callout-important
## Take Aways
- The `arrow` package offers some of the fastest functions for writing both flat (e.g. CSV) and binary files like `parquet` and `arrow`.
- The `arrow` format is especially fast to read and write.
- Functions from the `data.table` package are also solid contenders for reading and writing CSV files.
- Functions in package `qs` are also quite performant, especially given they can read and write more complex R objects.
- Binary files are the most disk space efficient, particularly the `parquet` file format.
:::