Write more about dplyr memory usage #198

hadley · 2014-01-20T23:08:19Z

Starting from the following email. It needs updating to reflect recent changes, and should become a vignette.

We'll start by making a local copy of the internal dplyr function dfloc(). This function is very useful for helping us understand how the memory in a data frame works.

library(dplyr)
dfloc <- dplyr:::dfloc

(dfloc will eventually be exported from dplyr once we've thought it through a bit more.)

dfloc() tells us the address of each vector in the data frame.

dfloc(iris)

If these addresses change between operations then we know R has made a copy. It's important to think about data frames as collections of columns rather than monolithic objects, because for many operations we can reuse existing columns and not use any extra memory.
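Since dfloc() is internal and may not be available in the dplyr version you have installed, here is a minimal sketch of the same idea. It assumes the lobstr package is installed; col_addrs() is a hypothetical helper written for this example, not part of dplyr. It modifies one column of a copy and compares addresses before and after.

```r
# Assumes the lobstr package is installed; col_addrs() is a hypothetical
# helper, not a dplyr function.
col_addrs <- function(df) {
  # obj_addrs() returns one address per list element, i.e. one per column
  stats::setNames(lobstr::obj_addrs(df), names(df))
}

df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
before <- col_addrs(df)

df2 <- df
df2$x[1] <- 10  # df must keep its old values, so x can no longer be shared

after <- col_addrs(df2)
before == after  # FALSE for x; y shows whether your R version reused that column
```

The x column has to get a new address because df still holds the original values. Whether y keeps its address depends on the R version, which is exactly what this kind of check is meant to reveal.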

In base R, a surprising number of operations make copies of the individual vectors. For example, when you extract two columns from a data frame, their contents are actually copied. There's no reason to do this!

# Copies the first two columns
dfloc(iris[1:2])

dfloc(iris)
# Copies all the columns!
iris$blah <- 1
dfloc(iris)

(This may improve in R 3.1.0 thanks to some work by Michael Lawrence.)

The goal of dplyr is to avoid making copies when not needed:

dfloc(iris)
dfloc(group_by(iris, Species))
dfloc(mutate(iris, area = Sepal.Length * Sepal.Width))
dfloc(select(iris, 1:3))

Currently, group_by() doesn't make a copy, but mutate() and select() do, so we'll fix that for the next version. Once we've done that, any sequence of mutate(), select() and group_by() will only need to occupy a little extra memory (i.e. for the indices and new variables). Saving interim results will not have any effect on memory usage.
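One way to check the claim about interim results on whatever dplyr version you have installed is to measure two objects together. This is a sketch that assumes the lobstr package is available; lobstr::obj_size() counts memory shared between its arguments only once, so the gap between the separate and combined sizes gives a rough idea of how much is reused.

```r
library(dplyr)

# Assumes the lobstr package is installed.
with_area <- mutate(iris, area = Sepal.Length * Sepal.Width)

# Sizes measured one object at a time, then both objects together
separate <- as.numeric(lobstr::obj_size(iris)) +
  as.numeric(lobstr::obj_size(with_area))
together <- as.numeric(lobstr::obj_size(iris, with_area))

# Fraction of iris that the two objects share;
# a value near 1 suggests the original columns were reused
(separate - together) / as.numeric(lobstr::obj_size(iris))
```

If the ratio is close to zero, the verb copied the existing columns and keeping the interim result roughly doubles the memory in use.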

Obviously there's no way around summarise() making a copy, but it's usually not a big deal since you're reducing the size of the data so much. arrange() also has to make a copy, but generally you can avoid using it since it won't affect any statistical operation. If ordering is important (e.g. for computing a cumulative mean), dplyr provides ways to avoid copying the whole data frame and instead reorder only the columns you need (see the windowing vignette for more details).
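To illustrate reordering only the column you need, here is a base R sketch of an ordered cumulative mean. It is not necessarily how dplyr implements this; cummean_by() is a hypothetical helper written for this example. Only the value column is permuted, and the result is written back in the original row order, so the data frame itself is never sorted.

```r
# cummean_by() is a hypothetical helper written for this example.
cummean_by <- function(x, by) {
  o <- order(by)                          # permutation that sorts `by`
  running <- cumsum(x[o]) / seq_along(o)  # cumulative mean in sorted order
  out <- numeric(length(x))
  out[o] <- running                       # scatter back to the original row order
  out
}

df <- data.frame(time = c(3, 1, 2), value = c(30, 10, 20))
df$cum_mean <- cummean_by(df$value, df$time)
df$cum_mean  # 20 10 15
```

The rows stay where they were: the row with time 1 gets 10, time 2 gets 15 and time 3 gets 20.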

Altogether, this means that dplyr lets you work with data frames with very little extra overhead. Eventually, dplyr will never create a complete copy of the data frame unless you're sorting it, and will provide tools so that you never need to sort it. This should mean that you can keep using data frames, and don't need to switch to a more complex object with reference semantics (like a data table).

lock · 2018-09-16T16:25:54Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

hadley added the documentation label Mar 17, 2014

hadley added this to the v0.2 milestone Mar 17, 2014

hadley closed this as completed in dcf1db9 Mar 20, 2014

acthomasca mentioned this issue Jun 13, 2015

segfaulting problem on Ubuntu Linux, again #952

Closed

krlmlr added documentation and removed documentation labels Mar 20, 2018

lock bot locked and limited conversation to collaborators Sep 16, 2018
