drake friendly alternative to dplyr::do #77

Closed
7 of 16 tasks
AlexAxthelm opened this issue Aug 28, 2017 · 15 comments

Comments

@AlexAxthelm
Collaborator

AlexAxthelm commented Aug 28, 2017

Most of my development is done on my local machine, which chokes on datasets that most people wouldn't even think of as "big data". One of the reasons that I was drawn to drake in the first place was the idea that I could segment my data, and let my machine work on each part individually, making checkpoints along the way so that if it does choke, I don't have to re-run the whole thing.

So I wrote a series of functions that give me similar functionality to dplyr::do(), but in drake. Specifically, I've written drake_split() and drake_unsplit() commands, which output plans. This allows me to creatively exploit (read: horribly misuse) the analyses() function to process my small data chunks in parallel, including tasks such as validating (using assertr) and cleaning.

I've got the bones of this "issue" solved, but I don't want to turn it into a PR until I can write some tests and maybe a vignette or example (hopefully later this week). Mostly, I'm opening this to solicit requests for functionality and to put it on the roadmap for later milestones.

Current/Planned features:

  • Splits automatically:
    • data.frame
    • tibble
    • data.table (it might do this; I haven't tested it at all)
  • Recombines automatically
  • Plays well with magrittr pipe
    • Plays well with multi-step pipe chains (e.g. mtcars %>% head %>% drake_split())
  • Respects dplyr groups
    • Attempts to distribute groups evenly (by row count) among slices
  • Can apply multiple steps of analysis to each parallel slice (I use this to validate, clean, and then analyze each slice)
  • Offer a choice of using dplyr or base functions for splitting/binding
  • Documentation
    • Roxygen docs
    • Example script
    • Vignette
  • Unit Tests for each feature
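
To make the list above concrete, here is a rough sketch of the shape I have in mind (hedged: the bodies are illustrative pseudocode, not the actual implementation, and extract_slice() is a hypothetical helper). A drake plan is just a data frame of targets and commands, so drake_split()/drake_unsplit() can build one directly:

drake_split <- function(data_name, slices = 4) {
  # One target per slice of the original object.
  data.frame(
    target = sprintf("%s_slice_%d", data_name, seq_len(slices)),
    command = sprintf("extract_slice(%s, slice = %d, slices = %d)",
                      data_name, seq_len(slices), slices),
    stringsAsFactors = FALSE
  )
}

drake_unsplit <- function(data_name, slices = 4) {
  # A single target that recombines the slices.
  slice_targets <- sprintf("%s_slice_%d", data_name, seq_len(slices))
  data.frame(
    target = sprintf("%s_recombined", data_name),
    command = sprintf("rbind(%s)", paste(slice_targets, collapse = ", ")),
    stringsAsFactors = FALSE
  )
}

drake_split("big_data", slices = 3)
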
@AlexAxthelm changed the title from "drake do" to "drake friendly alternative to dplyr::do" on Aug 28, 2017
@wlandau-lilly
Collaborator

wlandau-lilly commented Aug 28, 2017

@AlexAxthelm So this is a way to break up a big data frame and process the chunks into multiple targets? That sounds super cool, I love it!

One small suggestion: it may be slightly more future-proof to use evaluate() directly rather than analyses(), though analyses() does get the point across more easily.
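
For reference, a hedged illustration of the wildcard interface that analyses() wraps, shown with the later drake_plan()/evaluate_plan() names (process() and the chunk values are made-up placeholders):

library(drake)

# One templated step with a wildcard...
step <- drake_plan(summary = process(chunk__))
# ...expanded into one target per chunk.
evaluate_plan(step, wildcard = "chunk__", values = c("chunk_1", "chunk_2", "chunk_3"))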

@AlexAxthelm
Collaborator Author

Pretty much. I'll write up a little example code and the first few tests today, and go ahead and make a PR. I wrote it to break up my data frames after I kept hitting hard limits (.Machine$integer.max) in my calculations.

@AlexAxthelm
Collaborator Author

I realized that it isn't playing well with multi-step pipe chains and was only including the base object in the plan's command. Unchecked that box.

@wlandau-lilly
Collaborator

By the way, I just submitted a proposal to give a contributed talk on drake at RStudio::conf(2018). It is a long shot, but more than worth a try, and we should know by Oct 2 if the talk is accepted. This functionality would be perfect for that audience.

@wlandau-lilly
Collaborator

By the way: I do not mean to derail this thread, but since I mentioned RStudio::conf(2018) before, I should say that the proposal for the drake talk was denied. However, last week, I submitted drake to rOpenSci. We share the same mission of reproducibility.

@wlandau-lilly
Collaborator

@AlexAxthelm It has been a long time since we talked about this thread. I have not forgotten about you, and I am curious about your current thoughts and plans.

I really like the work you have done so far in #79. I had some new thoughts about a broader, more generalized approach, but then I remembered that evaluate_plan() and wildcard::wildcard() are already mirror images of gather_plan(). So the value added here is a solution that embraces the tidyverse and explicitly assumes that the data to be split is already loaded into memory. I think you have already done most of the work. Unfortunately, I am not very skilled or knowledgeable with dplyr, so I will need to rely on your guidance. (As a package developer and someone who has frequently written one-off analyses as packages, the non-standard evaluation of dplyr is a pain because it leads to "undefined global symbol" warnings in devtools::check(). So I usually resist it.)

@AlexAxthelm
Collaborator Author

Hi!

This has not fallen (completely) off my radar. My current plan is to follow your advice (which I can't locate right now to link to) and build up functionality modularly. Ultimately, I'm thinking of having a function which would accept as key arguments an object, and a plan. The function would then split the object, run through the plan on each split element, and then recombine.

The plan in this case would be a sort of mini-plan. Practically, this would be similar in nature to what would go into the ... argument for dplyr::do, or the FUN argument for anything in the apply family. But by changing it over to be a plan with some wildcards, we can drakeify multi-step pipelines for splits and allow for a bit more modularity (swapping out rbind(master_plan, mini_plan) for rbind(master_plan, split_function(big_data, mini_plan, ...)), for example).
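
A hedged sketch of what I mean, built on the wildcard helpers (shown with the later evaluate_plan()/gather_plan()/drake_plan() names; split_function() and its signature are hypothetical, and analyze()/clean() are placeholders):

library(drake)

split_function <- function(data_name, mini_plan, slices = 4, wildcard = "slice__") {
  # One copy of the mini-plan per slice, plus a target that recombines the results.
  slice_names <- sprintf("%s_%d", data_name, seq_len(slices))
  expanded <- evaluate_plan(mini_plan, wildcard = wildcard, values = slice_names)
  recombine <- gather_plan(expanded, target = paste0(data_name, "_results"))
  rbind(expanded, recombine)
}

# Toy usage, roughly as described above:
mini_plan <- drake_plan(result = analyze(clean(slice__)))
# master_plan <- rbind(master_plan, split_function("big_data", mini_plan))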

I'll state as a goal here (not for the first round, but definitely by the second) that the mini_plan shouldn't need wildcards, but there should be an option for some deparse/substitute/grep deep magic to automatically identify the name of the object being split in the plan, and run with that. This should make debugging pipelines easier, since it would be literally the same plan running, and would make "hotswapping" split plans with unsplit ones easier.

On the back end, I'm planning on having a series of non-exported splitting/recombining functions, along the lines of split_list()/unsplit_list(), split_df()/unsplit_df(), etc. I'm going to prioritize the base R types, since they are mostly un-weird. I know that split_tbl()/unsplit_tbl() is going to be a bit of a pain, since there is a lot of weirdness that can show up there, especially when we start talking about grouping or lazy database queries 😨 (which is, unfortunately, one of my main use cases 🙃).
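
A hedged sketch of the data-frame pair, just to show the level of machinery I mean (names as above; bodies are illustrative only, and the tibble/grouped versions would need more care):

split_df <- function(x, slices = 4) {
  # Contiguous, roughly equal-sized row chunks.
  split(x, cut(seq_len(nrow(x)), breaks = slices, labels = FALSE))
}

unsplit_df <- function(pieces) {
  do.call(rbind, unname(pieces))
}

pieces <- split_df(mtcars, slices = 3)
nrow(unsplit_df(pieces)) == nrow(mtcars)  # TRUE: rows come back in the original order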

All of what I have here is primarily oriented around embarrassingly parallel problems, but that doesn't mean a well-motivated and clever person couldn't write some nightmare-fuel wildcards in the mini_plan to allow dependencies between the parallel pipelines; let's call that a distant pipe dream.

I'm pretty packed through the middle of December, 2017, but hopefully I'll be able to make some progress on this starting after that. I'll submit an update to #79 when I've got a stable function that at least works for splitting lists and ungrouped dataframes.

Side note: the dplyr NSE/SE stuff is kind of a nightmare. I've usually gotten around it with something along the lines of

data %>%
  mutate(rlang::UQ(as.name("foobar")) := rlang::UQ(as.name("foo")) + rlang::UQ(as.name("bar")))

where foo, bar, and foobar are variables holding column names. It's not pretty, though, and I'm lamenting the deprecation of the select_/mutate_/summarise_ family. For the base R split functions (list, data.frame), I'm planning on sticking with base R.
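
For reference, the same workaround with the newer bang-bang syntax (assuming rlang >= 0.2 and dplyr >= 0.7; the column names are just illustrative):

library(dplyr)
library(rlang)

foo <- "disp"
bar <- "hp"
foobar <- "disp_plus_hp"

mtcars %>%
  mutate(!!sym(foobar) := !!sym(foo) + !!sym(bar))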

@wlandau
Member

wlandau commented Nov 27, 2017

Glad to hear you are still planning to continue. I am very interested, and this is something I would probably not implement myself.

I agree that splitting/unsplitting separately for each data type seems like the way to go and that dplyr/group_by is its own special case.

@wlandau
Member

wlandau commented Feb 5, 2018

Seems like #233 could replace this one. Thoughts?

@AlexAxthelm
Collaborator Author

TL;DR: This could probably work, but I have concerns about overly large targets.

re #233: I think that you might be right here. Since I opened this issue, the work I had been planning to use it for has shifted: I've changed from using flat files loaded into memory to working with DBI and SQL as a data store, as a workaround for objects not fitting into system memory. I've been thinking carefully about how #233 might work differently from what I'm considering in this issue, and as near as I can tell, #233 is a lot of good reorganizing of how complicated workflows are constructed, but I would want to know more about the actual execution plans.

My original objective with this issue was to parallelize large computations across targets, so that by using many small targets, I could spread similar computations across time and processors. I'm concerned that a plan such as the one in #233 is the opposite of that. As an example, building the example plan through analyses lets me see this:

> analyses$dataset[[1]] %>% object.size
1544 bytes
> analyses
Source: local data frame [4 x 3]
Groups: <by row>

# A tibble: 4 x 3
                dataset    reg   result
                 <list> <list>   <list>
1 <data.frame [48 x 2]>  <fun> <S3: lm>
2 <data.frame [48 x 2]>  <fun> <S3: lm>
3 <data.frame [64 x 2]>  <fun> <S3: lm>
4 <data.frame [64 x 2]>  <fun> <S3: lm>
> analyses %>% object.size()
120160 bytes

I know that comparing object sizes in R is complicated business, but it looks like, at the least, the analyses data frame will be the sum of the sizes of each distinct dataset. This could be somewhat mitigated by using transmute instead of mutate, but I worry that for many replicates or large datasets this could get unwieldy without some back-end lazy-evaluation magic.
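
A toy illustration of the size difference I mean (not the actual #233 plan; mtcars stands in for a real dataset):

library(dplyr)
library(purrr)
library(tibble)

datasets <- tibble(dataset = list(mtcars, mtcars))

# mutate() keeps the dataset column alongside the fits...
with_data <- datasets %>%
  mutate(fit = map(dataset, ~ lm(mpg ~ wt, data = .x)))

# ...while transmute() keeps only the fits.
fits_only <- datasets %>%
  transmute(fit = map(dataset, ~ lm(mpg ~ wt, data = .x)))

object.size(with_data)  # carries the datasets as well as the models
object.size(fits_only)  # just the fitted models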

I'm not sure how you would handle this type of thing without pretty heavy changes to make(), which would have to look for something like rowwise() or the group_by() family and break a single target down. If @krlmlr has suggestions that I'm not seeing, that would be great.

If you want to close this issue, I'm totally on board with that. When I get some time, I am planning to add a vignette about using drake with DBI; that workflow makes a lot of what I was planning here less urgent, so maybe the need for something like this doesn't actually exist, although I'm hesitant to offer "use databases" as a solution to overly large targets.

@wlandau
Member

wlandau commented Feb 6, 2018

I'd love a drake/DBI vignette! I look forward to the PR. Incidentally, #227 and #236 will make it possible to seriously use storr_dbi() for projects with HPC.

Since you gave permission, I am closing this issue in favor of #233.

@wlandau wlandau closed this as completed Feb 6, 2018
@AlexAxthelm
Collaborator Author

Should I add it as a separate vignette, or include it in the best-practices one? It seems like it might be a bit of an edge use case. Also, doesn't storr_dbi limit drake to one job? I don't yet know how that would play with DBI-based targets.

@wlandau
Member

wlandau commented Feb 6, 2018

I think relatively few drake users will also use DBI, so I think it could be its own vignette. But if you don't think you will need much space and you think the lessons generalize, then we could think about adding it to the best practices one. storr_dbi() is not threadsafe, and it currently limits drake to one job. However, for a future-powered scheduler with an option to make the master process do all the caching, we can start to scale with it. Of course, if storage is the bottleneck, then the parallel computing won't really help. But I think this could aid projects with large computations and small-ish data.
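
For concreteness, a hedged sketch of the storr_dbi() setup (assuming RSQLite; plan stands for whatever drake plan the project uses, and the table names are arbitrary):

library(DBI)
library(storr)
library(drake)

con <- DBI::dbConnect(RSQLite::SQLite(), "drake-cache.sqlite")
cache <- storr::storr_dbi(tbl_data = "data", tbl_keys = "keys", con = con)

# Not thread-safe, so keep the build at one job for now.
make(plan, cache = cache, jobs = 1)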

@wlandau
Member

wlandau commented Feb 6, 2018

Let me rephrase that: I predict that relatively few drake users will also use DBI directly. DBI itself has several thousand CRAN downloads per day, so clearly it is one of the most consumed R packages to date.

@wlandau
Member

wlandau commented Nov 3, 2019

@AlexAxthelm, FYI: #1042, #1042 (comment).
