drake friendly alternative to dplyr::do #77

Closed
7 of 16 tasks
AlexAxthelm opened this issue Aug 28, 2017 · 15 comments

Comments

@AlexAxthelm
Collaborator

AlexAxthelm commented Aug 28, 2017

Most of my development is done on my local machine, which chokes on datasets that most people wouldn't even think of as "big data". One of the reasons that I was drawn to drake in the first place was the idea that I could segment my data, and let my machine work on each part individually, making checkpoints along the way so that if it does choke, I don't have to re-run the whole thing.

So I wrote a series of functions that give me similar functionality to dplyr::do(), but in drake. Specifically, I've written drake_split() and drake_unsplit() commands, which output plans. This allows me to creatively exploit (read: horribly misuse) the analyses() function to process my small data chunks in parallel, including tasks such as validating (using assertr) and cleaning.

I've got the bones of this "issue" solved, but I don't want to turn it into a PR until I can write some tests and maybe a vignette or example (hopefully later this week). Mostly, I'm opening this to solicit requests for functionality and to put it on the roadmap for later milestones.

Current/Planned features:

  • Splits automatically:
    • data.frame
    • tibble
    • data.table (it might do this; I haven't tested it at all)
  • Recombines automatically
  • Plays well with magrittr pipe
    • Plays well with multi-step pipe chains (e.g. mtcars %>% head %>% drake_split())
  • Respects dplyr groups
    • Attempts to distribute groups evenly (by row count) among slices
  • Can apply multiple steps of analysis to each parallel slice (I use this to validate, clean, and then analyze each slice)
  • Offer a choice of using dplyr or base functions for splitting/binding
  • Documentation
    • Roxygen docs
    • Example script
    • Vignette
  • Unit Tests for each feature
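
To make the list above concrete, here is a rough sketch of the shape I have in mind (hedged: the bodies are illustrative pseudocode, not the actual implementation, and extract_slice() is a hypothetical helper). A drake plan is just a data frame of targets and commands, so drake_split()/drake_unsplit() can build one directly:

drake_split <- function(data_name, slices = 4) {
  # One target per slice of the original object.
  data.frame(
    target = sprintf("%s_slice_%d", data_name, seq_len(slices)),
    command = sprintf("extract_slice(%s, slice = %d, slices = %d)",
                      data_name, seq_len(slices), slices),
    stringsAsFactors = FALSE
  )
}

drake_unsplit <- function(data_name, slices = 4) {
  # A single target that recombines the slices.
  slice_targets <- sprintf("%s_slice_%d", data_name, seq_len(slices))
  data.frame(
    target = sprintf("%s_recombined", data_name),
    command = sprintf("rbind(%s)", paste(slice_targets, collapse = ", ")),
    stringsAsFactors = FALSE
  )
}

drake_split("big_data", slices = 3)
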
@AlexAxthelm changed the title from "drake do" to "drake friendly alternative to dplyr::do" on Aug 28, 2017
@wlandau-lilly
Collaborator

wlandau-lilly commented Aug 28, 2017

@AlexAxthelm So this is a way to break up a big data frame and process the chunks into multiple targets? That sounds super cool, I love it!

One small suggestion: it may be slightly more future-proof to use evaluate() directly rather than analyses(), though analyses() does get the point across more easily.
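
For reference, a hedged illustration of the wildcard interface that analyses() wraps, shown with the later drake_plan()/evaluate_plan() names (process() and the chunk values are made-up placeholders):

library(drake)

# One templated step with a wildcard...
step <- drake_plan(summary = process(chunk__))
# ...expanded into one target per chunk.
evaluate_plan(step, wildcard = "chunk__", values = c("chunk_1", "chunk_2", "chunk_3"))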

@AlexAxthelm
Collaborator Author

Pretty much. I'll write up a little example code and the first few tests today, and go ahead and make a PR. I wrote it to break up my data frames after I kept hitting hard limits (.Machine$integer.max) in my calculations.

@AlexAxthelm
Collaborator Author

I realized that it isn't playing well with multi-step pipe chains and was only including the base object in the plan's command. Unchecked that box.

@wlandau-lilly
Collaborator

By the way, I just submitted a proposal to give a contributed talk on drake at RStudio::conf(2018). It is a long shot, but more than worth a try, and we should know by Oct 2 if the talk is accepted. This functionality would be perfect for that audience.

@wlandau-lilly
Collaborator

By the way: I do not mean to derail this thread, but since I mentioned RStudio::conf(2018) before, I should say that the proposal for the drake talk was denied. However, last week, I submitted drake to rOpenSci. We share the same mission of reproducibility.

@wlandau-lilly
Collaborator

@AlexAxthelm It has been a long time since we talked about this thread. I have not forgotten about you, and I am curious about your current thoughts and plans.

I really like the work you have done so far in #79. I had some new thoughts about a broader, more generalized approach, but then I remembered that evaluate_plan() and wildcard::wildcard() are already mirror images of gather_plan(). So the value added here is a solution that embraces the tidyverse and explicitly assumes that the data to be split is already loaded into memory. I think you have already done most of the work. Unfortunately, I am not very skilled or knowledgeable with dplyr, so I will need to rely on your guidance. (As a package developer and someone who has frequently written one-off analyses as packages, the non-standard evaluation of dplyr is a pain because it leads to "undefined global symbol" warnings in devtools::check(). So I usually resist it.)

@AlexAxthelm
Collaborator Author

Hi!

This has not fallen (completely) off my radar. My current plan is to follow your advice (which I can't locate right now to link to) and build up functionality modularly. Ultimately, I'm thinking of having a function which would accept as key arguments an object, and a plan. The function would then split the object, run through the plan on each split element, and then recombine.

The plan in this case would be a sort of mini-plan. Practically, this would be similar in nature to what would go into the ... argument for dplyr::do, or the FUN argument for anything in the apply family. But by changing it over to be a plan with some wildcards, we can drakeify multi-step pipelines for splits and allow for a bit more modularity (swapping out rbind(master_plan, mini_plan) for rbind(master_plan, split_function(big_data, mini_plan, ...)), for example).
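
A hedged sketch of what I mean, built on the wildcard helpers (shown with the later evaluate_plan()/gather_plan()/drake_plan() names; split_function() and its signature are hypothetical, and analyze()/clean() are placeholders):

library(drake)

split_function <- function(data_name, mini_plan, slices = 4, wildcard = "slice__") {
  # One copy of the mini-plan per slice, plus a target that recombines the results.
  slice_names <- sprintf("%s_%d", data_name, seq_len(slices))
  expanded <- evaluate_plan(mini_plan, wildcard = wildcard, values = slice_names)
  recombine <- gather_plan(expanded, target = paste0(data_name, "_results"))
  rbind(expanded, recombine)
}

# Toy usage, roughly as described above:
mini_plan <- drake_plan(result = analyze(clean(slice__)))
# master_plan <- rbind(master_plan, split_function("big_data", mini_plan))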

I'll state as a goal here (not for the first round, but definitely by the second) that the mini_plan shouldn't need wildcards, but there should be an option for some deparse/substitute/grep deep magic to automatically identify the name of the object being split in the plan, and run with that. This should make debugging pipelines easier, since it would be literally the same plan running, and would make "hotswapping" split plans with unsplit ones easier.

On the back end, I'm planning on having a series of non-exported splitting/recombining functions, along the lines of split_list()/unsplit_list(), split_df()/unsplit_df(), etc. I'm going to prioritize the base R types, since they are mostly un-weird. I know that split_tbl()/unsplit_tbl() is going to be a bit of a pain, since there is a lot of weirdness that can show up there, especially when we start talking about grouping or lazy database queries 😨 (which is, unfortunately, one of my main use cases 🙃).
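
A hedged sketch of the data-frame pair, just to show the level of machinery I mean (names as above; bodies are illustrative only, and the tibble/grouped versions would need more care):

split_df <- function(x, slices = 4) {
  # Contiguous, roughly equal-sized row chunks.
  split(x, cut(seq_len(nrow(x)), breaks = slices, labels = FALSE))
}

unsplit_df <- function(pieces) {
  do.call(rbind, unname(pieces))
}

pieces <- split_df(mtcars, slices = 3)
nrow(unsplit_df(pieces)) == nrow(mtcars)  # TRUE: rows come back in the original order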

All of what I have here is primarily oriented around embarrassingly parallel problems, but that doesn't mean a well-motivated and clever person couldn't write some nightmare-fuel wildcards in the mini_plan to allow dependencies between the parallel pipelines; let's call that a distant pipe dream.

I'm pretty packed through the middle of December, 2017, but hopefully I'll be able to make some progress on this starting after that. I'll submit an update to #79 when I've got a stable function that at least works for splitting lists and ungrouped dataframes.

Side note: the dplyr NSE/SE stuff is kind of a nightmare. I've usually gotten around it with something along the lines of

data %>%
  mutate(rlang::UQ(as.name("foobar")) := rlang::UQ(as.name("foo")) + rlang::UQ(as.name("bar")))

where foo, bar, and foobar are variables holding column names. It's not pretty, though, and I'm lamenting the deprecation of the select_/mutate_/summarise_ family. For the base R split functions (list, data.frame), I'm planning on sticking with base R.
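
For reference, the same workaround with the newer bang-bang syntax (assuming rlang >= 0.2 and dplyr >= 0.7; the column names are just illustrative):

library(dplyr)
library(rlang)

foo <- "disp"
bar <- "hp"
foobar <- "disp_plus_hp"

mtcars %>%
  mutate(!!sym(foobar) := !!sym(foo) + !!sym(bar))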

@wlandau
Member

wlandau commented Nov 27, 2017

Glad to hear you are still planning to continue. I am very interested, and this is something I would probably not implement myself.

I agree that splitting/unsplitting separately for each data type seems like the way to go and that dplyr/group_by is its own special case.

@wlandau
Member

wlandau commented Feb 5, 2018

Seems like #233 could replace this one. Thoughts?

@AlexAxthelm
Collaborator Author

TL;DR: This could probably work, but I have concerns about overly large targets.

re #233: I think that you might be right here. Since I opened this issue, the work I had been planning to use it for has shifted: I've changed from using flat files loaded into memory to working with DBI and SQL as a data store, as a workaround for objects not fitting into system memory. I've been thinking carefully about how #233 might work differently from what I'm considering in this issue, and as near as I can tell, #233 is a lot of good reorganizing of how complicated workflows are constructed, but I would want to know more about the actual execution plans.

My original objective with this issue was to parallelize large computations across targets, so that by using many small targets, I could spread similar computations across time and processors. I'm concerned that a plan such as the one in #233 is the opposite of that. As an example, building the example plan through analyses lets me see this:

> analyses$dataset[[1]] %>% object.size
1544 bytes
> analyses
Source: local data frame [4 x 3]
Groups: <by row>

# A tibble: 4 x 3
                dataset    reg   result
                 <list> <list>   <list>
1 <data.frame [48 x 2]>  <fun> <S3: lm>
2 <data.frame [48 x 2]>  <fun> <S3: lm>
3 <data.frame [64 x 2]>  <fun> <S3: lm>
4 <data.frame [64 x 2]>  <fun> <S3: lm>
> analyses %>% object.size()
120160 bytes

I know that comparing object sizes in R is complicated business, but it looks like, at the least, the analyses data frame will be the sum of the sizes of each distinct dataset. This could be somewhat mitigated by using transmute instead of mutate, but I worry that for many replicates or large datasets this could get unwieldy without some back-end lazy-evaluation magic.
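
A toy illustration of the size difference I mean (not the actual #233 plan; mtcars stands in for a real dataset):

library(dplyr)
library(purrr)
library(tibble)

datasets <- tibble(dataset = list(mtcars, mtcars))

# mutate() keeps the dataset column alongside the fits...
with_data <- datasets %>%
  mutate(fit = map(dataset, ~ lm(mpg ~ wt, data = .x)))

# ...while transmute() keeps only the fits.
fits_only <- datasets %>%
  transmute(fit = map(dataset, ~ lm(mpg ~ wt, data = .x)))

object.size(with_data)  # carries the datasets as well as the models
object.size(fits_only)  # just the fitted models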

I'm not sure how you would handle this type of thing without pretty heavy changes to make(), which would have to look for something like rowwise() or the group_by() family and break a single target down. If @krlmlr has suggestions that I'm not seeing, that would be great.

If you want to close this issue, I'm totally on board with that. When I get some time, I am planning to add a vignette about using drake with DBI; that workflow makes a lot of what I was planning here less urgent, so maybe the need for something like this doesn't actually exist, although I'm hesitant to offer "use databases" as a solution to overly large targets.

@wlandau
Member

wlandau commented Feb 6, 2018

I'd love a drake/DBI vignette! I look forward to the PR. Incidentally, #227 and #236 will make it possible to seriously use storr_dbi() for projects with HPC.

Since you gave permission, I am closing this issue in favor of #233.

@wlandau wlandau closed this as completed Feb 6, 2018
@AlexAxthelm
Collaborator Author

Should I add it as a separate vignette, or include it in the best-practices one? It seems like it might be a bit of an edge use case. Also, doesn't storr_dbi limit drake to one job? I don't yet know how that would play with DBI-based targets.

@wlandau
Member

wlandau commented Feb 6, 2018

I think relatively few drake users will also use DBI, so I think it could be its own vignette. But if you don't think you will need much space and you think the lessons generalize, then we could think about adding it to the best practices one. storr_dbi() is not threadsafe, and it currently limits drake to one job. However, for a future-powered scheduler with an option to make the master process do all the caching, we can start to scale with it. Of course, if storage is the bottleneck, then the parallel computing won't really help. But I think this could aid projects with large computations and small-ish data.
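
For concreteness, a hedged sketch of the storr_dbi() setup (assuming RSQLite; plan stands for whatever drake plan the project uses, and the table names are arbitrary):

library(DBI)
library(storr)
library(drake)

con <- DBI::dbConnect(RSQLite::SQLite(), "drake-cache.sqlite")
cache <- storr::storr_dbi(tbl_data = "data", tbl_keys = "keys", con = con)

# Not thread-safe, so keep the build at one job for now.
make(plan, cache = cache, jobs = 1)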

@wlandau
Member

wlandau commented Feb 6, 2018

Let me rephrase that: I predict that relatively few drake users will also use DBI directly. DBI itself has several thousand CRAN downloads per day, so clearly it is one of the most consumed R packages to date.

@wlandau
Member

wlandau commented Nov 3, 2019

@AlexAxthelm, FYI: #1042, #1042 (comment).
