drake friendly alternative to dplyr::do #77
@AlexAxthelm So this is a way to break up a big data frame and process the chunks into multiple targets? That sounds super cool, I love it! One small suggestion: it may be slightly more future-proof to use …
Pretty much. I'll write up a little example code and the first few tests today, and go ahead and make a PR. I wrote it to break up my dataframes, after I kept hitting some hard limits (…).
I realized that it isn't playing well with multi-step pipe chains, and was only including the base object in the command in the plan. Unchecked that box.
By the way, I just submitted a proposal to give a contributed talk on drake at RStudio::conf(2018). It is a long shot, but more than worth a try, and we should know by Oct 2 if the talk is accepted. This functionality would be perfect for that audience.
By the way: I do not mean to derail this thread, but since I mentioned RStudio::conf(2018) before, I should say that the proposal for …
@AlexAxthelm It has been a long time since we talked about this thread. I have not forgotten about you, and I am curious about your current thoughts and plans. I really like the work you have done so far in #79. I had some new thoughts about a broader, more generalized approach, but then I remembered that …
Hi! This has not fallen (completely) off my radar. My current plan is to follow your advice (which I can't locate right now to link to) and build up functionality modularly. Ultimately, I'm thinking of having a function which would accept as key arguments an object and a plan. The function would then split the object, run through the plan on each split element, and then recombine. The plan in this case would be a sort of mini-plan. Practically, this would be similar in nature to what would go into the …

I'll state as a goal here (not for the first round, but definitely by the second) that the …

On the backend, I'm planning on having a series of splitting/recombining functions (non-exported), along the lines of …

All of what I have here is primarily oriented around embarrassingly parallel problems, but that doesn't mean that a well-motivated and clever person couldn't make some nightmare-fuel wildcards in the `mini_plan` that would allow for dependencies between the parallel pipelines, but let's call that a distant pipe dream.

I'm pretty packed through the middle of December 2017, but hopefully I'll be able to make some progress on this starting after that. I'll submit an update to #79 when I've got a stable function that at least works for splitting lists and ungrouped dataframes.

Side note: the dplyr NSE/SE stuff is kind of a nightmare. I've usually gotten around it using something along the lines of

```r
data %>%
  mutate(rlang::UQ(as.name("foobar")) := rlang::UQ(as.name("foo")) + rlang::UQ(as.name("bar")))
```

where …
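For context, the same workaround can be sketched in the current tidy-eval style, where `!!` with `rlang::sym()` plays the role of `rlang::UQ(as.name(...))`. The function name `add_cols` and the columns `foo`, `bar`, and `foobar` are just placeholders carried over from the snippet above, not anything from drake:

```r
library(dplyr)
library(rlang)

# Build column names programmatically as strings, then splice them into
# mutate() with !! (unquote); sym() turns a string into a symbol, the
# tidy-eval counterpart of as.name().
add_cols <- function(data, out, a, b) {
  data %>% mutate(!!sym(out) := !!sym(a) + !!sym(b))
}

add_cols(tibble(foo = 1:3, bar = 4:6), "foobar", "foo", "bar")
```

This keeps the standard-evaluation interface (plain character column names) while still dispatching to dplyr's NSE verbs.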
Glad to hear you are still planning to continue. I am very interested, and this is something I would probably not implement myself. I agree that splitting/unsplitting separately for each data type seems like the way to go and that dplyr/group_by is its own special case.
Seems like #233 could replace this one. Thoughts?
TL;DR: This could probably work, but I have concerns about overly large targets.

re #233: I think that you might be right here. Since I opened this issue, the work that I had been planning on using it for has shifted, and I've changed from using flat files and loading in memory, to working with …

My original objective with this issue was to parallelize large computations across targets, so that by using many small targets, I could perform similar computations across time and processors. I'm concerned that a plan such as the one in #233 is the opposite of that. As an example, building the example plan through …

```r
> analyses$dataset[[1]] %>% object.size
1544 bytes
> analyses
Source: local data frame [4 x 3]
Groups: <by row>

# A tibble: 4 x 3
                dataset    reg   result
                 <list> <list>   <list>
1 <data.frame [48 x 2]>  <fun> <S3: lm>
2 <data.frame [48 x 2]>  <fun> <S3: lm>
3 <data.frame [64 x 2]>  <fun> <S3: lm>
4 <data.frame [64 x 2]>  <fun> <S3: lm>
> analyses %>% object.size()
120160 bytes
```

I know that comparing object sizes in R is complicated business, but it looks like at the least, the …

I'm not sure how you would handle this type of thing without pretty heavy changes to …

If you want to close this issue, I'm totally on board with that. When I get some time, I am planning on adding a vignette about using drake with …
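The size tradeoff being described can be sketched in base R without any drake machinery. The names `chunks` and `combined` are illustrative assumptions: separate chunk objects stay small individually, while one object that bundles every chunk (like a plan row holding all datasets in a list-column) is necessarily at least as large as the sum of its parts:

```r
# Split mtcars into 4 chunks; each chunk could become its own small target.
chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

# One combined object holding every chunk, analogous to a single target
# that stores all of the datasets together in a list-column.
combined <- list(dataset = chunks)

print(object.size(chunks[[1]]))  # size of one small chunk
print(object.size(combined))     # size of the single big object
```

With many small targets, a rebuild after a failure only re-hashes and re-stores the affected chunk, not the whole combined object.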
Should I add it as a separate vignette, or include it in the best practices one? It seems that it might be a bit of an edge use-case. Also, doesn't …
I think relatively few …

Let me rephrase that: I predict that relatively few …
@AlexAxthelm, FYI: #1042, #1042 (comment).
Most of my development is done on my local machine, which chokes on datasets that most people wouldn't even think of as "big data". One of the reasons that I was drawn to `drake` in the first place was the idea that I could segment my data and let my machine work on each part individually, making checkpoints along the way so that if it does choke, I don't have to re-run the whole thing.

So I wrote a series of functions that give me similar functionality to `dplyr::do()`, but in drake. Specifically, I've written `drake_split()` and `drake_unsplit()` commands, which output plans. This allows me to ~~horribly misuse~~ creatively exploit the `analyses()` function to process my small data chunks in parallel, including tasks such as validating (using `assertr`) and cleaning.

I've got the bones of this "issue" solved, but I don't want to turn it into a PR until I can write some tests and maybe a vignette or example (hopefully later this week). Mostly, I'm opening this to solicit requests for functionality and to put it on the roadmap for later milestones.

Current/Planned features:

- `data.frame`
- `tibble`
- `data.table` (it might do this, I haven't tested at all)
- `magrittr` pipe (`mtcars %>% head %>% drake_split()`)
- `dplyr` or `base` functions for splitting/binding
- `Roxygen` docs