DSL based on dplyr-like verbs? #233
Comments
@AlexAxthelm: Just to clarify: We use these verbs only to describe the workflow, the actual execution plan will be a full expansion. But if we know that e.g. all |
You may not have a clear answer for this, but as a design question, would each element (row) inside one of these tibble targets show separately on |
These questions are great! We have an internal graph, there each row in |
That sounds awesome. I'm fully in support of this over #77 then. A feature that I would find useful as a user would be an accessible estimate of how outdated a target is, e.g. 7 of 9 elements need to be rebuilt. Other than that, I think all my concerns are met. |
I am having trouble understanding this discussion, possibly because I do not use |
That's a point I haven't stressed enough. We use the DSL to avoid plan expansion for as long as we possibly can. |
That would certainly gain us efficiency, and it would solve the catch-22 from #77 ( |
Thankfully, none of this should affect the cache, which is the most sensitive component of all when it comes to back compatibility. |
@krlmlr, to confirm, the This seems like there is potential for a lot of good power here, but I'm concerned about imposing a dplyr point-of-view on an otherwise agnostic tool. |
For the sake of slow deprecation, I think we will get the chance to see how well the two interfaces coexist. Ideally, I would like There is nothing incorrect about wildcard templating, but I think we could eventually offload it to the wildcard package and extend |
I'd say everything's a tibble, but I'm not sure about details here. I need to take a closer look at wildcard. |
There's not actually that much to look at, it's a simple idea I took out of remakeGenerator. |
I would be in favor of turning ordinary workflow data frames into tibbles as early on as possible. It really is about time. Speaking of tibbles, there was a discussion somewhere about fixed column width printing, but I can't seem to find it. Workflow plan commands can get long. |
Character columns take as much space as they can get in tibble, but embedded newlines are not a problem. Does that answer your question? |
I think I finally figured out how to express my concerns about all targets being tibbles: What happens when I want to build something which is normally a tibble?
drake_plan(
  foo = mtcars %>% as.tibble() %>% filter(am == 1)
)
Will this try to bring all of the parallelism to bear on each row of the tibble? Is there a way to separate a tibble which is "just a tibble" from one that is "a drake_plan™ tibble"? What happens if I want to use NSE to build my plan as a tibble outside of drake_plan? |
We might need to invent verbs like |
I am all for this! Dynamic branching ("delayed expansion"?) is very exciting and I recently ran into a problem with my data work that required it. I think an entirely different set of functions should be used, so that it's clear what is a tibble operation within a target, and which is an operation involving many drake targets. I suppose these functions would only be valid inside a target definition, like
plan_drake(
  small = simulate(48),
  large = simulate(64),
  analyses = expand_target(
    reg(dataset),
    # crossing() is implicit
    reg = list(reg1, reg2),
    dataset = list(small, large)
  ),
  summaries = expand_target(
    # I wasn't sure why you had fun(dataset, result) here
    # Not yet sure how to include that in this proposal
    fun(result),
    fun = list(coefficients, residuals),
    result = analyses
  ),
  # gather_targets evaluates to a tibble
  # with a column for every expansion term
  # used previously I guess? & value column
  winners = gather_targets(summaries) %>%
    group_by(dataset, fun) %>%
    summarize(winner = min(value))
)
Expanding on this idea of special functions within target definitions, I've been imagining a feature where you could specify file targets & file dependencies from target defs instead of messing around with the single-quotes vs double-quotes thing. Or even triggers. EDIT: krlmlr had the same idea in #232 oops
plan_drake(
  imported_data = read_csv(target_filedep("data.csv")),
  report.md = target_file(knit(x)),
  always_build = target_trigger("always", fun(x))
) |
Update: I started a GitHub project for this. I have not worked on the DSL at all this past year, but I do care a lot about it. From my end, the size and complexity of In the initial stages of the DSL, because of the sheer scope of @krlmlr's idea, I would prefer to treat the API as totally separate from how targets are declared and built. It is relatively easy to play around with how As |
@krlmlr, @AlexAxthelm, @dapperjapper, and @rkrug: For #233 (comment), I think we can use something that
library(drake)
drake_plan(
  small = target(simulate(48), data = small),
  large = target(simulate(64), data = large),
  reg = target(
    reg_fun(data),
    do = crossing,
    by = list(reg_fun, data),
    reg_fun = list(reg1, reg2)
  ),
  summary = target(
    sum_fun(data, reg),
    do = crossing,
    by = list(sum_fun, reg),
    sum_fun = list(coefficients, residuals)
  ),
  winners = target(
    min(summary),
    do = summarize,
    by = list(data, sum_fun)
  )
)
#> # A tibble: 5 x 7
#>   target  command      data  do      by         reg_fun    sum_fun
#>   <chr>   <chr>        <chr> <chr>   <chr>      <chr>      <chr>
#> 1 small   simulate(48) small <NA>    <NA>       <NA>       <NA>
#> 2 large   simulate(64) large <NA>    <NA>       <NA>       <NA>
#> 3 reg     reg_fun(dat… <NA>  crossi… list(reg_… list(reg1… <NA>
#> 4 summary sum_fun(dat… <NA>  crossi… list(sum_… <NA>       list(coefficien…
#> 5 winners min(summary) <NA>  summar… list(data… <NA>       <NA>
Created on 2019-01-13 by the reprex package (v0.2.1)
As long as I am thinking out loud:
...though |
A big thanks to @krlmlr for the |
I will keep iterating on this. We might find a sweet spot with the right combination of language and custom columns. An improvement:
library(drake)
drake_plan(
  small = simulate(48),
  large = simulate(64),
  reg = target(
    reg_fun(data),
    transform = cross(reg_fun = c(reg1, reg2), data = c(small, large))
  ),
  summary = target(
    sum_fun(data, reg),
    transform = cross(sum_fun = c(coefficients, residuals), reg)
  ),
  winners = target(
    min(summary),
    transform = summarize(data, sum_fun)
  )
)
#> # A tibble: 5 x 3
#>   target  command           transform
#>   <chr>   <chr>             <chr>
#> 1 small   simulate(48)      <NA>
#> 2 large   simulate(64)      <NA>
#> 3 reg     reg_fun(data)     cross(reg_fun = c(reg1, reg2), data = c(small,…
#> 4 summary sum_fun(data, re… cross(sum_fun = c(coefficients, residuals), re…
#> 5 winners min(summary)      summarize(data, sum_fun)
Created on 2019-01-14 by the reprex package (v0.2.1) |
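To make the chained expansion concrete, here is a rough base R sketch of how the two `cross()` steps above could be enumerated by hand. It is not drake's implementation: the target naming scheme (for example `reg_reg1_small`) and the use of `expand.grid()` and `merge()` are assumptions made purely for illustration.

```r
# Step 1: cross reg_fun with data (illustrative naming scheme, not drake's).
reg <- expand.grid(
  reg_fun = c("reg1", "reg2"),
  data = c("small", "large"),
  stringsAsFactors = FALSE
)
reg$target <- paste("reg", reg$reg_fun, reg$data, sep = "_")
reg$command <- sprintf("%s(%s)", reg$reg_fun, reg$data)

# Step 2: cross sum_fun with every expanded reg target.
# merge(..., by = NULL) returns the Cartesian product of the two data frames.
sum_funs <- data.frame(
  sum_fun = c("coefficients", "residuals"),
  stringsAsFactors = FALSE
)
summaries <- merge(sum_funs, reg[c("reg_fun", "data", "target")], by = NULL)
names(summaries)[names(summaries) == "target"] <- "reg"
summaries$target <- paste("summary", summaries$sum_fun, summaries$reg, sep = "_")
summaries$command <- sprintf(
  "%s(%s, %s)", summaries$sum_fun, summaries$data, summaries$reg
)

print(summaries[c("target", "command")])
```

Note that `summaries` carries the `data` and `reg_fun` columns along from `reg`, which is what would let a later `summarize(data, sum_fun)` step group by them.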
See #674 for an experimental API inspired by the proposed DSL. The implementation is lightweight, and because it relies on a custom "transform" column in the plan, it does not interfere with any other functionality (internals or API). |
New capability: define custom groupings with a
library(drake)
plan <- drake_plan(
  small = simulate(48),
  large = simulate(64),
  reg1 = target(
    reg_fun(data),
    transform = cross(data = c(small, large)),
    group = reg
  ),
  reg2 = target(
    reg_fun(data),
    transform = cross(data = c(small, large)),
    group = reg
  ),
  winners = target(
    min(reg),
    transform = summarize(data),
    a = 1
  )
)
plan
#> # A tibble: 8 x 3
#>   target        command                                                   a
#>   <chr>         <chr>                                                 <dbl>
#> 1 small         simulate(48)                                             NA
#> 2 large         simulate(64)                                             NA
#> 3 reg1_small    reg_fun(small)                                           NA
#> 4 reg1_large    reg_fun(large)                                           NA
#> 5 reg2_large    reg1_large_fun(large)                                    NA
#> 6 reg2_small    reg1_small_fun(small)                                    NA
#> 7 winners_large min(reg1_large = reg1_large, reg2_large = reg2_large)     1
#> 8 winners_small min(reg1_small = reg1_small, reg2_small = reg2_small)     1
drake_plan_source(plan)
#> drake_plan(
#>   small = simulate(48),
#>   large = simulate(64),
#>   reg1_small = reg_fun(small),
#>   reg1_large = reg_fun(large),
#>   reg2_large = reg1_large_fun(large),
#>   reg2_small = reg1_small_fun(small),
#>   winners_large = target(
#>     command = min(reg1_large = reg1_large, reg2_large = reg2_large),
#>     a = 1
#>   ),
#>   winners_small = target(
#>     command = min(reg1_small = reg1_small, reg2_small = reg2_small),
#>     a = 1
#>   )
#> )
config <- drake_config(plan)
vis_drake_graph(config)
Created on 2019-01-16 by the reprex package (v0.2.1) |
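The `winners_*` rows above can be read as a group-by over the expanded `reg` targets. Below is a minimal base R sketch of that grouping step. It assumes the expanded target names and their `data` level are already known: the `targets` data frame is made up for illustration, and the generated calls use unnamed arguments, unlike the output above.

```r
# Hypothetical bookkeeping table: which expanded target belongs to which
# level of the grouping variable `data`.
targets <- data.frame(
  target = c("reg1_small", "reg1_large", "reg2_small", "reg2_large"),
  data = c("small", "large", "small", "large"),
  stringsAsFactors = FALSE
)

# Group the target names by `data`, then build one min(...) call per group.
groups <- split(targets$target, targets$data)
commands <- lapply(groups, function(members) {
  as.call(c(list(as.name("min")), lapply(members, as.name)))
})
names(commands) <- paste0("winners_", names(groups))

# Show the generated commands as text.
vapply(commands, deparse, character(1))
```

Each element of `commands` is an unevaluated call, so nothing is computed until a scheduler decides to build the corresponding target.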
@wlandau What version is this work targeted for? |
The very next release: 7.0.0 |
My two cents to this (FYI): So do you for now have transform and grouping? Is there likely other functionality in the future? I am not sure if it does not overload the target argument. Also, I am not sure how these are verbs and domain specific (just mentioning this because that's how I understood the initial idea). Just from looking at the syntax, the idea of separating the plan creation with a wild card and the "folding it up" as it was before was simpler to digest for me. |
Yes. It is experimental (as the documentation now indicates) but behavior seems correct so far.
Dynamic branching is high on the list for long-term. But for this API specifically, hopefully we will not need more features. I would prefer to keep it simple, and it already seems to cover the vast majority of the use cases for the map/reduce functions and wildcards. But I could be convinced otherwise.
The
Yes, I agree. That does not bother me so much. At the interface level specifically, this still solves the same problem as the DSL.
I plan to keep the wildcard functions around for a long time. My personal experience with this new interface is actually more positive. It takes effort and bookkeeping to wrangle all those subplans and wildcards. I find it much easier to use transformations and grouping in a single call to |
Thanks @wlandau for the detailed answer, that makes sense. |
You are welcome. After talking with @krlmlr in person yesterday at RStudio conf, I have decided to think of this approach as the proper DSL. We can open a different issue for dynamic branching. Major changes needed to consider this issue solved:
command <- quote(reg_fun(x, "y", x, k))
eval(call("substitute", command, list(x = "str", k = quote(sym))))
#> reg_fun("str", "y", "str", sym)
Created on 2019-01-18 by the reprex package (v0.2.1) |
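The `substitute()` trick above extends naturally to a whole grid of values. The sketch below applies it once per row of a crossed grid. The helper `fill_command()` and the target naming scheme are hypothetical and only meant to show the idea, not how drake does it internally.

```r
# Hypothetical helper: substitute one set of placeholder values into a
# quoted command, using the same eval(call("substitute", ...)) idiom.
fill_command <- function(command, values) {
  eval(call("substitute", command, values))
}

command <- quote(reg_fun(data))
grid <- expand.grid(
  reg_fun = c("reg1", "reg2"),
  data = c("small", "large"),
  stringsAsFactors = FALSE
)

commands <- lapply(seq_len(nrow(grid)), function(i) {
  # Convert strings to symbols so the result is reg1(small),
  # not "reg1"("small").
  values <- lapply(grid[i, , drop = FALSE], as.name)
  fill_command(command, values)
})
names(commands) <- paste("reg", grid$reg_fun, grid$data, sep = "_")

# Show the generated commands as text.
vapply(commands, deparse, character(1))
```

Working on language objects rather than strings means a placeholder such as `reg` cannot accidentally match inside a longer name such as `reg_fun`.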
Also, we should add a |
By the way, tidy evaluation works in the DSL. You can generate super large plans this way. A taste:
sms <- rlang::syms(letters)
drake::drake_plan(x = target(f(char), transform = map(char = !!sms)))
#> # A tibble: 26 x 2
#>    target command
#>    <chr>  <chr>
#>  1 x_a    f(a)
#>  2 x_b    f(b)
#>  3 x_c    f(c)
#>  4 x_d    f(d)
#>  5 x_e    f(e)
#>  6 x_f    f(f)
#>  7 x_g    f(g)
#>  8 x_h    f(h)
#>  9 x_i    f(i)
#> 10 x_j    f(j)
#> # … with 16 more rows
Created on 2019-01-20 by the reprex package (v0.2.1) |
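For readers without rlang at hand, base R's `bquote()` gives a rough analog of what `!!` does here: the list of symbols is spliced into the `transform` expression before any expansion happens. The expansion loop below is a hand-written stand-in under that assumption, not drake's code, and it uses only three letters to keep the output short.

```r
# Base R analog of rlang::syms(): a list of symbols.
sms <- lapply(letters[1:3], as.name)

# .() in bquote() plays the role of !! and embeds the list in the call.
# map() is never called; the expression is only inspected.
transform_expr <- bquote(map(char = .(sms)))
values <- as.list(transform_expr)[["char"]]

# One target name and one command per spliced symbol.
targets <- paste0("x_", vapply(values, as.character, character(1)))
commands <- vapply(
  values,
  function(v) deparse(bquote(f(.(v)))),
  character(1)
)

data.frame(target = targets, command = commands, stringsAsFactors = FALSE)
```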
Cool. Should functions like |
Fortunately, we avoid namespace conflicts entirely because map() is in 'transform', not the command. The DSL code is analyzed statically and not executed in the usual sense. |
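As a small illustration of what "analyzed statically" can mean in R, the sketch below pulls the verb, the grouping variable, and the symbols out of a quoted `transform` expression without ever calling `map()`. The variable names are made up, and this is only a sketch of the general technique, not drake's parser.

```r
# A quoted transform: nothing here is evaluated, so no function named
# map() needs to exist, and purrr::map() is irrelevant.
transform_expr <- quote(map(char = c(a, b)))

# The verb is the function position of the call, read as text.
verb <- as.character(transform_expr[[1]])

# The grouping variables are the argument names of the call.
grouping_vars <- names(as.list(transform_expr))[-1]

# The levels are the symbols mentioned inside the call (function names
# such as map and c are excluded by all.vars()).
symbols_used <- all.vars(transform_expr)

print(list(verb = verb, grouping_vars = grouping_vars, symbols = symbols_used))
```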
Ok, I did not know that. I think the average user also may not know it. If it is not an exported function, I guess and there is also no documentation exported, i.e. |
In the specific case of Hopefully we can make the documentation friendly and thorough. I just pushed an update to the |
tibble(), crossing(), mutate(), group_by(), summarize()
plan_analyses() and plan_summaries()
Sketch for "basic" example
We might want to use our own verbs, this is the current tidyverse nomenclature to communicate semantics.