A new internal fast-lookup data structure for target information #440

wlandau · 2018-06-29T13:15:20Z

Lessons from #435 and #283:

drake spends a lot of time repeatedly looking up target-level information it should already know.
The dependency network is inflexible.

Explanation

I think these and similar problems come from the limitations of drake's data structures. The plan is great for users, but lookup is slow, and it contains only indirect information about each target's dependencies. The graph has complete dependency information, but it does not have other metadata (igraph attributes are a pain), and traversing the whole graph for dependencies is slow.

Solution

I want to put all target-level metadata in a unified fast-lookup data structure. This new config$meta will replace individual outputs of drake_meta() flying around, and it will decrease reliance on the plan and graph. For fast lookup, I want to base this data structure on an environment. For convenience, I want it to have a storr-like API. Best of both worlds: a storr_environment() cache. (Thank you @richfitz for making it available!)

There is a lot of refactoring to do, but I think this will make drake a whole lot faster and give it a cleaner code base.

The text was updated successfully, but these errors were encountered:

wlandau · 2018-06-29T13:43:54Z

config$meta storr namespaces at the level of individual data objects:

name: character string, names of the data objects and/or build steps
is_build_step: logical, whether the name is also the name of an import/target build step
imported: logical, imported from the user's workspace/environment or generated with a drake command?
file: whether the item is a file or simply a generic R object.
missing: whether the item is present in config$envir
hash: the object's fingerprint.
mtime: file modification time (NULL for non-files)

Additional namespaces for data objects that are also import/target build steps:

command: command from the drake_plan()
seed: RNG seed
trigger: the trigger from the drake_plan() (or the default trigger)
object_in: names of non-file dependencies
file_out: names of any file outputs (potentially more than one)
file_in: names of any non-knitr input files
knitr_in: names of any knitr input files
dependency_hash: The hash used to trip the depends trigger.
file_hash: The hash used to trip the file trigger.

All the namespaces that do not exist (yet) shall explicitly be set to NULL in order to avoid lookup errors.

wlandau · 2018-06-30T02:33:48Z

We should also maintain overall all_targets and all_imports lists in config.

kendonB · 2018-07-01T00:10:40Z

Forgive me if this is a dumb question, but what's the barrier to just using the plan data.frame for everything and just using R's standard [ lookups or dplyr::filter? These aren't nearly as fast as a hash table but they don't require building the table, which you would have to do for every make, I presume. I couldn't seem to find any benchmarks comparing storr_environment to something simple like filter but maybe something is out there.

ref for benchmarking [, filter and the data.table equivalents: https://appsilondatascience.com/blog/rstats/2017/03/02/r-fast-lookup.html

wlandau · 2018-07-01T20:40:30Z

There is no barrier, technically speaking. I am glad you checked with richfitz/storr#81. From these results, plain environments seem to win out in terms of lookup speed, and I just assumed storr_environment()s would offer similar speed and more convenience.

From richfitz/storr#81 (comment), I now think the right approach is to have config$meta be a plain new.env() with all the individual drake_meta() lists (augmented with dependency information). Lookup speed, code style, and my ability to solve #435 and #283 will all improve.

wlandau · 2018-07-04T12:44:06Z

Edit: existing drake_meta() info needs to be computed at the last minute, while dependency information can be resolved all at once up front. I'm thinking this lookup structure can just focus on being a more efficient version of config$graph, possibly even replacing config$graph at some point.

wlandau · 2018-07-04T13:06:32Z

This data structure is really a protocol for building the targets. Whereas the drake_plan() tries to be friendly to the user, the protocol tries to be friendly to the internals.

wlandau · 2018-07-04T13:38:15Z

To do: put an R6 wrapper around config$protocol with a bunch of safe accessor methods (would also help enforce inherit = FALSE in get()).

kendonB · 2018-07-04T19:39:20Z

Is it worth benchmarking get wrapped up in the R6 piece before going that way?

wlandau · 2018-07-04T22:12:20Z

Should be easy enough to check.

wlandau · 2018-07-14T12:15:52Z

Also, we really do need to think about situations where a graph is really handy: for example, prune_envir() with the "lookahead" strategy and an efficient outdated() look all the way downstream on the graph. Can the lookup object and priority queue really do the same thing?

wlandau · 2018-11-03T03:13:05Z

From #483, the complete dependency information of targets is neatly arranged in igraph attributes. With those attributes organized, I think we can revisit a lookup structure to cut back on the multitude of linear-time lookups in drake. Steps I will be taking to approach the problem:

Deprecate build_drake_graph() and replace it with drake_config(...)$graph.
Divide the old build_drake_graph() internals into create_drake_lookup() and create_drake_graph(), the latter of which will create the original graph (attributes and all) from the lookup structure.
In addition to the original igraph attributes, give the new lookup structure the target-specific elements of the plan. No more linear-time subsetting, e.g. plan$command[plan$target == "my_target]. We will be using constant-time operations like get(x = "my_target", envir = config$lookup, inherit = FALSE)$command from now on.
Once those lookups are taken care of, take the extra attributes out of the original igraph.

wlandau · 2018-11-03T03:20:34Z

I do not intend this new lookup data structure to replace the graph, at least at first, but it will be a useful supplement for speeding things up.

wlandau · 2018-11-03T03:47:48Z

I plan to encapsulate the lookup operation internally.

get_lookup <- function (target, field, config){
  info <- get(
    x = target,
    envir = config$lookup,
    inherit = FALSE
  )
  get(
    x = field,
    envir = info,
    inherit = FALSE
  )
}

Here, info is just a pointer (an environment). The two-part dereferencing may create a lot of cache misses (L1-L3 caches) but that's almost certainly not the bottleneck here.

Also, the lookup should be a read-only data structure. possible names:

lookup
specification
register
registry
roster

Leaning towards specification at this point, with a function get_spec() and no accompanying set_spec().

wlandau · 2018-11-03T16:31:52Z

Before diving headlong into this refactoring, I want to do more in-depth profiling. The magrittr pipe tends to complicate profvis output, so a first step is to completely clean out %>%.

wlandau · 2018-11-04T12:54:39Z

Thanks to #571, profvis now delivers useful results. #572 accounts for half the target-level overhead we see in drake_example("overhead"). Given the amount of work and care it would take to solve #440, I am postponing it again until it becomes an actual bottleneck.

wlandau · 2018-11-07T22:38:24Z

See #575.

wlandau · 2018-11-07T22:59:11Z

This feature may not reduce overhead, but it should simplify the architecture.

wlandau · 2018-11-08T04:35:39Z

Just started the work here.

wlandau · 2018-11-21T15:03:13Z

New favorite term for this structure: "layout". I also thought about "topology", but it seems more redundant with "graph" and implies stronger-than-necessary mathematical claims.

wlandau · 2018-11-26T22:09:14Z

Fixed via #576.

wlandau · 2019-12-16T01:17:54Z

On reflection, the term "layout" is not quite right. Let's go with "workflow specification", or "spec" for short. The spec is drake's interpretation of the plan. The drake plan is user-friendly, and all the dependency relationships are implicit. The spec is machine-friendly, and all the dependency relationships are explicit. drake_config() takes the plan and makes the spec by doing static code analysis on the commands and functions (i.e. analyze_code()).

The name change is implemented in 322785e.

wlandau added difficulty: advanced topic: performance component: internals labels Jun 29, 2018

wlandau mentioned this issue Jun 29, 2018

Huge overhead in adjacent_vertices in a moderately sized project #435

Closed

wlandau added the status: priority label Jun 29, 2018

This was referenced Jun 30, 2018

Multiple output files per target #350

Closed

Allow multiple output files for each command #283

Closed

wlandau removed the topic: internals label Jun 30, 2018

wlandau mentioned this issue Jun 30, 2018

Consider being more selective in save() in prepare_distributed in run_future_lapply #436

Closed

wlandau added status: priority branch: i440 👷 🔧 and removed status: priority labels Jun 30, 2018

wlandau mentioned this issue Jun 30, 2018

Send less data to remote transient workers #434

Closed

kendonB mentioned this issue Jul 1, 2018

Question: Do you have benchmarks for driver_environment vs dplyr::filter? richfitz/storr#81

Closed

kendonB mentioned this issue Jul 3, 2018

"future" parallelism takes a really long time to get going when many upstream targets are up to date #448

Closed

wlandau mentioned this issue Jul 4, 2018

Consider removing or optionalizing the gc() in this while loop #454

Closed

wlandau pushed a commit that referenced this issue Jul 4, 2018

Start on #440

3112ee7

wlandau mentioned this issue Jul 4, 2018

Initial work on a fast-lookup build protocol #457

Closed

7 tasks

wlandau added status: may revisit and removed status: priority branch: i440 🔧 labels Jul 5, 2018

wlandau added the status: may revisit label Jul 14, 2018

wlandau mentioned this issue Jul 28, 2018

Embrace igraph vertex attributes for storing code analysis information #483

Closed

wlandau removed the status: may revisit label Aug 19, 2018

wlandau self-assigned this Nov 3, 2018

wlandau reopened this Nov 3, 2018

wlandau mentioned this issue Nov 3, 2018

Stop using the magrittr pipe in the internals #570

Closed

wlandau closed this as completed Nov 4, 2018

wlandau added the status: may revisit label Nov 4, 2018

wlandau reopened this Nov 7, 2018

wlandau mentioned this issue Nov 7, 2018

Bare-bones scheduling and caching #575

Closed

wlandau removed the status: may revisit label Nov 8, 2018

wlandau pushed a commit that referenced this issue Nov 8, 2018

Sketch #440

3974d12

wlandau added the status: priority label Nov 8, 2018

wlandau mentioned this issue Nov 9, 2018

Intermediate data structure to store all the code analysis #576

Merged

6 tasks

wlandau closed this as completed Nov 26, 2018

wlandau mentioned this issue Dec 8, 2018

Some internal cleanup: custom plan columns and the timeout argument #607

Merged

6 tasks

wlandau mentioned this issue Dec 16, 2019

Clean up workflow spec #1114

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A new internal fast-lookup data structure for target information #440

A new internal fast-lookup data structure for target information #440

wlandau commented Jun 29, 2018

wlandau commented Jun 29, 2018 •

edited

Loading

wlandau commented Jun 30, 2018

kendonB commented Jul 1, 2018

wlandau commented Jul 1, 2018 •

edited

Loading

wlandau commented Jul 4, 2018

wlandau commented Jul 4, 2018

wlandau commented Jul 4, 2018

kendonB commented Jul 4, 2018

wlandau commented Jul 4, 2018

wlandau commented Jul 14, 2018

wlandau commented Nov 3, 2018 •

edited

Loading

wlandau commented Nov 3, 2018 •

edited

Loading

wlandau commented Nov 3, 2018 •

edited

Loading

wlandau commented Nov 3, 2018

wlandau commented Nov 4, 2018

wlandau commented Nov 7, 2018

wlandau commented Nov 7, 2018

wlandau commented Nov 8, 2018

wlandau commented Nov 21, 2018

wlandau commented Nov 26, 2018

wlandau commented Dec 16, 2019

A new internal fast-lookup data structure for target information #440

A new internal fast-lookup data structure for target information #440

Comments

wlandau commented Jun 29, 2018

Lessons from #435 and #283:

Explanation

Solution

wlandau commented Jun 29, 2018 • edited Loading

wlandau commented Jun 30, 2018

kendonB commented Jul 1, 2018

wlandau commented Jul 1, 2018 • edited Loading

wlandau commented Jul 4, 2018

wlandau commented Jul 4, 2018

wlandau commented Jul 4, 2018

kendonB commented Jul 4, 2018

wlandau commented Jul 4, 2018

wlandau commented Jul 14, 2018

wlandau commented Nov 3, 2018 • edited Loading

wlandau commented Nov 3, 2018 • edited Loading

wlandau commented Nov 3, 2018 • edited Loading

wlandau commented Nov 3, 2018

wlandau commented Nov 4, 2018

wlandau commented Nov 7, 2018

wlandau commented Nov 7, 2018

wlandau commented Nov 8, 2018

wlandau commented Nov 21, 2018

wlandau commented Nov 26, 2018

wlandau commented Dec 16, 2019

wlandau commented Jun 29, 2018 •

edited

Loading

wlandau commented Jul 1, 2018 •

edited

Loading

wlandau commented Nov 3, 2018 •

edited

Loading

wlandau commented Nov 3, 2018 •

edited

Loading

wlandau commented Nov 3, 2018 •

edited

Loading