Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A new internal fast-lookup data structure for target information #440

Closed
wlandau opened this issue Jun 29, 2018 · 24 comments
Closed

A new internal fast-lookup data structure for target information #440

wlandau opened this issue Jun 29, 2018 · 24 comments

Comments

@wlandau
Copy link
Member

wlandau commented Jun 29, 2018

Lessons from #435 and #283:

  • drake spends a lot of time repeatedly looking up target-level information it should already know.
  • The dependency network is inflexible.

Explanation

I think these and similar problems come from the limitations of drake's data structures. The plan is great for users, but lookup is slow, and it contains only indirect information about each target's dependencies. The graph has complete dependency information, but it does not have other metadata (igraph attributes are a pain), and traversing the whole graph for dependencies is slow.

Solution

I want to put all target-level metadata in a unified fast-lookup data structure. This new config$meta will replace individual outputs of drake_meta() flying around, and it will decrease reliance on the plan and graph. For fast lookup, I want to base this data structure on an environment. For convenience, I want it to have a storr-like API. Best of both worlds: a storr_environment() cache. (Thank you @richfitz for making it available!)

There is a lot of refactoring to do, but I think this will make drake a whole lot faster and give it a cleaner code base.

@wlandau
Copy link
Member Author

wlandau commented Jun 29, 2018

config$meta storr namespaces at the level of individual data objects:

  • name: character string, names of the data objects and/or build steps
  • is_build_step: logical, whether the name is also the name of an import/target build step
  • imported: logical, imported from the user's workspace/environment or generated with a drake command?
  • file: whether the item is a file or simply a generic R object.
  • missing: whether the item is present in config$envir
  • hash: the object's fingerprint.
  • mtime: file modification time (NULL for non-files)

Additional namespaces for data objects that are also import/target build steps:

  • command: command from the drake_plan()
  • seed: RNG seed
  • trigger: the trigger from the drake_plan() (or the default trigger)
  • object_in: names of non-file dependencies
  • file_out: names of any file outputs (potentially more than one)
  • file_in: names of any non-knitr input files
  • knitr_in: names of any knitr input files
  • dependency_hash: The hash used to trip the depends trigger.
  • file_hash: The hash used to trip the file trigger.

All the namespaces that do not exist (yet) shall explicitly be set to NULL in order to avoid lookup errors.

@wlandau
Copy link
Member Author

wlandau commented Jun 30, 2018

We should also maintain overall all_targets and all_imports lists in config.

@kendonB
Copy link
Contributor

kendonB commented Jul 1, 2018

Forgive me if this is a dumb question, but what's the barrier to just using the plan data.frame for everything and just using R's standard [ lookups or dplyr::filter? These aren't nearly as fast as a hash table but they don't require building the table, which you would have to do for every make, I presume. I couldn't seem to find any benchmarks comparing storr_environment to something simple like filter but maybe something is out there.

ref for benchmarking [, filter and the data.table equivalents: https://appsilondatascience.com/blog/rstats/2017/03/02/r-fast-lookup.html

@wlandau
Copy link
Member Author

wlandau commented Jul 1, 2018

There is no barrier, technically speaking. I am glad you checked with richfitz/storr#81. From these results, plain environments seem to win out in terms of lookup speed, and I just assumed storr_environment()s would offer similar speed and more convenience.

From richfitz/storr#81 (comment), I now think the right approach is to have config$meta be a plain new.env() with all the individual drake_meta() lists (augmented with dependency information). Lookup speed, code style, and my ability to solve #435 and #283 will all improve.

@wlandau
Copy link
Member Author

wlandau commented Jul 4, 2018

Edit: existing drake_meta() info needs to be computed at the last minute, while dependency information can be resolved all at once up front. I'm thinking this lookup structure can just focus on being a more efficient version of config$graph, possibly even replacing config$graph at some point.

wlandau pushed a commit that referenced this issue Jul 4, 2018
@wlandau
Copy link
Member Author

wlandau commented Jul 4, 2018

This data structure is really a protocol for building the targets. Whereas the drake_plan() tries to be friendly to the user, the protocol tries to be friendly to the internals.

@wlandau
Copy link
Member Author

wlandau commented Jul 4, 2018

To do: put an R6 wrapper around config$protocol with a bunch of safe accessor methods (would also help enforce inherit = FALSE in get()).

@kendonB
Copy link
Contributor

kendonB commented Jul 4, 2018

Is it worth benchmarking get wrapped up in the R6 piece before going that way?

@wlandau
Copy link
Member Author

wlandau commented Jul 4, 2018

Should be easy enough to check.

@wlandau
Copy link
Member Author

wlandau commented Jul 14, 2018

Also, we really do need to think about situations where a graph is really handy: for example, prune_envir() with the "lookahead" strategy and an efficient outdated() look all the way downstream on the graph. Can the lookup object and priority queue really do the same thing?

@wlandau
Copy link
Member Author

wlandau commented Nov 3, 2018

From #483, the complete dependency information of targets is neatly arranged in igraph attributes. With those attributes organized, I think we can revisit a lookup structure to cut back on the multitude of linear-time lookups in drake. Steps I will be taking to approach the problem:

  • Deprecate build_drake_graph() and replace it with drake_config(...)$graph.
  • Divide the old build_drake_graph() internals into create_drake_lookup() and create_drake_graph(), the latter of which will create the original graph (attributes and all) from the lookup structure.
  • In addition to the original igraph attributes, give the new lookup structure the target-specific elements of the plan. No more linear-time subsetting, e.g. plan$command[plan$target == "my_target]. We will be using constant-time operations like get(x = "my_target", envir = config$lookup, inherit = FALSE)$command from now on.
  • Once those lookups are taken care of, take the extra attributes out of the original igraph.

@wlandau
Copy link
Member Author

wlandau commented Nov 3, 2018

I do not intend this new lookup data structure to replace the graph, at least at first, but it will be a useful supplement for speeding things up.

@wlandau
Copy link
Member Author

wlandau commented Nov 3, 2018

I plan to encapsulate the lookup operation internally.

get_lookup <- function (target, field, config){
  info <- get(
    x = target,
    envir = config$lookup,
    inherit = FALSE
  )
  get(
    x = field,
    envir = info,
    inherit = FALSE
  )
}

Here, info is just a pointer (an environment). The two-part dereferencing may create a lot of cache misses (L1-L3 caches) but that's almost certainly not the bottleneck here.

Also, the lookup should be a read-only data structure. possible names:

  • lookup
  • specification
  • register
  • registry
  • roster

Leaning towards specification at this point, with a function get_spec() and no accompanying set_spec().

@wlandau
Copy link
Member Author

wlandau commented Nov 3, 2018

Before diving headlong into this refactoring, I want to do more in-depth profiling. The magrittr pipe tends to complicate profvis output, so a first step is to completely clean out %>%.

@wlandau
Copy link
Member Author

wlandau commented Nov 4, 2018

Thanks to #571, profvis now delivers useful results. #572 accounts for half the target-level overhead we see in drake_example("overhead"). Given the amount of work and care it would take to solve #440, I am postponing it again until it becomes an actual bottleneck.

@wlandau
Copy link
Member Author

wlandau commented Nov 7, 2018

See #575.

@wlandau
Copy link
Member Author

wlandau commented Nov 7, 2018

This feature may not reduce overhead, but it should simplify the architecture.

wlandau pushed a commit that referenced this issue Nov 8, 2018
@wlandau
Copy link
Member Author

wlandau commented Nov 8, 2018

Just started the work here.

@wlandau
Copy link
Member Author

wlandau commented Nov 21, 2018

New favorite term for this structure: "layout". I also thought about "topology", but it seems more redundant with "graph" and implies stronger-than-necessary mathematical claims.

@wlandau
Copy link
Member Author

wlandau commented Nov 26, 2018

Fixed via #576.

@wlandau
Copy link
Member Author

wlandau commented Dec 16, 2019

On reflection, the term "layout" is not quite right. Let's go with "workflow specification", or "spec" for short. The spec is drake's interpretation of the plan. The drake plan is user-friendly, and all the dependency relationships are implicit. The spec is machine-friendly, and all the dependency relationships are explicit. drake_config() takes the plan and makes the spec by doing static code analysis on the commands and functions (i.e. analyze_code()).

The name change is implemented in 322785e.

@wlandau wlandau mentioned this issue Dec 16, 2019
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants