Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider a user-visible build command #157

Closed
kendonB opened this issue Nov 14, 2017 · 13 comments
Closed

Consider a user-visible build command #157

kendonB opened this issue Nov 14, 2017 · 13 comments

Comments

@kendonB
Copy link
Contributor

kendonB commented Nov 14, 2017

I'm still trying to improve my workflow with many different calls to make.

Using skip_imports = TRUE, trigger = "missing", graph = my_graph, improves speed somewhat but drake still does a lot of work before getting started on building targets.

I propose a different building command that doesn't bother at all with any checking and builds unconditionally. This would more literally be @wlandau-lilly's proposal to advertise drake as a cool and flexible job scheduler.

It would:

  • Build the graph for the selected targets only (or the user passes a prebuilt graph which gets pruned)
  • Build the selected targets unconditionally and just fail if dependencies are missing - the user can choose to build dependencies.
  • Hash dependencies upon successful building so that drake::make knows those targets are up to date.
  • Decide what order to build stuff in.
@wlandau-lilly
Copy link
Collaborator

I do want drake to have that kind of fast functionality. On the other hand, drake is almost there, so I am resistant to an extreme solution that duplicates major, sensitive parts of the code base. Also, no matter what direction we go, I think we need to understand the remaining bottlenecks first. @kendonB, since I do not have a project as large as yours, I will need your help with that. I recommend working with the internal functions directly, some of which are actually exported. For example, how fast is make_imports() or drake_config() on your workflow? Once you have a config list, how fast is run_parallel_backend()? what about run_parallel(), or even an individual parallel_stage()?

@wlandau-lilly
Copy link
Collaborator

wlandau-lilly commented Nov 14, 2017

@kendonB your proposed solution will probably work if meta_list() is still a bottleneck. In that case, it will probably be better to just have a rush argument to make() that forces meta_list() to be skipped entirely. If it isn't, then there is probably some other unknown bottleneck, and we will need an entirely different solution altogether.

@wlandau-lilly
Copy link
Collaborator

As we're looking into this, it would help to distinguish drake's prep work from future_lapply(). In downsized versions of my old projects, the future-powered SGE example is actually a lot slower than for other forms of parallelism even though all the checking is the same.

@wlandau-lilly
Copy link
Collaborator

Another point: the graph must contain imports and the imports must be processed for the targets to be up to date. If we go any faster, we may lose reproducibility.

@wlandau-lilly
Copy link
Collaborator

Another possible bottleneck: next_targets(), which we need for executing targets in the correct order. @kendonB, I believe you pointed that out as a slow point before, and at at the time, we could find a faster replacement. Unfortunately, we will always need it in parallel_stage().

@kendonB
Copy link
Contributor Author

kendonB commented Nov 14, 2017

These are the bottlenecks I see when running make on targets that are up to date with skip_imports = TRUE, trigger = "missing", graph = my_graph

drake_config() #7.78 seconds
  # inside drake_config:
  quick_inventory() #: 5 seconds
  check_drake_config() #: 2.275 seconds
  
  # in prepare_distributed:
  attempts <- outdated(config = config, make_imports = !config$skip_imports) # 5 seconds
  #in run_parallel
  quick_inventory() #: 5 seconds

@wlandau-lilly
Copy link
Collaborator

wlandau-lilly commented Nov 14, 2017

Useful to know, thanks @kendonB!

The inventories actually make things faster because they avoid repeated calls to cache$list() for storr caches. Nothing more we can do there.

I will look into the call to outdated(), but I think it's also necessary. For usual make()'s, it is actually responsible for making the imports.

@wlandau-lilly
Copy link
Collaborator

On second thought, I probably can speed this up a bit. We can borrow inventories across storr namespaces, and we can add an option to skip the safety checks. I will give it a try.

@wlandau-lilly
Copy link
Collaborator

Facepalm moment: we don't even need inventories if we have cache$exists()! I can't believe I missed that! Anyway, drake should be a LOT faster in the next round of commits. Stay tuned.

@wlandau-lilly
Copy link
Collaborator

@kendonB I am excited for you to try a28cfeb. I totally dispensed with quick_inventory() and other inventory functions, and I added a skip_safety_checks argument to make() so you can skip check_drake_config().

@wlandau-lilly
Copy link
Collaborator

wlandau-lilly commented Nov 15, 2017

You may also be interested in 46e8b73. There, I added a new user-side make_from_config() make_with_config() function to do the work of make() on an existing configuration list from drake_config().

load_basic_example()
# Same as make(my_plan, ...):
config <- drake_config(my_plan)
make_with_config(config)

@kendonB
Copy link
Contributor Author

kendonB commented Nov 15, 2017

make runs much faster now!

I'm not sure I can think of any use case for make_with_config though? Recall that I only need many different calls to make because I need to switch future::plan configurations for different sets of targets, thus requiring a different config object for each set.

@wlandau-lilly
Copy link
Collaborator

Glad to hear it!

With the latest bottlenecks identified, I am even more convinced that a version of make() with unconditional builds is the same as the "missing" trigger: both assume meta_list() is the bottleneck. So until meta_list() is rate-limiting, I am closing this issue. I think we can continue to find and eliminate bottlenecks as they arise, wherever they happen to be, without duplicating sensitive parts of the code. Drake's code base is getting too large as it is, I counted 9107 lines of code in R/.

By the way: build() is already exported, but it only builds a single target.

Given make()'s massive argument list, make_with_config() may be convenient for some folks, so I am keeping it.

Over the last month or so, you have raised 15 issues and submitted one pull request, and drake is a much better package because of it. There is still a contributor spot waiting for you whenever you want it. On a related note, I think you are getting much more familiar with drake's internals, so if you're up for it, I encourage you to use pull requests more often. Very soon, I will have trouble keeping up with incoming issues: I will be on vacation all of next week, and drake is about to get a lot more scrutiny via #147 and ropensci/software-review#156. I expect even more people to chip in if the reviews are successful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants