Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider removing or optionalizing the gc() in this while loop #454

Closed
kendonB opened this issue Jul 4, 2018 · 7 comments
Closed

Consider removing or optionalizing the gc() in this while loop #454

kendonB opened this issue Jul 4, 2018 · 7 comments

Comments

@kendonB
Copy link
Contributor

kendonB commented Jul 4, 2018

It looks like the vast majority of the overhead time causing the problem in #449 is spent in this while loop:

https://github.com/ropensci/drake/blob/master/R/mclapply.R#L75-L78

If I remove the gc() I see substantial speed improvements, though the master process certainly still can't keep up with the workers for the up-to-date targets which obviously process fast. A better solution would be to do a round of preprocessing targets using the same strategy as outdated so that the first round of targets sent in fl_master are out-of-date.

@wlandau
Copy link
Member

wlandau commented Jul 4, 2018

cc @krlmlr. Ref: comment, #428, #429. Looks like we should use gc() more sparingly.

@wlandau
Copy link
Member

wlandau commented Jul 4, 2018

And since persistent workers check if targets can be skipped, this may be yet another issue improved with #440.

@wlandau wlandau closed this as completed in de5835f Jul 4, 2018
@wlandau
Copy link
Member

wlandau commented Jul 4, 2018

In de5835f, drake calls gc() after each target build and on exit. Please see how it works for you. We can always elect to call gc() less often, but I think de5835f may strike a nice compromise.

@kendonB
Copy link
Contributor Author

kendonB commented Jul 4, 2018

@wlandau de5835f is going to be even worse as it will call gc() for every up-to-date target.

@wlandau
Copy link
Member

wlandau commented Jul 4, 2018

Then 2002653 should be an improvement.

@kendonB
Copy link
Contributor Author

kendonB commented Jul 4, 2018

This is surely still going to be too much gc'ing for a lot of projects. gc'ing often takes 2-3 seconds and for projects with small targets this will add a lot of overhead.

Perhaps your original solution could be modified to detect when a worker is idle for 30 seconds and only gc under those conditions and only once per idling event. Now that I've thought about it this seems like the obvious answer.

@kendonB
Copy link
Contributor Author

kendonB commented Jul 4, 2018

Big improvements with 2002653 using future_lapply. Still took more than 25 mins to get through the up-to-date targets but I'm now I've been staring at the logs waiting for "target ..." to appear for 5 minutes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants