Experiment: Fork from existing worker process to optimize copy-on-write performance #2099
Conversation
lib/puma/cluster.rb
Outdated
```diff
@@ -337,8 +351,7 @@ def stop
   def stop_blocked
     @status = :stop if @status == :run
     wakeup!
-    @control.stop(true) if @control
-    Process.waitall
+    sleep 0.2 until @status == :halt
```
Purpose?
`Process.waitall` no longer completes here, because during a re-fork one of the parent's children is the promoted process, which doesn't exit but instead becomes the new parent. Instead, the halt status is set after the promotion process is complete and all other worker-children have terminated.
Why did I never know about SIGURG? :headdesk: I'm a little nervous about the way this restart is implemented; there are some random changes in here that I think will mess up restart state. That can be fixed with some test coverage ofc. This introduces a full hard-restart, though, right? Who is going to be willing to trade memory savings for 60-120 seconds of hard downtime?
lib/puma/runner.rb
Outdated
```diff
@@ -64,30 +63,19 @@ def start_control
     control.min_threads = 0
     control.max_threads = 1

     case uri.scheme
```
Wait, this is possible? Can we do this in a separate PR?
Extracted this change to #2111.
lib/puma/launcher.rb
Outdated
```diff
@@ -184,7 +184,6 @@ def run
       when :exit
         # nothing
       end
-      @binder.close_unix_paths
```
:|
This reflects a bit of refactoring: this line caused the old parent to delete the Unix socket when shutting down during a promote operation, which in this PR needs to be skipped. The call here is made unnecessary by this change to Cluster#run, which includes a call to close_binder_listeners; that call also cleans up any Unix sockets, but is skipped during promote.
Some of this refactoring could be extracted to a separate PR without the promote logic.
See #2112 for the extracted listener-closing refactoring part of this PR, which can be considered separately from the rest of the promote logic.
No, it just forks a new set of worker processes from one of the already-running workers, which is nearly instant; the app doesn't need to be reloaded from scratch at all.
Force-pushed from 81624bd to 8872e88.
I completed a pass through to minimize the changes. It's now based on two smaller PRs (#2111, #2112) that have some related fixes extracted out, leaving the promote-specific logic in a single commit (8872e88) that's more concise. I've also added a simple integration test covering the new behavior.
I think the biggest problem with the approach is how someone would actually use it. I can't figure out how a production system would know when to issue the promote call. If we were going to do this, it would need to be automatic instead.
lib/puma/cluster.rb
Outdated
```ruby
def promote
  Process.kill "URG", @pid
  Process.detach @pid
```
Process.detach should never be used. Its implementation is extremely bad, as it spawns a Thread just to perform a wait in a loop.
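For reference, a minimal sketch of the non-blocking alternative being suggested here, assuming the old process's pid is tracked in an instance variable (the names and the surrounding loop are illustrative, not the PR's actual code):

```ruby
# Sketch: reap the old process from an existing periodic loop (e.g. the
# cluster's worker-check loop) instead of spawning a detached waiter thread.
def reap_old_parent
  return unless @old_parent_pid
  # WNOHANG returns immediately: nil while still running, the pid once it exits.
  if Process.wait(@old_parent_pid, Process::WNOHANG)
    @old_parent_pid = nil
  end
rescue Errno::ECHILD
  @old_parent_pid = nil # already reaped elsewhere
end
```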
I was thinking about testing this on CodeTriage with promotion after a few thousand requests.
Force-pushed from 52f40e9 to 7c978e8.
I did another pass on the implementation to simplify further and make the diff smaller. This approach leverages the existing phased-restart logic to cycle workers one at a time for copy-on-write improvements, with minimal impact to a live site.
@wjordan Are those 4 new commits you added to this branch required to make the original promote feature pass the tests? If not, could you break all 4 of those into separate PRs?
So, we're also going to need a new "worker recycle" option to go with this feature. Essentially, after serving X requests, restart/refork all workers gradually. This option should only be possible with reforking, and it should only happen once. This is so that people don't use it to create their own e.g.
Those extra commits were from #2123, which I rebased this PR on; I thought some of that work may have fixed some of the test failures in this PR. I've rebased this work off
Deploying to CodeTriage production to give it a shot. As for automatic, I think what we should provide is probably an option to do this after a certain number of requests have been processed (I'm not sure what number is required, but we'll see after testing in prod). What I'm curious about is whether production traffic might expose that these gains are temporary. CoW in Ruby is not great right now. I'm wondering if benchmarking against the same route over and over exaggerates the possible gain, because the amount of object allocation is predictable and probably very low. Prod loads will let us know the answer.
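To make the "automatic after N requests" idea concrete, here is a rough sketch of what such a trigger might look like in a Puma config file; the numeric argument to `fork_worker` is hypothetical at this point in the discussion:

```ruby
# config/puma.rb - sketch of an automatic refork trigger (hypothetical API).
workers 8

# Hypothetical: once the template worker has served ~1000 requests, the master
# would automatically refork the remaining workers from it, exactly once.
fork_worker 1000
```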
So I tried it and worker 0 reforked? I think that's the opposite of what's supposed to happen?
CodeTriage is also maybe not a great app to test this on, because it has such low memory usage to begin with.
I think this is interesting enough that we can still include it as an "experiment" in 5.0, with the changes required as I described.
lib/puma/dsl.rb
Outdated
```ruby
# `preload_app` is not required when this option is enabled.
#
# @note Cluster mode only.
def fork_worker(answer=true)
```
Should be false by default
Probably should also provide some guidance in the docs about forking from a dirty state: connection pools, state, etc.
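As an illustration of the kind of guidance that could go in the docs, a sketch of re-establishing per-process resources after each fork; `on_worker_boot` is an existing Puma hook, and the ActiveRecord call is just one example of a resource that must not be shared across forked workers:

```ruby
# config/puma.rb - sketch: avoid carrying live handles over from the
# template worker into its forked children.
on_worker_boot do
  # Reopen anything that must be per-process: database connections,
  # Redis clients, open sockets/files, background threads, etc.
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end
```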
Thanks @nateberkopec for spending time testing out this branch! I'll try and find time soon to do another pass on this PR based on the feedback.
Yeah, worker 0 shouldn't reload itself when sent the refork command. Here's the output I get when starting a test app and running a refork:

Puma log
```
[10074] Puma starting in cluster mode...
[10074] * Version 4.3.1 (ruby 2.6.0-p0), codename: Mysterious Traveller
[10074] * Min threads: 0, max threads: 16
[10074] * Environment: development
[10074] * Process workers: 8
[10074] * Phased restart available
[10074] * Listening on tcp://0.0.0.0:9292
[10074] Use Ctrl-C to stop
[10074] * Starting control server on unix:///tmp/puma-status-1585851969025-10074
[10074] - Worker 0 (pid: 10123) booted, phase: 0
[10074] - Worker 1 (pid: 10133) booted, phase: 0
[10074] - Worker 2 (pid: 10138) booted, phase: 0
[10074] - Worker 3 (pid: 10143) booted, phase: 0
[10074] - Worker 4 (pid: 10149) booted, phase: 0
[10074] - Worker 5 (pid: 10155) booted, phase: 0
[10074] - Worker 6 (pid: 10159) booted, phase: 0
[10074] - Worker 7 (pid: 10168) booted, phase: 0
[10074] - Starting phased worker restart, phase: 1
[10074] + Changing to [DIR]
[10074] - Stopping 10133 for phased upgrade...
[10074] - TERM sent to 10133...
[10074] - Worker 1 (pid: 10588) booted, phase: 1
[10074] - Stopping 10138 for phased upgrade...
[10074] - TERM sent to 10138...
[10074] - Worker 2 (pid: 10607) booted, phase: 1
[10074] - Stopping 10143 for phased upgrade...
[10074] - TERM sent to 10143...
[10074] - Worker 3 (pid: 10627) booted, phase: 1
[10074] - Stopping 10149 for phased upgrade...
[10074] - TERM sent to 10149...
[10074] - Worker 4 (pid: 10645) booted, phase: 1
[10074] - Stopping 10155 for phased upgrade...
[10074] - TERM sent to 10155...
[10074] - Worker 5 (pid: 10690) booted, phase: 1
[10074] - Stopping 10159 for phased upgrade...
[10074] - TERM sent to 10159...
[10074] - Worker 6 (pid: 10708) booted, phase: 1
[10074] - Stopping 10168 for phased upgrade...
[10074] - TERM sent to 10168...
[10074] - Worker 7 (pid: 10724) booted, phase: 1
```

What is the unexpected output you saw?
NVM my comment - I got the output you got.
Force-pushed from e10d26b to e5fe1dc.
Finished another pass on this feature. Changes:
- Trigger refork for improved copy-on-write performance.
- Eliminate `instance variable @notify not initialized` warnings.
Thanks for your hard work here @wjordan - I'm really excited to test out this feature in the 5.0 beta.
This is still rather experimental, but I thought I'd post this work in progress to share preliminary results and spark discussion on whether this optimization seems at all interesting to anyone else to pursue further.
Description
This PR adds a config option (`fork_worker`) to spawn new worker processes by forking an existing worker process instead of the master process. It also adds a `refork` command (currently implemented via a `URG` signal) which re-forks all processes in the cluster from this original worker.

Because the existing worker has been serving live requests, forking new workers from this fully-loaded process can potentially reduce memory usage by further optimizing copy-on-write performance across processes, as a step beyond what the `preload_app!` directive accomplishes at initialization.

Demonstration
The ideal use-case for this technique involves a Rack application that loads a lot of persistent (non-garbage-collected) content into memory at runtime, instead of at initialization time where `preload_app!` is able to load it before the initial fork. (Think of dynamically generated data or code that loads only after particular user inputs, an in-memory object cache, etc.)

Here's a minimal Rack application which demonstrates the optimization:
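A sketch along these lines (the exact app behind the numbers below may differ; the sizes here are chosen only for illustration):

```ruby
# config.ru - sketch: a Rack app that accumulates persistent, per-process
# memory at runtime (after the initial fork), which preload_app! alone
# cannot share between workers.
CACHE = []

run lambda { |env|
  # Lazily "load" ~1MB of long-lived data per request, up to ~500MB per worker.
  CACHE << ("x" * 1024 * 1024) if CACHE.size < 500
  [200, { "Content-Type" => "text/plain" }, ["cached #{CACHE.size} MB\n"]]
}
```

With `fork_worker` enabled in the Puma config, issuing the added `refork` command (a `URG` signal to the master) re-forks the cluster's workers from the already-warmed-up worker.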
To test and measure 'real' memory usage, compare the total PSS across all workers with a good amount of request traffic, before and after re-forking cluster workers using the added `refork` command.

Memory usage in this (obviously very contrived) example is reduced from ~5GB to ~1GB, an ~80% reduction in total memory usage across 8 workers.
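For reference, one way to compute the total-PSS figure above (not part of this PR): on Linux 4.14+, each process exposes a `Pss:` line in `/proc/<pid>/smaps_rollup`, so summing it across the worker pids gives the cluster's proportional memory use. A sketch:

```ruby
# pss_total.rb - sketch: sum PSS across a set of pids (e.g. puma workers).
# Usage: ruby pss_total.rb 10123 10133 10138 ...
total_kb = ARGV.sum do |pid|
  # Each smaps_rollup file contains a line like "Pss:   1234 kB".
  File.read("/proc/#{pid}/smaps_rollup")[/^Pss:\s+(\d+) kB/, 1].to_i
end
puts "Total PSS: #{total_kb / 1024} MB"
```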
Obviously most real-world use-cases won't see such a dramatic difference, but this might still be a useful optimization in some situations. I look forward to testing on some large Rack applications to see how useful it is in practice.
Production demo
I ran a rough comparison test on a production Rails app running 40 workers on a single instance, measuring memory usage after serving traffic for 15 minutes after a restart (for 'Baseline'), then restarting workers and serving traffic for another 15 minutes (for 'Optimized'):
Baseline
Per-worker Usage (MB): 2170 Rss, 1530 Pss, 655 Shared
Optimized
Per-worker Usage (MB): 2168 Rss, 1177 Pss (-23%), 1041 Shared (+59%)
This demo shows a ~20% improvement in memory usage on this particular application. (Note this app is running Ruby 2.5; the compacting GC in Ruby 2.7 may enable even greater CoW efficiency.)