-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle plugins being terminated correctly #3539
Handle plugins being terminated correctly #3539
Conversation
I have one more cleanup, removing the plugin array from the hook call altogether and rely solely on the |
4e6abbe
to
61ab064
Compare
Caught one more memory-leak :-) |
b2f1fe5
to
229202e
Compare
@@ -133,9 +144,21 @@ static struct command_result *plugin_start(struct dynamic_plugin *dp) | |||
/* Give the plugin 20 seconds to respond to `getmanifest`, so we don't hang | |||
* too long on the RPC caller. */ | |||
p->timeout_timer = new_reltimer(dp->cmd->ld->timers, dp, | |||
time_from_sec((10)), | |||
time_from_sec((20)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't 10 seconds long enough ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation says 20 seconds, and this was not matching, so I adapted the value to the documented one. We can also go the other way around :-)
I think I just forgot to update it last time i lowered the timeout ^^ But that's really a nit !
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
Le vendredi, février 21, 2020 4:46 PM, Christian Decker <notifications@github.com> a écrit :
… @cdecker commented on this pull request.
In lightningd/plugin_control.c:
> @@ -133,9 +144,21 @@ static struct command_result *plugin_start(struct dynamic_plugin *dp)
/* Give the plugin 20 seconds to respond to `getmanifest`, so we don't hang
* too long on the RPC caller. */
p->timeout_timer = new_reltimer(dp->cmd->ld->timers, dp,
- time_from_sec((10)),
+ time_from_sec((20)),
The documentation says 20 seconds, and this was not matching, so I adapted the value to the documented one. We can also go the other way around :-)
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or unsubscribe.[https://github.com/notifications/beacon/AFLK3F7277I2HJNOEBFNS4TRD7ZLRA5CNFSM4KX4C4LKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCWPLOQY.gif]
|
/* Call next will unlink, so we don't need to. This is treated | ||
* equivalent to the plugin returning a continue-result. | ||
*/ | ||
plugin_hook_callback(NULL, NULL, NULL, link->req); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok so this makes plugin_hook_callback
call the callback with NULL
so it thinks actually no plugin was registered for this hook and can continue operations ? This is neat as we don't crash as before, but don't we want to log broken or unusual here so we know the plugin actually crashed, and for which call ?
Also, since it crashed don't we want to unregister it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We get called due to plugin_killed
being called or because we tal_free(plugin)
which itself prints a warning to the logs and cleans up / unregisters the plugin from all hooks, hook-calls, and other things they might have registered for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 229202e
We were waiting for both stdin and stdout to close, however that resulted in us deferring cleanup indefinitely since we did not poll stdout for being writable most of the time. On the other hand we are almost always polling the plugin's stdout, so that notifies us as soon as the plugin stops. Changelog-Fixed: plugin: Plugins no longer linger indefinitely if their process terminates
Changelog-Fixed: plugin: A crashing plugin will no longer cause a hook call to be delayed indefinitely
We make the current state of `lightningd` explicit so we don't have to identify a shutdown by its side-effects. We then use this in order to prevent the killing and freeing of plugins to continue down the chain of registered plugins.
We are attaching the destructor to notify us when the plugin exits, but we also need to clear them once the request is handled correctly, so we don't call the destructor when it exits later.
We promised we'd be waiting up to 20 seconds, but were only waiting for 10. Fix that by bumping to the documented 20.
It was a pointer into the list of plugins for the hook, but it was rather unstable: if a plugin exits after handling the event we could end up skipping a later plugin. We now rely on the much more stable `call_chain` list, so we can clean up that useless field.
229202e
to
acaec67
Compare
Rebased on top of |
Due to a bug in our plugin cleanup logic we'd be waiting for both the plugin's
stdout
andstdin
to get closed before cleaning it up. However, since weonly poll
stdout
for incoming messages from the plugin, and not poll itsstdin
, we'd defer cleaning up indefinitely, until we either send somethingto the plugin or we shut down the node entirely. This means that we'd also not
detect crashes in a reasonable time, and a plugin crashing while handling a
hook event could hang forever.
This PR first demonstrates this issue using a plugin that exits as soon as it
htlc_accepted
hook is called, and then proceeds to fix the issue. We takethe closing of a plugin's
stdout
as the sole signal that the plugin isexiting and trigger the cleanup immediately. This then surfaced a number of
issue where we had memory either lingering around or the
tal_free
orderbeing incorrect, resulting in a number of use-after-free issues. So I had to
dive in and clean things up a bit.
In order to facilitate skipping a crashed plugin I changed the call chain for
hook events to be a list instead of an array. Each hook event now has a list
of plugins that are still to call, and we pop off elements as we receive
responses or plugins exit.
Another issue was that we'd now be falsely detecting a node shutdown during
hook calls as the plugin exiting spontaneously. To facilitate these back out
cases in future I added a state variable that indicates whether we are
operational or we are shutting down. This used to be detected through a number
of side-effects that weren't well documented (variables being set to
NULL
etc), so making this explicit should make this clearer.