Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrading installed C-Lightning package breaks running instance in potentially dangerous ways #4346

Closed
whitslack opened this issue Jan 21, 2021 · 9 comments
Assignees

Comments

@whitslack
Copy link
Collaborator

whitslack commented Jan 21, 2021

While you all are considering a fundamental redesign of C-Lightning's daemon organization, I would request that you please consider what happens when a C-Lightning installation is upgraded on disk while an instance of that installation is running.

It is customary in the Linux world for the system package manager to upgrade installed packages simply by replacing executable files on disk with newer versions, irrespective of whether those executables may currently be running. This is fine under POSIX semantics, as the mapped VMAs of any running processes will maintain references to the inodes of the replaced versions, preventing them from being deallocated on disk, even after their link counts drop to zero, until no more VMAs reference them. Thus, running daemons will continue running the old code until they are subsequently restarted. While this works just fine for most daemons, it fails spectacularly for C-Lightning, which spawns subdaemons at arbitrary times by execveing their executable images on disk without regard for whether those images are of the same (or compatible) version as their parent daemons.

One potential solution would be to use a "zygote" process in the manner of Chromium/Chrome. A zygote process starts once upon startup of the top-level process and has the sole purpose of forking child processes, which in turn subsequently differentiate into whatever subdaemon specialization is needed, notably without execveing anything on disk. However, this pattern may not be well-suited to C-Lightning, given its capability for dynamically loading and unloading plugins.

Another possibility would be for the top-level lightningd process to open every critical subdaemon executable file at startup, and then when spawning a subdaemon, the appropriate /proc/self/fd/<n> symlink should be execved rather than /usr/libexec/c-lightning/<whatever>. This would have the effect of executing whatever version of the subdaemon was installed at the time when lightningd was launched, irrespective of whatever version of the subdaemon might be installed at the time of spawning. The downsides are: this only works on Linux, and the process name of the subdaemon will then be some opaque number, though I think there may be some tricks to rewrite that for the benefit of ps, top, and friends. Spawning non-critical plugins could still be done by execveing whatever happens to be on disk at the time, with the understanding that the version of the plugin on disk may speak a different protocol than the parent process.

One last potential solution (really more of a punt) would be for subdaemons simply to handshake on a protocol version upon spawning. This wouldn't allow continued operation in the event of a version mismatch, but it would at least protect against accidental catastrophe due to internal protocol changes.

Impetus for this request? I have personally experienced my C-Lightning node dying upon attempting to spawn a new subdaemon after I had upgraded the installed package on disk. Note that this does not happen immediately upon upgrading the package on disk but will happen at some unpredictable time afterward. To mitigate this problem, I have hacked up the Gentoo ebuild for C-Lightning so that it cowardly refuses to upgrade an installation of C-Lightning if it notices that an instance may be running, but this is gross and runs contrary to the expectations of users, who expect that they can upgrade packages on disk and then restart their services afterward to switch to the new versions.

@ZmnSCPxj
Copy link
Contributor

ZmnSCPxj commented Jan 22, 2021

An idea would be for lightningd to hardlink (or if that fails, copy) the subdaemons and builtin plugins to a subdaemons/ directory in the $LIGHTNINGDIR. That way we have a sample of the plugins at startup. When starting, we erase the entire directory at once, then construct it before going into the rest of the code.

We could also have --version of each subdaemon and builtin plugin include a sha256sum of the concatenation of (or maybe a merkle directory tree of) their source code (git log -1 is not enough in a development env since we could have a modified version locally running) just to ensure that the expected versions are the same, in case of a race condition where an update is in a halfway state when the daemon is restarted.

@whitslack
Copy link
Collaborator Author

hardlink (or if that fails, copy) the subdaemons and builtin plugins to a subdaemons/ directory in the $LIGHTNINGDIR.

That would fail if $LIGHTNINGDIR is mounted noexec. Atypical but not unreasonable.

@ZmnSCPxj
Copy link
Contributor

ZmnSCPxj commented Jan 23, 2021

Well, plugins/ could be located in the $LIGHTNINGDIR, so we have some prior justification to requiring non-noexec mounts.

I feel this is a reasonably clean solution that does not require us to move to zygote/stemcell type.

On the other hand, with the vague whispers of Android restricting process-forking, it may be needed to put everything in a single process anyway, which would mean something very much like a zygote would be needed as well.

@rustyrussell
Copy link
Contributor

Yeah, I learned this the hard way, and I now shutdown before make install.

We could have subdaemons send their version string on startup, then restart lightningd (after a few seconds' sleep) if the version is wrong. If we restart and our own version hasn't changed, then simply fail. I think upgrading running instances is a better choice than trying to keep them going?

@whitslack
Copy link
Collaborator Author

Self-restarting is an interesting thought. One nice aspect is that the new process would retain the same PID as the old process, so system service managers (like OpenRC) wouldn't notice anything awry at any instant.

@manreo
Copy link
Contributor

manreo commented Mar 13, 2021

I think this is important, is there any update? Should I avoid the PPA and just install manually (so that I can be sure ubuntu does not decide to upgrade automatically) @rustyrussell

@whitslack
Copy link
Collaborator Author

@mrmanpew: I'm not that well-versed in Debian-like distros, but I would think you can pin/freeze an installed package so that it will not be upgraded unless unpinned/thawed.

@cdecker
Copy link
Member

cdecker commented Mar 17, 2021

We could also just add a protocol version (sha256 hash of the wire file?) to the init messages, so subdaemons can decide whether it matches their version, and throw a tantrum if it doesn't. That'd not be as seamless as the zygote or file descriptor option, but it'd be safe and noisy pointing the user to towards how to fix the issue (restart).

@whitslack
Copy link
Collaborator Author

whitslack commented Mar 17, 2021

Users would still need to freeze their installed version of C-Lightning so that the OS does not upgrade it automatically in the background. I can tell you, one of the greatest annoyances as a node operator is coming back to my machine only to notice that my node shit the bed half a day ago. Yes, I could use some sort of service babysitter that automatically restarts crashed services, but I'm reluctant to do that, as it would paper over problems.

@rustyrussell rustyrussell self-assigned this Mar 25, 2021
rustyrussell added a commit to rustyrussell/lightning that referenced this issue Apr 8, 2021
You still shouldn't do this (you could get some transient failures),
but at least you have a decent chance if you reinstall over a running
daemon, instead of getting confusing internal errors if message
formats have changed.

Changelog-Added: lightningd: we now try to restart if subdaemons are upgraded underneath us.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: ElementsProject#4346
rustyrussell added a commit to rustyrussell/lightning that referenced this issue Apr 16, 2021
You still shouldn't do this (you could get some transient failures),
but at least you have a decent chance if you reinstall over a running
daemon, instead of getting confusing internal errors if message
formats have changed.

Changelog-Added: lightningd: we now try to restart if subdaemons are upgraded underneath us.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: ElementsProject#4346
rustyrussell added a commit to rustyrussell/lightning that referenced this issue Apr 16, 2021
You still shouldn't do this (you could get some transient failures),
but at least you have a decent chance if you reinstall over a running
daemon, instead of getting confusing internal errors if message
formats have changed.

Changelog-Added: lightningd: we now try to restart if subdaemons are upgraded underneath us.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: ElementsProject#4346
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants