Abnormal CPU consumption #2977
OK, it did not happen only to me. I have a good tracing tool and will eventually figure out the issue, but there is a bug somewhere in 0.7.2.
@rustyrussell @cdecker @ZmnSCPxj here are some flamegraphs I took:
I tested again with 0.7.1, no problem. In prod, 0.7.2 does not show this problem; it seems to be regtest-only on 0.7.2.
Can you also make a flamegraph for 0.7.1? Thanks. The flamegraphs for 0.7.2 seem to blame something called
@ZmnSCPxj this appears relevant: https://aois.blob.core.windows.net/public/3.svg Note that on
Ouch. I have no idea what
@ZmnSCPxj unsure if it can help you, but our Dockerfile is the same as yours apart from the stuff for tracing: https://github.com/btcpayserver/lightning/blob/master/Dockerfile
Might it be the fact we run on Alpine? But it's strange that the problem was not here before. And no idea about
I think there was a recent change where we had to close fds when starting plugins, lemme try to dig that up...
Could you check before and after this commit to see if it introduced the slowdown? 25ddb80 If not, then maybe bisect from there, haha...
@ZmnSCPxj or maybe I can just remove stack clearing with some magic compile flag?
Testing without the commit.
@ZmnSCPxj that's an incredible nose for smelling bugs that you have. Reverting this commit fixes the issue. But why can't I see the problem in prod? Maybe because my node has no connections.
No,
Thank you very much. The principle is that new code is always suspicious. The other clue is that the flamegraph partially blames
I am unsure. The code is functionally correct. Time to bring in @rustyrussell. It could be that we are now doing The Right Thing where we erroneously did not before. Maybe the extra CPU load is fine?
I have gone back to 0.7.1 for our tests. I'll keep 0.7.2 in production and will report if users complain.
Will check the value of
So I tried to debug with printf by logging before and after the call. Sadly, nothing gets printed. :(
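A guess, not something established in the thread: one classic reason a printf added around a fork/exec path never shows up is stdio buffering. stdout is fully buffered when it is not a tty, and anything still sitting in the buffer is silently lost when the child execs or exits abnormally. A minimal sketch of a marker that survives this, assuming buffering is what happened here:

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* printf goes to stdout, which is block-buffered when not a tty
	 * (e.g. under docker logging); the output can be lost if the
	 * process execs or dies before a flush. */
	printf("before the call\n");
	fflush(stdout);                     /* force it out explicitly */

	/* stderr is unbuffered by default, so this appears immediately. */
	fprintf(stderr, "after the call (pid %d)\n", (int)getpid());
	return 0;
}
```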
Can the wizard take a look at the voodoo? Or let me know anything I can do to help?
Get me a replacement for my day job? Sorry, this is some rather low-level OS stuff. The code added in the commit is correct, from what I know of Linux and POSIX. The extra CPU load might be acceptable. We also do the same thing when starting a subdaemon. The difference is that now we do the Right Thing (TM) on a bunch of other calls, including
Mmh, maybe another issue might be that the docker image is on musl (fetching glibc from another source). Might try to switch to another base image.
The added loop is a loop on calls to … We also now call into …
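The exact code isn't quoted in the thread, so the following is a sketch only, not the actual c-lightning patch: the pattern being described (sweep every possible fd before exec'ing a plugin, bounded by sysconf(_SC_OPEN_MAX)) looks roughly like this, and its cost is one close() syscall per slot, which is why a huge fd limit hurts:

```c
#include <unistd.h>

/* Close every fd above the stdio trio before exec'ing a child.
 * sysconf(_SC_OPEN_MAX) is the upper bound on fd numbers, so this
 * issues one close() per possible fd: cheap when the limit is 1024,
 * painful when docker hands the process a limit near a million. */
static void close_all_fds(void)
{
	long maxfd = sysconf(_SC_OPEN_MAX);

	for (long fd = 3; fd < maxfd; fd++)
		close((int)fd);  /* EBADF for unopened slots is harmless */
}

int main(void)
{
	close_all_fds();
	return 0;
}
```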
I tried moving the docker image from Alpine to Debian stretch. Same issue.
I finally took a look at the issue, and it seems like

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* _SC_OPEN_MAX: the maximum number of fds this process may open. */
	long max = sysconf(_SC_OPEN_MAX);
	printf("_SC_OPEN_MAX %ld\n", max);
	return 0;
}
```

This returns the following:
So no wonder that got a lot slower: we are issuing 1024x the syscalls (1024x the usual 1024-fd limit is on the order of a million close() calls per spawned subprocess). @NicolasDorier would that be an option?
Mmh, I don't know, I can try. The big problem with docker and permissions is that images which use non-root users can't share data via volume sharing with other containers. The c-lightning container is sharing different directories:
The Tor and Bitcoin containers are using their own non-root users with different UIDs, and the btcpay container is using the root user. I don't want to make changes to several different images to fix this situation, as this can be a can of worms: fixing one image which also shares data might make another image fail, which I then need to fix, etc., and I am not myself the maintainer of some of those images.
I can probably work around 2. by not relying on cookie auth for RPC authentication and using a fixed password for RPC (we do this when the user selects LND, because LND caches the cookiefile value and ends up with bad data in the cache if bitcoind reboots). I don't like this because all bitcoin RPC in BTCPay would have the same password... We don't expose it on the internet, but still, I don't like it.
I think 3. should not be a problem: because the btcpay container runs as root, it will still have access to lightning-rpc.
For 1. I can fix it by using the same UID between the Tor container and c-lightning. On top of this I will need to change the permissions of all files in the c-lightning data directory, because existing users will have the files owned by root. Maybe it is easy, maybe not. I can try...
Frankly, for all those problems, I wonder if it is not just easier to cap the max variable at 1024.
That wouldn't really work: though unlikely, a process might open more than 1024 fds, since running as root allows us to do so, and then closing only the first 1024 would leave the remainder lingering. So far I have only found one way that works reliably in docker: run the container with
OK, I found another variant that might work: limit ourselves to only ever opening up to 1024 fds, so we don't have to
We should never open more than 1024 file descriptors anyway, and under some situations, namely running as root or in docker, would give us huge allowances. This then results in a huge, unneeded, cleanup for subprocesses, which we use a lot. Fixes #2977
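The commit is described only in prose here, so this is an illustration rather than the actual patch: one way to "never open more than 1024 file descriptors" is to lower the soft RLIMIT_NOFILE at startup, which also shrinks what sysconf(_SC_OPEN_MAX) reports and therefore the size of the cleanup sweep:

```c
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
	struct rlimit rl;

	/* Lower the soft fd limit to 1024 (never raising it above the
	 * hard limit), so the close-everything loop only has 1024 slots
	 * to sweep even when docker or root grants a huge allowance. */
	if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
		if (rl.rlim_cur > 1024)
			rl.rlim_cur = 1024;
		setrlimit(RLIMIT_NOFILE, &rl);
	}

	/* On Linux/glibc, sysconf reflects the lowered soft limit. */
	printf("_SC_OPEN_MAX %ld\n", sysconf(_SC_OPEN_MAX));
	return 0;
}
```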
Thanks a lot!
The implemented workaround for this issue caused my node to break as it neared 1024 open channels. (Reported in #4868.) Rather than arbitrarily clamping down the limit on open file descriptors, couldn't the syscall slowdown have been avoided by making use of the
We also support running on the BSDs and macOS, so we need to detect if we are being compiled targeting Linux. It's probably a good idea to see if a similar syscall or function exists on those OSes as well. Also, grats on 1024 chans.
I don't know about Mac.
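The syscall named in the question above is elided, so the following rests on an assumption: on Linux >= 5.9 the facility matching that description is close_range(2) (glibc >= 2.34 ships a wrapper), and the BSDs provide closefrom(3). A hedged sketch of a helper along those lines, falling back to the per-fd loop elsewhere:

```c
#define _GNU_SOURCE
#include <unistd.h>

/* Close every fd >= lowfd.  On Linux >= 5.9 (glibc >= 2.34 for the
 * wrapper), close_range(2) does the whole sweep in one syscall, so
 * the cost no longer scales with RLIMIT_NOFILE.  Elsewhere we fall
 * back to the per-fd loop the thread complains about; the BSDs'
 * closefrom(3) would be the analogous single call there. */
static void close_fds_from(unsigned int lowfd)
{
#ifdef __linux__
	if (close_range(lowfd, ~0U, 0) == 0)
		return;        /* done in a single syscall */
#endif
	long maxfd = sysconf(_SC_OPEN_MAX);
	for (long fd = lowfd; fd < maxfd; fd++)
		close((int)fd);
}

int main(void)
{
	close_fds_from(3);     /* keep stdin/stdout/stderr */
	return 0;
}
```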
When I develop for BTCPayServer, I use a docker-compose which runs all the dependencies needed for a comfortable dev environment on regtest (on 0.7.2).
In this docker-compose I have two c-lightning nodes, A and B, where A opens channels to B and sends payments to B (never the reverse).
I noticed that A was taking 100% of my CPU while being idle. Everything seemed to work fine, just that the CPU was strangely high. I have not yet tried to update to 0.7.2 in production, so I don't know yet if this is a general problem, if it is specific to 0.7.2, or specific to regtest and my own setup.
I thought I'd report it in case somebody could reproduce it.