
Abnormal CPU consumption #2977

Closed
NicolasDorier opened this issue Aug 20, 2019 · 31 comments · Fixed by #3101
Assignees
Labels
voodoo (Suspect user, issue or code has ancient curse)

Comments

@NicolasDorier
Collaborator

NicolasDorier commented Aug 20, 2019

When I develop for BTCPayServer, I use a docker-compose setup which runs all the dependencies needed for a comfortable dev environment on regtest (on 0.7.2).

In this docker-compose setup I have two c-lightning nodes, A and B, where A opens a channel to B and sends payments to B (never the reverse).

I noticed that A was taking 100% of my CPU while being idle. Everything seemed to work fine, just that the CPU usage was strangely high.

I have not yet tried to update to 0.7.2 in production, so I don't know yet whether this is a general problem, specific to 0.7.2, or specific to regtest and my own setup.

I thought I'd report it in case somebody can reproduce.

@NicolasDorier
Collaborator Author

OK, it did not happen only to me.

btcpayserver/btcpayserver#978

I have good tracing tools and will eventually figure out the issue, but there is a bug somewhere in 0.7.2.

@NicolasDorier
Collaborator Author

NicolasDorier commented Aug 21, 2019

I tested again with 0.7.1: no problem. In prod, 0.7.2 does not show this problem either, so it seems to affect only 0.7.2 on regtest.

@ZmnSCPxj
Contributor

Can you also make a flamegraph for 0.7.1? Thanks.

The flamegraphs for 0.7.2 seem to blame something called _arch_local_irq_enable; I am uncertain whether this is relevant or not.

@NicolasDorier
Collaborator Author

NicolasDorier commented Aug 21, 2019

@ZmnSCPxj it appears relevant: https://aois.blob.core.windows.net/public/3.svg

Note that in 3.svg, most of the traces are inside bitcoin-cli, not inside lightningd, because lightningd is mainly idle.

@ZmnSCPxj
Contributor

Ouch. I have no idea what _arch_local_irq_enable is; it is nowhere in our codebase. Looks like digging through the libc and/or kernel is in order...

@NicolasDorier
Collaborator Author

NicolasDorier commented Aug 21, 2019

@ZmnSCPxj unsure if it can help you, but our Dockerfile is the same as yours apart from some additions for tracing: https://github.com/btcpayserver/lightning/blob/master/Dockerfile

@NicolasDorier
Collaborator Author

Might it be the fact that we run on Alpine? But it's strange that the problem was not there before. And any idea about __clone?

@ZmnSCPxj
Contributor

ZmnSCPxj commented Aug 21, 2019

__clone either means fork or pthread_create, and since we mostly use fork (and we use it a lot), it is likely just fork, unless some library we use actually uses pthread_create underneath. And since __clone is likely to reset the stack for the new process, that is likely the reason why you do not see it attributed properly to whoever called the fork: the information is erased by __clone clearing the stack.

I think there was a recent thing where we had to close fds when starting plugins, lemme try to dig that...

@ZmnSCPxj
Contributor

Could you check before and after this commit to see if this introduced the slowdown? 25ddb80

If not then maybe bisect from there, haha...

@NicolasDorier
Collaborator Author

@ZmnSCPxj or maybe I can just remove stack clearing with some magic compile flag?

@NicolasDorier
Collaborator Author

testing without the commit

@NicolasDorier
Collaborator Author

NicolasDorier commented Aug 21, 2019

@ZmnSCPxj that's an incredible nose for smelling bugs that you have. Reverting this commit fixes the issue. But why can't I see the problem in prod? Maybe because my node has no connections.

@ZmnSCPxj
Contributor

ZmnSCPxj commented Aug 21, 2019

@ZmnSCPxj or maybe I can just remove stack clearing with some magic compile flag?

No, clone is a Linux kernel thing. You have to modify the kernel, not just C-Lightning.

@ZmnSCPxj that's an incredible nose for smelling bugs that you have.

Thank you very much. The principle is: new code is always suspicious.

The other clue is that the flamegraph partially blames close from within the __clone call. And the commit in question adds a loop of close calls after fork but before exec.
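
For illustration, here is a minimal sketch of the pattern being described (an assumed helper, not the actual lightningd code): after fork() the child loops over every possible fd up to sysconf(_SC_OPEN_MAX) and closes it before exec, so the cost of each spawn scales directly with that limit.

#include <unistd.h>

/* Sketch only: spawn a child process, closing all inherited fds except
 * stdin/stdout/stderr before exec. With _SC_OPEN_MAX at 1024 the loop is
 * cheap; at 1048576 it is roughly a million close() calls per spawn. */
static void spawn_cmd(const char *path, char *const argv[])
{
  pid_t pid = fork();
  if (pid == 0) {
    long max = sysconf(_SC_OPEN_MAX);
    for (long fd = 3; fd < max; fd++)
      close(fd);        /* most of these fds were never even open */
    execvp(path, argv);
    _exit(127);         /* exec failed */
  }
  /* Parent: reap the child elsewhere, e.g. with waitpid(pid, ...). */
}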

Reverting this commit fixes the issue. But why can't I see the problem in prod? Maybe because my node has no connections.

I am unsure. The code is correct functionally. Time to bring in @rustyrussell .

It could be that we are now doing The Right Thing but we erroneously did not do this before. Maybe the extra CPU load is fine?

ZmnSCPxj added the voodoo (Suspect user, issue or code has ancient curse) label on Aug 22, 2019
@NicolasDorier
Collaborator Author

I have gone back to 0.7.1 for our tests. I am keeping 0.7.2 in production and will report if users complain.

@NicolasDorier
Collaborator Author

Will check the value of _SC_OPEN_MAX

@NicolasDorier
Collaborator Author

So I tried to debug with printf by logging before and after the call.

Sadly nothing gets printed. :(
Let me know if I can do something to help debug this.

@NicolasDorier
Collaborator Author

Can the wizard take a look at the voodoo? Or let me know if there is anything I can do to help.

@ZmnSCPxj
Contributor

Get me a replacement for my day job?

Sorry, this is some rather low-level OS stuff. The code added in the commit is correct, from what I know of Linux and POSIX. The extra CPU load might be acceptable. We also run the same code when starting a subdaemon. The difference is that now we do the Right Thing (TM) on a bunch of other calls, including bitcoin-cli, which I suspect is the main cause in your particular case. Pinging @rustyrussell again, as he has more OS-level knowledge, I suspect.

@NicolasDorier
Collaborator Author

NicolasDorier commented Aug 28, 2019

Mmh, maybe another issue might be that the docker image is on musl (fetching glibc from another source).
I vaguely remember complaints about musl having low performance on some low-level stuff like this.

Might try to switch to another base image.

@ZmnSCPxj
Contributor

The added loop is a loop on calls to close, and I find it hard to imagine that such a simple syscall could have additional overhead from the C library. Still, it does not hurt to try.

We also now call into bitcoin-cli more often, and I also noticed that we recently greatly increased our running time in our Travis tests, so it seems to me likely that these changes in total greatly increased our run time for calling into bitcoin-cli (more calls into bitcoin-cli, and more overhead per call due to closing unused fds before calling into bitcoin-cli). So it might not be just specific to your environment.

@NicolasDorier
Collaborator Author

I tried moving the docker image from Alpine to stretch. Same issue.

@cdecker
Member

cdecker commented Sep 24, 2019

I finally took a look at the issue, and it seems like _SC_OPEN_MAX in docker images is huge:

#include <stdio.h>
#include <unistd.h>
int main() {
  int max = sysconf(_SC_OPEN_MAX);
  printf("_SC_OPEN_MAX %d\n", max);
  return 0;
}

This returns the following:

Context         Value
Host (user)     1024
Host (root)     1024
Docker (root)   1048576
Docker (user)   1024

So no wonder that got a lot slower: we are issuing 1024 times as many close() syscalls on each subprocess spawn (in particular when calling bitcoind for polling). So my suggested quickfix would be to not run as root in docker (might be a good idea anyway).

@NicolasDorier would that be an option?

@NicolasDorier
Collaborator Author

NicolasDorier commented Sep 25, 2019

Mmh, I don't know, I can try. The big problem with docker and permissions is that images which use non-root users can't share data via volume sharing with other containers.

The c-lightning container is sharing several directories:

  1. The Tor hidden_service directory, to get the onion address
  2. The Bitcoin data directory, to get the cookie file for authenticating to RPC
  3. The lightning-rpc socket that the btcpay container is using

The Tor and Bitcoin containers are using their own non-root users with different UIDs, and the btcpay container is using the root user.

I don't want to make changes to several different images to fix this situation, as this can be a can of worms. Fixing one image which also shares data might make another image fail, which I then need to fix, etc., and I am not myself the maintainer of some of these images.

I can probably work around 2. by not relying on cookie auth for RPC authentication and using a fixed RPC password instead (we do this when the user selects LND, because LND caches the cookie file value and ends up with bad data in the cache if bitcoind reboots). I don't like this because all bitcoin RPC in BTCPay would have the same password... We don't expose it on the internet, but I still don't like it.

I think 3. should not be a problem: because the btcpay container runs as root, it will still have access to lightning-rpc.

For 1., I can fix it by using the same UID for the Tor container and c-lightning.

On top of this I will need to change the permissions of all files in the c-lightning data directory, because existing users will have the files owned by root. Maybe it is easy, maybe not. I can try...

@NicolasDorier
Collaborator Author

Frankly, with all those problems I wonder if it is not just easier to cap the max variable to 1024.

@cdecker
Member

cdecker commented Sep 25, 2019

Frankly, with all those problems I wonder if it is not just easier to cap the max variable to 1024.

That wouldn't really work: though unlikely, a process might open >1024 fds, since running under root allows us to do so, and then only closing the first 1024 would leave the remainder lingering.

So far I have only found one way that works reliably in docker: run the container with --ulimit nofile=1024. This is also supported in docker-compose files. I'm still trying to see if there is a way to build this limit into the docker image itself, since having everybody add that will lead to some confusion, but I think for your use case it is a solution.

cdecker added a commit to cdecker/lightning that referenced this issue Sep 25, 2019
We should never open more than 1024 file descriptors anyway, and some
situations, namely running as root or in docker, would give us huge
allowances. This then results in a huge, unneeded cleanup for subprocesses,
which we use a lot.

Fixes ElementsProject#2977
@cdecker
Member

cdecker commented Sep 25, 2019

OK, I found another variant that might work: limit ourselves to only ever opening up to 1024 fds, so we don't have to close() anything beyond that range.
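
Presumably that means clamping the soft RLIMIT_NOFILE at startup, since sysconf(_SC_OPEN_MAX) reflects that limit. A rough sketch of the idea, assuming setrlimit is used (the actual change merged for this issue may differ in its details):

#include <stdio.h>
#include <sys/resource.h>

/* Sketch: lower the soft fd limit to 1024 so the close() loop run before
 * exec'ing subprocesses is bounded, even as root or inside docker. */
static void limit_open_fds(void)
{
  struct rlimit rl;

  if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
    return;
  if (rl.rlim_cur > 1024) {
    rl.rlim_cur = 1024;   /* soft limit: what sysconf(_SC_OPEN_MAX) reports */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
      perror("setrlimit");
  }
}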

rustyrussell pushed a commit that referenced this issue Sep 25, 2019
We should never open more than 1024 file descriptors anyway, and some
situations, namely running as root or in docker, would give us huge
allowances. This then results in a huge, unneeded cleanup for subprocesses,
which we use a lot.

Fixes #2977
@NicolasDorier
Collaborator Author

Thanks a lot!

NicolasDorier pushed a commit to btcpayserver/lightning that referenced this issue Sep 29, 2019
We should never open more than 1024 file descriptors anyway, and some
situations, namely running as root or in docker, would give us huge
allowances. This then results in a huge, unneeded cleanup for subprocesses,
which we use a lot.

Fixes ElementsProject#2977
@whitslack
Collaborator

The implemented workaround for this issue caused my node to break as it neared 1024 open channels. (Reported in #4868.)

Rather than arbitrarily clamping down the limit on open file descriptors, couldn't the syscall slowdown have been avoided by making use of the close_range syscall instead of hammering on close in a loop? close_range is somewhat new (first appeared in Linux 5.9), but the code can fall back to the existing (slow) loop in the case that close_range returns ENOSYS.
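
A rough sketch of that suggestion (a hypothetical helper, not actual c-lightning code), calling close_range via syscall() so it still compiles against a libc without the wrapper, and falling back to the per-fd loop when the kernel does not support it:

#include <unistd.h>
#include <sys/syscall.h>

/* Sketch: close every fd from 'first' upward, preferring the close_range
 * syscall (Linux >= 5.9) and falling back to the old per-fd loop. */
static void close_fds_from(int first)
{
#ifdef SYS_close_range
  if (syscall(SYS_close_range, (unsigned int)first, ~0U, 0U) == 0)
    return;
  /* ENOSYS on older kernels (or any other failure): use the slow loop. */
#endif
  long max = sysconf(_SC_OPEN_MAX);
  for (long fd = first; fd < max; fd++)
    close(fd);
}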

@ZmnSCPxj
Contributor

We also support running on the BSDs and macOS, so we need to detect whether we are being compiled targeting Linux. It is probably a good idea to see if a similar syscall or function exists on those OSes as well.

also grats on 1024 chans

@whitslack
Collaborator

close_range does exist on FreeBSD as well, albeit with no flags defined.

I don't know about Mac.
