High CPU load of gearman worker and daemon #232
Operating system/distro/kernel versions? |
OS: linux |
I thought that the problem could be caused by the fact that my clients are connected to 2 gearman servers. So I've connected some of them to one gearman server, some clients to the other and some to both. It did not help - after a few hours it went mad again, and the strace of the clients connected to one server only looked like the example above, just with one connection. Additionally, some clients are written in C++, some in PHP, and both are affected, so the problem must be in the gearman library or the daemon itself. |
I'm guessing you have, but, just to be clear, have you tried the straight up 1.1.18 release? Or any other releases?
You mean workers, I presume?
My workers are written in Perl, and the jobs are submitted via PHP or Perl. Fwiw, I'm not seeing this. I'm on CentOS 7 though, not Debian. |
I've built the 1.1.18 version and installed it. I'll let you know if it works or not. The problem usually takes a few hours or a day to show. |
Thanks for working through it @zacek . I had something similar a while back and found a path through the state machine that ended in a busy wait loop in some cases. I wonder if your problem is related to #230 .. do you know if any signals are being delivered to gearmand, like maybe systemd sending it a HUP? |
Hi @SpamapS, thanks for your feedback. I have several types of workers that deliver part of the application functionality. There is always one "manager" process that manages workers of one type - it starts the workers, checks if they are still running (using signal 0 sent to each worker), restarts missing workers, and collects and sends statistics that the workers write to Redis. The manager is not connected to gearman. When the manager receives SIGINT, SIGTERM or SIGHUP, it sends the signal to all running workers and waits for all of them to finish. All workers install signal handlers for SIGINT, SIGTERM and SIGHUP, and if such a signal arrives they finish their current job (if any) and then exit. It is possible that some manager is restarted and thus signals the workers connected to the gearman daemon while the other workers keep running. I do not send any signals to the gearman daemon. I'm only regularly (in cron) using gearadmin to check the queue sizes. This is not sending any signals, AFAIK. But the problem you've reported is in the common libgearman code and could affect both the daemon and the workers. So my answer is yes, IMHO, it could be related. |
The stable version 1.1.18 has the same problem - after approx. 14 hours of normal operation the CPU is high again. |
I have to re-verify the 1.1.18 stable build - I've found out that some clients were running on another server and were still using the latest version of libgearman. Now I've been running the stable 1.1.18 build for 16 hours and the problem has not appeared yet. I'll check it again after the weekend and let you know. |
I've tested the 1.1.18 stable build for the whole week and the problem did not re-appear. That's fine but it doesn't prove anything. We are running 3 environments - alpha, staging and production. All of them were running the same version of gearman (1.1.18 + 6 commits) but the problem was detected in production only. We were not able to reproduce it anywhere else. After some time the problem appeared on alpha as well. I have full access there and could do some problem analysis. I've also built and tested the latest gearman version (master branch - 1.1.18 + 136 commits) but it did not help - still broken. I must admit we do not know what the trigger is; staging is still OK (for several months), production is not. We will try to deploy the stable version to production with the next release. |
Thanks for all of the awesome data zacek. This might be a tough one to nail down, and all the data we can get, we'll take. |
Could this issue be related to https://bugs.launchpad.net/bugs/1176324 ? |
@esabol it's possible, but seems unlikely. That one was centered around an incorrect state machine handling of a race, which was corrected. |
I seem to be getting this problem as well. I have had the same issue for years but have just restarted nodes as needed; the server does not seem to need restarting. For the most part it seems to happen after periods of high persistent load; sometimes it just clears itself up, other times I have to restart the offending workers to clear it. The high CPU is also accompanied by high network load between the server and workers - I did a pcap and it's just stuck in a very fast loop. What I found odd, though, is that the worker nodes it affects may be completely dormant and just waiting for jobs. By that I mean I can say for certain that there will be no jobs of the task that is assigned to workers on that node, but the high network traffic will still affect it. |
See this strace which demonstrates the problem: https://gist.github.com/gfto/474c25c3585fa20f24f0f95b9afa8120 The issue is that the client starts looping (the poll() is being interrupted) if it does not receive the answers to the two packets that it always sends in a single response packet. It goes like this - the client always sends:
then it goes to sleep:
Once it's woken up by the server it reads the answer, but if the answer does not contain both responses it starts looping until it gets the two responses in a single packet. This is an OK response from the server:
This response causes looping in the client (and high CPU usage in the client and in the server):
See https://gist.github.com/gfto/474c25c3585fa20f24f0f95b9afa8120 - it is from a normal worker that just waits for jobs. See how sometimes the cycle takes just a single iteration and sometimes it loops until it gets the two response packets as a single network packet. |
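For readers who want to see the shape of that exchange in code, here is a minimal, hypothetical sketch of the idle cycle being described, written directly against the gearman binary protocol rather than through libgearman. The command codes and framing are my assumptions from the public protocol description (verify them against libgearman's protocol headers), and a real worker would also register its functions with CAN_DO first; the point is only to show where the stray wake-up lands.

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>

// Gearman binary protocol command codes, assumed from the public protocol
// description; double-check against libgearman's headers before relying on them.
static const uint32_t CMD_PRE_SLEEP = 4, CMD_NOOP = 6, CMD_NO_JOB = 10,
                      CMD_JOB_ASSIGN = 11, CMD_GRAB_JOB_ALL = 39;

// Send a zero-payload request packet: "\0REQ" + big-endian command + length 0.
static void send_req(int fd, uint32_t cmd) {
  unsigned char pkt[12] = {0, 'R', 'E', 'Q'};
  uint32_t be = htonl(cmd);
  memcpy(pkt + 4, &be, 4);  // command code; length bytes stay zero
  write(fd, pkt, sizeof(pkt));
}

// Read one response header ("\0RES" + command + length), drain the payload,
// and return the command code (0 on error/EOF).
static uint32_t recv_cmd(int fd) {
  unsigned char hdr[12];
  for (size_t got = 0; got < sizeof(hdr);) {
    ssize_t n = read(fd, hdr + got, sizeof(hdr) - got);
    if (n <= 0) return 0;
    got += static_cast<size_t>(n);
  }
  uint32_t cmd, len;
  memcpy(&cmd, hdr + 4, 4);
  memcpy(&len, hdr + 8, 4);
  cmd = ntohl(cmd);
  len = ntohl(len);
  char junk[4096];
  while (len > 0) {  // discard payload (a JOB_ASSIGN carries handle/function/workload)
    ssize_t n = read(fd, junk, len < sizeof(junk) ? len : sizeof(junk));
    if (n <= 0) break;
    len -= static_cast<uint32_t>(n);
  }
  return cmd;
}

int main() {
  struct addrinfo hints = {}, *res = nullptr;
  hints.ai_socktype = SOCK_STREAM;
  if (getaddrinfo("127.0.0.1", "4730", &hints, &res) != 0) return 1;
  int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
  if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) return 1;
  freeaddrinfo(res);

  // A real worker registers its functions with CAN_DO first; omitted here.
  for (;;) {
    send_req(fd, CMD_GRAB_JOB_ALL);
    if (recv_cmd(fd) == CMD_JOB_ASSIGN) continue;  // a real worker would run the job
    send_req(fd, CMD_PRE_SLEEP);                   // "wake me with a NOOP when there is work"
    struct pollfd pfd = {fd, POLLIN, 0};
    poll(&pfd, 1, 5000);                           // sleep up to 5 seconds
    // If the NO_JOB and the NOOP arrive in separate TCP segments, the second
    // segment lands here, poll() returns at once, and the cycle restarts; that
    // is the busy loop visible in the strace gist above.
    if (pfd.revents & POLLIN) recv_cmd(fd);        // drain whatever woke us
  }
}
```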
The following fixes the issue for good:
|
@gfto why does it only receive one response? Does that not also indicate an issue on the server side sending the message? As it's TCP, we should not be seeing a loss of these messages, I'd have thought. |
@gfto wrote:
You mean the worker, not the client, right? Using client and worker interchangeably was confusing me for a minute there. 😄
It's not clear to me why sending a NOOP to a worker who has just sent a PRE_SLEEP message to the server would cause a problem. Why do you think it's freaking out?
Why does the worker require two responses in a single packet? I'm showing my ignorance here. Is this a requirement of the Gearman/worker protocol? If not, shouldn't the worker be able to handle this? Does this indicate a problem with the worker or rather the library used to implement the worker? Not saying the server (i.e., gearmand) shouldn't be able to handle this more gracefully as well. I'm just wondering if the issue is multi-faceted. What language is this worker implemented in? |
@tsmgeek seems to have also found that there's some connection between worker interactions and the server. I wonder if it's as simple as a worker and server sort of getting into a game of "give me a job, I'll sleep till you do" and then the server saying "Nope, no jobs" is misconstrued by the client to mean "I should send another GRAB_JOB" and the loop just goes on forever. @gfto seems to have found that the opposite is also part of it.. the server says "I DO have jobs" and the client expects a JOB_ASSIGN but doesn't get it and sends a GRAB_JOB just as it's about to receive JOB_ASSIGN, and now they're out of sync. These can probably be tested, but it feels like we might have a state machine race. Ugh. |
The client does receive all the responses but in different network packets. The second one triggers a wake-up, and then the client asks the server again, and this leads to the issue at hand. I guess the server receives the first packet and responds, then receives the second one and responds. Sometimes the server receives the two packets and processes them, responding with a single packet that contains the two responses. At least that is what I see while strace-ing the client and the server (unfortunately the code is hard to follow, contains too many abstractions for my liking and it's just plain weird in the places where the bug is). |
It's exactly that, unfortunately. I've tried to fix the client and the server state machines but the way they're coded makes it pretty hard. I've stared at the code for several days and still can not understand it :/ See for example https://github.com/gearman/gearmand/blob/master/libgearman/worker.cc#L768 (simplified below):
or this https://github.com/gearman/gearmand/blob/master/libgearman/worker.cc#L830 The good news is that the patch fixes the biggest issue which just wastes CPU cycles in both the server and the client. The bad news is that the fix uncovered other problems related to EAGAIN handling in the server I/O routines. I'm working on fixing that. I'm lucky because the great QA guys I'm working with are able to reproduce the problems using a small VM and multiple stress-ng processes which, after running for some time, cause gearmand to start spinning more and more (much less with the patch). |
By "client" I mean the worker (the one that calls gearman_worker_work() function). In my setup I have one worker per gearmand client that's why I'm using client interchangeably with worker.
Because it interrupts the poll() used for sleeping in the client (the PRE_SLEEP phase). Once the poll() it's interrupted (because there is data in the socket) the client/worker send again "GRAB_JOB/PRE_SLEEP" to the server and goes to sleep. The problem happens when the client receives the answer to only GRAB_JOB and then another answer to PRE_SLEEP which wakes the client again and the loop happens. See the strace. This is kind of OK:
This is a problem:
After the patch the poll() isn't getting interrupted. |
@gfto is there any way you can try out https://github.com/SpamapS/rustygear in your QA scenario? It's not 100% functionally equivalent to gearmand.. but if it can be slotted in.. it might tell us something. Or if you can contribute a test case (can just be a documentation file) that shows us how to reproduce, I can try that. I'm with you that the state machine is really hard to read and debug. I tried to make rustygear easier to read for that reason. |
OK, but is there any downside to removing that code? If I'm reading the code correctly, the intention here is to inform the worker that it shouldn't sleep because there are jobs waiting. So what happens if the server doesn't do that? The worker goes to sleep, right? And the job sits in the queue potentially longer until a worker (probably a different worker) grabs it. So this might increase job latency in some scenarios, right? I could live with that if it prevents this bug, but it seems like the ideal solution is to fix the race condition.... (Easier said than done.) I 100% agree that the |
I agree, @esabol that it doesn't make sense to leave the worker asleep if there are jobs. The whole point of PRE_SLEEP is basically to let the server know that it needs waking up if there are jobs. One thing is that this may also be an accident of the protocol. There's not a lot of flow control in this multiplexed protocol, and it's possible that when gearman's protocol was designed, the slowness of systems and networks just didn't allow this scenario to happen very often if ever. I've oft wondered if we should redesign gearman around websockets and this is at least a data point for trying out something like that. |
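To make the latency trade-off concrete, here is a rough, hypothetical sketch (not gearmand's actual code) of the scheduling rule under discussion: on PRE_SLEEP the server wakes the worker immediately with a NOOP if anything is queued for it, and on job submission it wakes every sleeping worker registered for that function. Removing either NOOP path leaves jobs sitting until a worker's own poll timeout fires.

```cpp
#include <cstdio>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

struct WorkerConn {
  bool sleeping = false;
  std::set<std::string> functions;            // registered via CAN_DO
  void send_noop() { std::puts("NOOP ->"); }  // stand-in for writing a NOOP packet
};

struct Server {
  std::map<std::string, std::queue<std::string>> queues;  // function -> job handles

  // PRE_SLEEP: only let the worker sleep if nothing is queued for it.
  void on_pre_sleep(WorkerConn& w) {
    w.sleeping = true;
    for (const auto& fn : w.functions) {
      auto it = queues.find(fn);
      if (it != queues.end() && !it->second.empty()) {
        w.send_noop();  // "don't sleep, there is work waiting"
        w.sleeping = false;
        return;
      }
    }
  }

  // SUBMIT_JOB: queue the job and wake every sleeping worker that can do it.
  void on_submit_job(const std::string& fn, const std::string& handle,
                     std::vector<WorkerConn*>& workers) {
    queues[fn].push(handle);
    for (WorkerConn* w : workers) {
      if (w->sleeping && w->functions.count(fn)) {
        w->send_noop();
        w->sleeping = false;
      }
    }
  }
};

int main() {
  WorkerConn w;
  w.functions.insert("resize_image");
  Server s;
  s.queues["resize_image"].push("H:example.com:1");
  s.on_pre_sleep(w);  // prints NOOP: a job was already waiting, so no sleep
  return w.sleeping ? 1 : 0;
}
```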
I'm running gearmand using:
To solve the issue reported here I ended up applying the following patch: https://georgi.unixsol.org/programs/gearmand-cpu-spinning.patch In my testing this reduced the CPU usage of the workers and gearmand after multiple OOMs in the VM by at least 10 times. I know that adding usleep() is a nasty hack but in this case it works. Here is what the process is consuming during normal operation before any OOM killer action has taken place:
Here is the consumption of the idle gearmand service, without the applied fixes, after serious OOM killing:
And, finally, this is the consumption with the applied fix (again, after serious OOM kill action):
As can be seen, the gearmand service's CPU consumption still gets affected after OOM kills on the node, but the effect seems to be at least 10 times lighter. |
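For what it's worth, the general shape of such a workaround (sketched here from scratch; this is not the linked patch) is a back-off guard around the spinning loop: if many iterations pass without progress, sleep briefly instead of burning CPU.

```cpp
#include <unistd.h>
#include <cstdint>

// Hypothetical guard: call no_progress() each time an I/O loop iterates without
// reading or writing anything useful, and made_progress() when it does.
struct SpinGuard {
  uint64_t idle_iterations = 0;
  void made_progress() { idle_iterations = 0; }
  void no_progress() {
    if (++idle_iterations > 100)  // threshold is arbitrary; tune for your load
      usleep(1000);               // a 1 ms pause caps the loop at ~1000 iterations/s
  }
};
```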
This is the command my server is running with. |
Neither of you mentioned whether these particular functions have new jobs constantly coming in or not. If there aren't any, it doesn't make sense that the NOOP is coming out consistently, and it makes me think that there's a bug in the _peek function. Anyway, absent a _peek bug, I see clients behaving weirdly like this: ... The code for waiting for a packet after GRAB_JOB is hard to follow, but the only two ways it breaks out into PRE_SLEEP, as far as I can tell, are disconnection or receiving a NOOP when we were expecting JOB_ASSIGN or NO_JOB. I wonder if this patch, then, would work by simply ignoring the NOOP while in this state and waiting for another packet (which should be a NO_JOB, and thus will force the actual PRE_SLEEP cycle to reset):

```diff
index 21019b61..c6573f00 100644
--- a/libgearman/worker.cc
+++ b/libgearman/worker.cc
@@ -856,6 +856,7 @@ gearman_job_st *gearman_worker_grab_job(gearman_worker_st *worker_shell,
       if (worker->job()->impl()->assigned.command == GEARMAN_COMMAND_NOOP)
       {
         gearman_log_debug(worker->universal, "Received NOOP");
+        break;
       }
 
       if (worker->job()->impl()->assigned.command == GEARMAN_COMMAND_JOB_ASSIGN or
```

I'd love it if somebody with a reproducing environment could try that patch out. I also wonder if we can construct a functional test for this, as it would be good to confirm this is the sequence. |
On second thought don't try that patch, it makes clients lock up. Damn. I'll keep looking for this answer though. |
Ok I'll be honest here, debugging this is beyond unfun. If any of you can figure out how to do it without crippling gearmand or the clients (usleeps and removing NOOP-on-job-available are both crippling to workloads IMO) I'll be forever thankful. However, a few years back I set out to rewrite gearmand in Rust precisely because this weird mixed bag of C and C++ in gearmand is just impossibly complex. If anyone affected by this is feeling bold, please do see if rustygeard will work for you. I'll be quite happy to help. I created a Matrix chat room for Gearman (I'll be socializing on the mailing lists and website soonish) so come by and say hi and we'll see if we can't get something simpler going for you. https://matrix.to/#/#gearman:spamaps.ems.host |
Unfortunately I'm finding myself in a similar situation to this... High load on the instances running the daemon and workers. Setup: 3x gearmand (currently 1.1.18) sat behind 1 GCP load balancer. Some workers are numerous (up to 6 with the same name), some are singular. All workers are managed by systemd. Everything is CentOS 7 based and is mainly up to date with packages. Oddly this only seemed to pick up about 1600 on the 21st - we think we've ruled out any possible changes for this in the application code, however. I'm not sure if this is entirely related but... I'm noticing segfaults occurring roughly about the same time the workers appear to be stuck in the request loop:
When querying the segfaults, it's always in the same location (
Regarding the redis persistence stuff - all 3 gearmands have been using either the same remote redis instance, local only, or no persistence at some point during the tests; this doesn't seem to have any bearing on this issue... I was hoping it could be used as a message bus to solve the timeouts we're seeing when submitting the jobs when the daemon itself is busy. Extra threads helped lower the load a little but the issue is still present. Are there any other ideas or things to try here to get more insight into why it started doing this? |
Clearly, gearmand doesn't handle the scenario of workers crashing very well.
You're not using the CPAN module, Gearman::Worker? That's surprising to me. How did you implement a Perl worker using libgearman? I've never heard of anyone doing that.
Well, I implemented my workers using Gearman::Worker and Parallel::ForkManager. If you did the same, that would eliminate the segfault since it's pure Perl. Of course, you might just end up with a different error that still kills the worker. https://metacpan.org/pod/Gearman::Worker |
From what I can see it's plumbed in via Gearman::XS and Gearman::mesh, which appears to be hooking the libgearman stuff. I'm not a writer of Perl stuff, so apologies if my answers in that area aren't on target. We've been running further tests today and there's no change in the behaviour when 1.1.19 is used. The only time there's been a difference is when we switched up from 2-core AMD Milan to 8-core - the load has gone away and it appears to be working "normally" with minimal complaints. The issue is also present on Intel-based machines too. |
More and more I'm convinced that this is a really hard timing problem caused in part by the higher speed and lower latency of modern networks. Gearman was originally designed when AMD Athlons and P4s were running on 100Mbit physical networks. I wonder if we should add some kind of governor on the rate of NOOPs we'll send to any worker.
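A hypothetical sketch of what such a governor might look like on the server side (purely illustrative; nothing like this exists in gearmand today): each worker connection gets a minimum interval between NOOPs, so a worker that was just woken is not woken again immediately.

```cpp
#include <chrono>

// Illustrative rate limiter: allow at most one NOOP per worker per min_interval.
class NoopGovernor {
  using Clock = std::chrono::steady_clock;
  Clock::time_point last_noop_{};
  std::chrono::milliseconds min_interval_;

 public:
  explicit NoopGovernor(std::chrono::milliseconds min_interval)
      : min_interval_(min_interval) {}

  // Call before sending a NOOP to this worker; returns false if it should be skipped.
  bool allow_noop() {
    const auto now = Clock::now();
    if (now - last_noop_ < min_interval_) return false;
    last_noop_ = now;
    return true;
  }
};

// Usage (one instance per worker connection):
//   NoopGovernor gov{std::chrono::milliseconds(50)};
//   if (gov.allow_noop()) { /* send the NOOP packet */ }
```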
@gfto wrote:
Should we consider merging this patch? It's not an optimal solution, but, if it prevents the gearmand server from freaking out, maybe it's better than nothing. |
The patch solves the issue, so in the absence of a better fix I guess it's better to merge it. |
Could you submit a PR? |
@gfto wrote:
I did it for 1.1.20 and it now works much better. But one subprocess is still using 100% CPU without any jobs. |
Just noticed...
P.S. Additionally, you need to check yourself whether a job is already in progress, because you can suddenly receive the same job several times. |
Huh? I don't understand what you wrote there, @easterism. Could you clarify what you meant? Your findings are very interesting. If you repeat the exact same procedure but using the patched version of 1.1.20, what happens? I'd also be interested in the same test procedure with the patch but leaving |
@esabol If this looks off-topic, I'm sorry... I'm just trying to find an answer to why gearman uses so much CPU without any jobs. This thread is the only one.
Well... I use gearman v1.1.20, the PHP wrapper to libgearman v2.1.0, and GearmanManager.
Case 1
H:example.com:1
At the same time, inside the worker, I receive and display the job handle H:example.com:1
Case 2
Add a 5 sec sleep inside the job from Case 1 and run it. Now you may see a duplicate handle inside the worker, like this:
H:example.com:1
I wasted a day trying to find out why... and I have no clue ( |
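For context, submitting a background job through the PHP wrapper ultimately goes through libgearman's client API, and the H:example.com:N string above is the job handle the server returns at submission time. A hedged C++ equivalent (the function name "example", the workload, and the server address are made up) looks roughly like this:

```cpp
#include <libgearman/gearman.h>
#include <cstdio>
#include <cstring>

int main() {
  gearman_client_st* client = gearman_client_create(nullptr);
  gearman_client_add_server(client, "127.0.0.1", 4730);

  const char workload[] = "{\"some\":\"data\"}";
  gearman_job_handle_t handle;  // filled in with e.g. "H:example.com:1"
  gearman_return_t rc = gearman_client_do_background(
      client, "example", nullptr, workload, strlen(workload), handle);

  if (rc == GEARMAN_SUCCESS)
    printf("submitted, handle=%s\n", handle);  // the same handle the worker sees
  else
    fprintf(stderr, "submit failed: %s\n", gearman_client_error(client));

  gearman_client_free(client);
  return rc == GEARMAN_SUCCESS ? 0 : 1;
}
```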
I wouldn't say it's normal behavior at all. It's not something I see on my gearmand server, but my setup is a bit idiosyncratic. |
@esabol Could you share your gearmand setup? What version do you use? Is it already a binary? For what platform? |
@easterism asked:
gearmand 1.1.19.1 in an Ubuntu 16.04.6 LTS Docker container. I compiled it myself using the Dockerfiles provided in the repository. It's been running for 5 months straight and handles a couple hundred jobs per day. The thing I've noticed is that everyone (I think) who has reported this problem is using PHP for their workers. I use Perl for our workers. (Our clients are mostly PHP though.) The PHP library is very buggy, especially so for the release version as opposed to the GitHub repository. Multiple people have been begging the maintainer to release a new version for over a year, and there's been no reply. I recommend compiling it from the GitHub repo source, and I would advise using the libgearman.so from the version of gearmand you've compiled instead of the one from whatever package manager you are using. Oh, I also don't do any background jobs. My jobs are all foreground. What platform are you on? |
Bare Debian 9, 10, 11. Thank you for the advice. For me there is no reason to use foreground jobs at all. I use the gearman client in web apps and the common case is: get data -> process each piece of data -> bring it to the client, and do it fast. I will try to compile from source and make some tests with different PHP versions. Thank you again. |
This doesn't make much sense to me. Who submits the job? The client? How do you get the result back to the client with background jobs?
@SpamapS Yes, the job is submitted by a PHP client.
No need to send the result back at all. You can choose:
|
Hi,
Here is the problem in a nutshell:
Between 1st and 2nd
Every death loop I've seen started with two consecutive GRAB_JOB_ALL and PRE_SLEEP requests. |
@domel5 wrote:
That's some insightful analysis, @domel5 ! It makes sense. Do you have a suggestion for how to fix it? A PR would be most welcome! |
I have no clue at the moment; I'm not familiar with C++. It requires debugging what is going on after
Hey @domel5, thanks for bringing fresh eyes to this very old, very annoying issue. I think what you say is plausible. It's important to note that this code is in libgearman, and as such, that would suggest that only clients who use libgearman would cause this death spiral. I don't know if we have established that, but it could also be that all clients implement similar logic and thus it wouldn't be limited to libgearman. For the libgearman case though, it's also important to note that it's flagging
So, if you're following this issue and you're not using libgearman in non-blocking mode, please speak up! But for those who are... I think the assumption the code is making is that any send_packet calls into the connection flush method, which is essentially a big state machine, and as such, there are many opportunities to return
gearmand/libgearman/connection.cc, lines 887 to 915 at 611517e
This gets into send.c, which, frankly, is really complicated. I do recall @BrianAker telling me about how the hostile random socket protection works a long, long time ago, but it has very few code comments. So, let's call that "a little Sus". If I'm reading libhostile correctly, it would print some things to stderr to suggest it had "hostilized" the server (I guess it has something to do with assuming the server is doing something evil). This would actually set
But, just in case, affected users following along, have you seen this type of error in your stderr?: https://github.com/gearman/gearmand/blob/master/libhostile/function.c#L102-L105
So, assuming libhostile wasn't the culprit, the POSIX
At that point the docs suggest one should call
There's some kind of state bug in all of this. Why are we getting
So, perhaps what might fix this is a defensive measure. Perhaps workers should ignore
I'll ponder on this and look at making a patch in the next few days. |
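For reference, the textbook handling being alluded to - when a non-blocking send() returns EAGAIN/EWOULDBLOCK, wait for POLLOUT and retry instead of bouncing straight back through the state machine - looks roughly like this generic sketch (not gearmand's connection code):

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

#include <cerrno>
#include <cstddef>

// Write an entire buffer on a non-blocking socket. On EAGAIN/EWOULDBLOCK,
// wait for the socket to become writable again (POLLOUT) and retry, rather
// than giving up or spinning.
static ssize_t send_all(int fd, const char* buf, size_t len, int timeout_ms) {
  size_t off = 0;
  while (off < len) {
    ssize_t n = send(fd, buf + off, len - off, MSG_NOSIGNAL);
    if (n > 0) {
      off += static_cast<size_t>(n);
      continue;
    }
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
      struct pollfd pfd = {fd, POLLOUT, 0};
      if (poll(&pfd, 1, timeout_ms) <= 0) return -1;  // timed out or poll failed
      continue;                                       // writable again: retry the send
    }
    if (n < 0 && errno == EINTR) continue;            // interrupted by a signal: retry
    return -1;                                        // real error (or n == 0)
  }
  return static_cast<ssize_t>(off);
}
```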
@SpamapS : All of my workers are implemented in Perl using p-alik's Perl library, and I've never encountered this (or any) problem. That's why it made sense to me that the problem would be in libgearman's worker code. |
Thanks, Ed. Reading that code, it doesn't really implement a state machine; it reads when it expects a response, and errors if it gets something unexpected. So, for instance, if it got a NO_JOB while in PRE_SLEEP, it would just fall through to handling it normally, which is to just send PRE_SLEEP again and mark the connection as sleeping. https://github.com/p-alik/perl-Gearman/blob/master/lib/Gearman/Worker.pm#L299-L317 That is what we want libgearman to do, so I'm more convinced that we should make libgearman resilient to this unexpected sequence (even though I'm still not sure why it's happening). |
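In that spirit, the defensive behaviour could be sketched like this (hypothetical enums and function, not the actual worker.cc state machine): anything unsolicited that arrives while the worker is sleeping either wakes it into a grab, sends it back to PRE_SLEEP, or is simply consumed, and nothing is treated as a fatal or loop-inducing surprise.

```cpp
// Hypothetical command/action enums, for illustration only.
enum class Cmd { NOOP, NO_JOB, JOB_ASSIGN, OTHER };
enum class Action { GRAB_AGAIN, RUN_JOB, SLEEP_AGAIN };

// Decide what a worker sleeping after PRE_SLEEP should do with an unsolicited packet.
Action on_packet_while_sleeping(Cmd cmd) {
  switch (cmd) {
    case Cmd::NOOP:       return Action::GRAB_AGAIN;   // the normal wake-up
    case Cmd::JOB_ASSIGN: return Action::RUN_JOB;      // unexpected, but don't waste it
    case Cmd::NO_JOB:     return Action::SLEEP_AGAIN;  // late reply: re-send PRE_SLEEP,
                                                       // as the Perl worker does
    default:              return Action::SLEEP_AGAIN;  // drain it and keep sleeping
  }
}

int main() {
  // A stray NO_JOB no longer forces another GRAB_JOB round.
  return on_packet_while_sleeping(Cmd::NO_JOB) == Action::SLEEP_AGAIN ? 0 : 1;
}
```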
After some time (hours), both my workers and the gearman daemon start using CPU even though the workers are mostly in a waiting loop. It is very hard to reproduce, but I've compared the standard situation with the berserk mode and can share some details. My workers are connected to 2 gearman daemons.
Standard worker in the waiting loop - it waits for a new job with a timeout of 5 seconds. When no job arrives within this period, some stats are written and the worker calls gearman_worker_work() again, etc. This is the strace of one loop:
The clock_gettime (lines 1 and 13) is called directly before and after the gearman_worker_work() method (lines 2 - 12). Descriptors 11 and 12 are TCP connections to 2 different gearman daemons (G1 and G2). As you can see, it works as expected:
02: GRAB_JOB_ALL from G2
03: read response - failure (not ready yet)
04: wait for the response from G2
05: response from G2: NO_JOB
06-09: the same as 02-05 with G1
10 and 11: report sleep mode to G1 and G2
12: wait for new job with 5 second timeout (and it really timed out - no job arrived within 5 s)
13: see that clock_gettime is called exactly 5 seconds after step 12
But later this changes. I did not find out why or what the trigger is (if any), but this is the new strace in the berserk mode (here G1 and G2 are descriptors 12 and 13; it is a different process):
As you can see, lines 1 - 4 are normal (communication with G1), but after line 5 (GRAB_JOB_ALL on G2) there is no recvfrom(13, ...) call (it is not waiting for the response from G2 - why?) and it goes directly to sleep mode (line 6 for G1 and line 7 for G2) and waits for any input on both connections. G2 sends back the NO_JOB response (line 8), which causes the poll() call on line 8 to return immediately (see the time diff between lines 9 and 8), and it again starts with polling G1, sending a request to G2, but again without waiting for the response, etc. This repeats several times, and then finally the descriptor of G2 is read (line 46), returning the accumulated NO_JOB responses (60 characters), and it goes this way for some time until probably everything G2 has sent is read (the last 72 bytes on line 3654) and the poll() call finally waits, without CPU load, for the timeout.
In this mode, both the worker and the gearman daemon use lots of CPU even though they are doing nothing, because the gearman daemon G2 is called too often only to learn there is nothing to do. This is the summary comparison (10 seconds), in normal mode:
and in berserk mode:
The program works and there is no disruption, but the CPU load on the gearman daemon server goes up with every worker that goes berserk. The workers use 1.5% CPU each (normally it is not measurable) and the gearman daemon can go to several hundred percent (multi-CPU server).
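For anyone trying to reproduce this, here is a minimal sketch of the kind of worker loop described above, using the stock libgearman C API with two servers and a 5-second timeout. The host names and the "echo" function are placeholders, not the reporter's actual setup.

```cpp
#include <libgearman/gearman.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Trivial job callback: echo the workload back to the caller.
static void* echo_fn(gearman_job_st* job, void* /*context*/,
                     size_t* result_size, gearman_return_t* ret) {
  *result_size = gearman_job_workload_size(job);
  void* result = malloc(*result_size);
  if (result != nullptr && *result_size > 0)
    memcpy(result, gearman_job_workload(job), *result_size);
  *ret = (result != nullptr) ? GEARMAN_SUCCESS : GEARMAN_MEMORY_ALLOCATION_FAILURE;
  return result;  // libgearman takes ownership of the returned buffer
}

int main() {
  gearman_worker_st* worker = gearman_worker_create(nullptr);
  gearman_worker_add_server(worker, "g1.example.com", 4730);  // G1
  gearman_worker_add_server(worker, "g2.example.com", 4730);  // G2
  gearman_worker_add_function(worker, "echo", 0, echo_fn, nullptr);
  gearman_worker_set_timeout(worker, 5000);  // 5-second timeout, in milliseconds

  for (;;) {
    // Internally: GRAB_JOB_ALL / NO_JOB / PRE_SLEEP / poll(), as in the straces.
    gearman_return_t rc = gearman_worker_work(worker);
    if (rc != GEARMAN_SUCCESS && rc != GEARMAN_TIMEOUT)
      fprintf(stderr, "worker: %s\n", gearman_worker_error(worker));
    // Stats would be written here before looping back into gearman_worker_work().
  }

  gearman_worker_free(worker);  // unreachable in this sketch
  return 0;
}
```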