gearmand failed to gracefully close connections under certain conditions #150
I'm working with @hwang2014 on this. As a temporary measure, we auto-remediate the issue with the following script:
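(The script itself isn't reproduced here. As a rough illustration only, a minimal sketch of this kind of auto-remediation, assuming the restart-based workaround described at the end of this thread, might look like the following; the threshold and service name are placeholders.)

```bash
#!/bin/bash
# Hypothetical sketch -- not the original script. Restart gearmand once
# CLOSE-WAIT connections pile up. Threshold and service name are placeholders.
THRESHOLD=10000

close_wait=$(ss -tan | grep -c CLOSE-WAIT)

if [ "$close_wait" -gt "$THRESHOLD" ]; then
    logger "gearmand auto-remediation: ${close_wait} CLOSE-WAIT sockets, restarting gearmand"
    service gearmand restart
fi
```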
However, this is not desirable. The issue seems to be present only when gearmand is dealing with clients that are geographically far apart, with a latency of ~40-50 ms between them. We do not see this issue when gearmand and the clients are located in the same datacenter. While 40-50 ms shouldn't be a concern, it's a data point worth sharing. We're running gearmand on CentOS 6.6.
Hi! This is definitely a huge problem if you're seeing it. I want to bring up two things:
Thanks for your patience and for using Gearman!
Hi, the libevent RPM is libevent-1.4.13-4.el6.x86_64. Fairly frequently, we get the following error in the gearmand log:
ERROR 2017-12-16 00:31:18.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
libgearman-server/io.cc:218 is in the connection error-handling code. I'm not sure if it will help or not, but you can start gearmand with more verbose logging (e.g. --verbose DEBUG).
I'm seeing an issue strikingly similar to this. We've attempted increasing the number of threads and have reduced the logging level. Has anyone come across any potential fixes?
What platform/distribution/kernel? What version of libevent?
I guess we eliminated an effect of the issue in #102 but not the cause.
If you can avoid using gearadmin status, you should. Increasing threads will not help, because producing the status report locks all of the internal structures. I've long thought we should add statsd gauge support to gearmand so that it can just fire off status updates to an external system and not encourage people to use the admin protocol.

However, this issue seems to be related to connection handling, and I don't have the resources to debug it. If I had to guess, I'd say that clients and workers are disappearing or closing their connections violently in a way that confuses libevent. For anyone experiencing this, please send your exact libevent and OS/kernel versions, and your client libraries if possible, to help us track down the root cause. Thanks.
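(As a rough illustration of the "push gauges to an external system" idea, which is not an existing gearmand feature: a single, infrequent poller could translate gearadmin --status output into statsd gauges so that individual consumers never touch the admin protocol. The statsd host/port and metric prefix below are placeholders, and a statsd daemon on localhost:8125 is assumed.)

```bash
#!/bin/bash
# Sketch: poll gearadmin --status once, from one place, at a modest interval
# (e.g. from cron), and push the numbers to statsd as gauges.
STATSD_HOST=127.0.0.1
STATSD_PORT=8125

# gearadmin --status prints one tab-separated line per function:
# <function> <queued> <running> <available workers>
gearadmin --status | while IFS=$'\t' read -r func queued running workers; do
    [ -z "$workers" ] && continue   # skip trailing "." / malformed lines
    for metric in queued running workers; do
        printf 'gearman.%s.%s:%s|g\n' "$func" "$metric" "${!metric}" \
            > "/dev/udp/${STATSD_HOST}/${STATSD_PORT}"
    done
done
```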
Based on my experience, I would confirm this assumption. I don't think it depends on a particular environment, because the issue has existed for a long time. [1] Unfortunately I couldn't simulate this behavior of gearmand myself.

[1] https://groups.google.com/forum/#!topic/gearman/nyvLh0ZhmvA
We're running gearmand in a Docker container in host network mode. We're using a phusion base image, which effectively means we're on an Ubuntu 16.04 LTS base, and we've installed gearmand in the image. Here's an example of the Dockerfile:
You can run it with the following docker-compose setup:
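(The Dockerfile and docker-compose file from this comment are not reproduced here. As a rough equivalent of the setup being described, with a hypothetical image name "my-gearmand":)

```bash
# Rough docker run equivalent of the described setup (host network mode).
docker run -d \
  --name gearmand \
  --network host \
  --restart unless-stopped \
  my-gearmand \
  gearmand --verbose INFO -p 4730
```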
Using …
I had this problem. I set a timeout of 30 s for TCP connections, and that solved it for me.
That sounds like a good solution, @maxwbot. Just to be clear, what platform/distribution were you seeing this problem on?
Linux: CentOS 7 / Nagios
We are using gearmand 1.1.21.0 in a Docker container based on the image docker.io/library/python:3.11.4, with libev-libevent-dev/oldstable 1:4.33-1 all. Gearmand is started with "gearmand -t 10 --job-retries 1 --verbose DEBUG -p 4730". We see errors like:
ERROR 2024-02-26 00:01:26.998647 [ 11 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
Does anybody have similar issues and a good solution?
@huamxu wrote:
Could you try the suggestion in the following comment? You might have to do that both on the Docker host and inside the Docker container. If that doesn't solve the problem, try changing "-t 10" to "-t 0" in your gearmand arguments. It's not an ideal solution, but it worked for the person in #150 (comment). You could also try removing "--job-retries 1" from your gearmand arguments.
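(For reference, a sketch of these suggestions applied to the invocation quoted above:)

```bash
# Original invocation from the report above:
#   gearmand -t 10 --job-retries 1 --verbose DEBUG -p 4730
# Suggested variations to try, one change at a time:
gearmand -t 0 --job-retries 1 --verbose DEBUG -p 4730   # "-t 10" changed to "-t 0"
gearmand -t 0 --verbose DEBUG -p 4730                   # additionally drop "--job-retries 1"
```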
Thanks, @esabol, I will give it a try.
@esabol Do you or anybody else know what the "connection timeout" is? How do you configure the timeout?
@huamxu: That's not the problem. Did you try both of the suggestions that I (and others before me) made above?
Yes, we have tried both, but the problem still exists. Gearmand didn't receive anything after the job finished; according to the Gearman plugin log, it sent the WORK_COMPLETE packet.
How long did that job take to execute? Do you have any firewalls or anything similar between gearmand and the Jenkins plugin? The Gearman plugin for Jenkins is just a worker. Gearman wasn't really designed with multi-minute jobs in mind, but the Jenkins plugin makes it work OK. Still, once the worker gets the JOB_ASSIGN packet, no communication goes back and forth until the WORK_COMPLETE. So if your jobs are really long and there's a NAT gateway or stateful firewall in play, those have timeouts, and they can be exceeded. That's the first thing I'd check. You can keep the connection alive with TCP keepalives, but the default time before one is sent is really long, 2 hours:
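(For reference, these are the kernel settings being referred to, annotated with their usual Linux defaults:)

```bash
# Usual Linux defaults (values can vary by distribution):
sysctl net.ipv4.tcp_keepalive_time     # typically 7200 s (2 hours) of idle time before the first probe
sysctl net.ipv4.tcp_keepalive_intvl    # typically 75 s between subsequent probes
sysctl net.ipv4.tcp_keepalive_probes   # typically 9 unanswered probes before the peer is declared dead
```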
I think you have a few options:
Anyway, the original problem, where threading seems to be causing CLOSE_WAIT buildup, is not what you seem to have.
@SpamapS Thanks very much. I will try the third option and reduce tcp_keepalive_time.
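(A minimal sketch of that option, using 600 seconds as an arbitrary example value rather than a recommendation. Per the earlier advice, it may need to be applied on the Docker host as well as inside the container, although with host networking the host setting is the one that applies.)

```bash
# Lower the idle time before the first keepalive probe (example value).
sysctl -w net.ipv4.tcp_keepalive_time=600

# Persist across reboots:
echo 'net.ipv4.tcp_keepalive_time = 600' > /etc/sysctl.d/99-tcp-keepalive.conf
sysctl --system
```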
@pythonerdog wrote:
That's good to know. Thanks!
In my use case, multiple thousands of clients talk to two gearmand servers (v1.1.17).
We periodically run "gearadmin --status" on the servers to check job status. Sometimes (reproducibly, almost every day) "gearadmin --status" gets stuck. When we check connection stats by counting sockets per TCP state, it shows a large number of CLOSE-WAIT connections:
```
ss -tan | awk '{print $1}' | sort | uniq -c
207364 CLOSE-WAIT
17738 ESTAB
64 LAST-ACK
14 LISTEN
1 State
4 TIME-WAIT
```
The temp workaround is to restart gearmand for recovery.