
gearmand failed to gracefully close connections under certain conditions #150

Open
hwang2014 opened this issue Dec 13, 2017 · 24 comments

@hwang2014

In my use case, several thousand clients talk to two gearmand servers (v1.1.17).
We periodically run "gearadmin --status" on the servers to check job status. We observe that "gearadmin --status" sometimes gets stuck (reproducible almost every day). When we check connection stats, there are many CLOSE-WAIT connections:

ss -tan | awk '{print $1}' | sort | uniq -c
207364 CLOSE-WAIT
17738 ESTAB
64 LAST-ACK
14 LISTEN
1 State
4 TIME-WAIT
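
For anyone digging into a similar buildup, a small variation of the same ss pipeline shows which peer hosts the CLOSE-WAIT sockets belong to (a sketch, not a command from the original report):

# count CLOSE-WAIT sockets per peer host (strips the trailing :port from the peer column)
ss -tan | awk '$1 == "CLOSE-WAIT" { sub(/:[^:]*$/, "", $5); print $5 }' | sort | uniq -c | sort -rn | head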

The temporary workaround is to restart gearmand to recover.

@evanrich

evanrich commented Dec 13, 2017

I'm working with @hwang2014 on this. For now, we auto-remediate the issue with the following script:

#!/bin/bash
# If "gearadmin --status" hangs for more than 30 seconds, assume gearmand is wedged:
# restart it and log the event.
timeout 30 gearadmin --status
if [ $? -eq 124 ]    # 124 is the exit code timeout(1) returns when the command times out
then
    TIMESTAMP=$(date)
    service gearmand restart
    echo "$TIMESTAMP: restarting gearmand due to 124 status" >> /tmp/gearmand_status.log
fi

However, this is not desirable. The issue seems to be present only when gearmand is dealing with clients that are geographically distant, with a latency of ~40-50ms between them. We do not see it when gearmand and the clients are located in the same datacenter. While 40-50ms shouldn't be a concern, it's a datapoint worth sharing.

We're running gearmand on CentOS 6.6.

@SpamapS
Member

SpamapS commented Dec 14, 2017

Hi! This is definitely a huge problem if you're seeing it. I want to bring up a few things:

  1. CentOS 6 is really old, released 6 years ago. I understand, it's hard to roll out new platforms, but CentOS 7 has been out for 3 years. So I'm hesitant to spend a ton of time chasing any bugs that are related to CentOS 6 versions of software.

  2. What version of libevent are you using?

  3. Can you send logs from your gearmand instances? The higher the logging level, the better.

  4. Do you have workers or clients that log their gearman interactions? Do you see a lot of failed connections from them?

Thanks for your patience and for using Gearman!

@hwang2014
Author

Hi, the libevent RPM is libevent-1.4.13-4.el6.x86_64.

Fairly frequently, we see the following errors in the gearmand log:

ERROR 2017-12-16 00:31:18.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:31:46.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:01.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:05.000000 [ 1 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:15.000000 [ 1 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:21.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:44.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:02.000000 [ 3 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:27.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:32.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:44.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:00.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:01.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:08.000000 [ 3 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:34.000000 [ 1 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:35:02.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218

@esabol
Member

esabol commented Dec 22, 2017

libgearman-server/io.cc:218 is in _connection_read(). It might indicate that the client or worker is closing the connection to gearmand when gearmand doesn't expect it to. Maybe the problem is in the implementation of the clients or workers? What language did you develop them in?

I'm not sure if it will help or not, but you can start gearmand with --verbose DEBUG or maybe --verbose TRACE (I forget which is higher) to get more detailed logs.
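
For reference, a sketch of such an invocation (the log path here is just an example; the flags are the same ones used elsewhere in this thread):

# run gearmand with maximum verbosity, logging to a file
gearmand --verbose DEBUG --log-file=/var/log/gearmand.log --port=4730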

@Strainy

Strainy commented Feb 6, 2018

I'm seeing an issue strikingly similar to this. We've attempted increasing the number of threads and have reduced the logging level. Has anyone come across any potential fixes?

@esabol
Member

esabol commented Feb 6, 2018

I'm seeing an issue strikingly similar to this.

What platform/distribution/kernel? What version of libevent?

@p-alik
Collaborator

p-alik commented Feb 6, 2018

I guess we eliminated an effect of the issue in #102 but not the cause.

@SpamapS
Member

SpamapS commented Feb 6, 2018

If you can avoid using gearadmin status, you should. Increasing threads will not help, because producing the status report locks all of the internal structures. I've long thought we should add statsd gauge support to gearmand so that it can just fire off status updates to an external system and not encourage people to use the admin protocol.

However, this seems to be related to connection handling. I don't have the resources to debug this, but if I had to guess, I'd say that clients and workers are disappearing or closing their connections violently in a way that is confusing libevent.

For anyone experiencing this effect, please send your exact libevent and OS/kernel versions and your client libraries if possible, to help us track down the root cause. Thanks.
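
A quick way to collect that information (a sketch; file locations and package tooling vary by distribution):

# kernel and distribution
uname -r
cat /etc/os-release
# which libevent the gearmand binary is actually linked against
ldd "$(command -v gearmand)" | grep -i libevent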

@p-alik
Collaborator

p-alik commented Feb 6, 2018

I'd say that clients and workers are disappearing or closing their connections violently in a way that is confusing libevent.

Based on my experience, I would confirm this assumption.

I don't think it depends on a particular environment, because the issue has existed for a long time. [1]

Unfortunately I couldn't simulate this behavior of gearmand with my client/worker implementations.

[1] https://groups.google.com/forum/#!topic/gearman/nyvLh0ZhmvA

@Strainy

Strainy commented Feb 11, 2018

We're running gearmand in a Docker container in host network mode. We're using a phusion base image, which effectively means we're running on an Ubuntu 16.04 LTS base. We've installed libevent-2.0-5 as part of the image build.

Here's an example of the Dockerfile:

FROM phusion/baseimage:0.9.20

ARG GEARMAN_REPO=https://github.com/gearman/gearmand
ARG GEARMAN_VERSION=1.1.17

# Install GearmanD Dependencies
RUN apt-get update -qq && \
    apt-get install -y curl gcc make g++ libboost-all-dev gperf libevent-dev \
        uuid-dev libmysqld-dev libpq-dev libsqlite3-dev libhiredis-dev \
        unzip && \
    curl -L --silent -o /tmp/cyassl.zip http://www.yassl.com/cyassl-3.2.0.zip && \
    cd /tmp && \
    unzip -qq cyassl.zip && \
    cd cyassl* && \
    ./configure \
        --prefix=/usr && \
    make && \
    make install && \
    mkdir /tmp/gearman && \
    curl -L --insecure --silent ${GEARMAN_REPO}/releases/download/${GEARMAN_VERSION}/gearmand-${GEARMAN_VERSION}.tar.gz \
        | tar -xz --strip 1 -C /tmp/gearman && \
    cd /tmp/gearman && \
    ./configure \
        --prefix=/usr \
        --sysconfdir=/etc \
        --localstatedir=/var \
        --enable-hiredis && \
    make && \
    make install && \
    adduser --system --no-create-home --disabled-login --disabled-password --group gearman && \
    apt-get purge -y gcc make gperf libevent-dev uuid-dev \
        libmysqld-dev libpq-dev libsqlite3-dev libhiredis-dev unzip && \
    apt-get autoremove -y && \
    apt-get install -y telnet libmysqlclient20 libhiredis0.13 libevent-2.0-5 libpq5 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

EXPOSE 4730

ENTRYPOINT ["gearmand"]

You can run it with the following docker-compose setup:

version: '2.1'

services:

  gearmand:
    build:
      context: .
      dockerfile: Dockerfile.gearmand
    command:
      - --listen=0.0.0.0
      - --round-robin
      - --verbose=DEBUG
      - --log-file=/dev/stdout
      - --port=4730
      - --queue-type=MySQL
      - --mysql-host=mysql
      - --mysql-port=3306
      - --mysql-user=root
      - --mysql-password=password
      - --mysql-db=gearman
      - --mysql-table=gearman_1
    links:
      - mysql
    ports:
      - "4730:4730"
    depends_on:
      mysql:
        condition: service_healthy

  mysql:
    image: mysql:5.6
    healthcheck:
      test: ["CMD", "mysqladmin" ,"ping", "-h", "localhost"]
      interval: 20s
      timeout: 20s
      retries: 3
    environment:
      - MYSQL_ROOT_PASSWORD=password
      - MYSQL_DATABASE=gearman
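
Assuming the Dockerfile above is saved as Dockerfile.gearmand next to this compose file (as the build section expects), bringing the stack up is something like:

docker-compose up --build -d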

@sni

sni commented Oct 9, 2019

Using --threads=0 solves the issue for me. I had tons of CLOSE_WAIT connections after a couple of minutes and couldn't gracefully stop gearmand, but with threads disabled it runs quite smoothly so far.

@maxwbot

maxwbot commented Jun 10, 2021

I had this problem. I decided to set a time of 30s for TCP connections, which solved it for me:
echo 30 > /proc/sys/net/ipv4/tcp_tw_recycle

@esabol
Member

esabol commented Jun 11, 2021

That sounds like a good solution, @maxwbot. Just to be clear, what platform/distribution were you seeing this problem on?

@maxwbot

maxwbot commented Jun 15, 2021

That sounds like a good solution, @maxwbot. Just to be clear, what platform/distribution were you seeing this problem on?

Linux: CentOS 7 / Nagios

@huamxu

huamxu commented Mar 8, 2024

We are using gearmand 1.1.21.0 in a Docker container based on the image docker.io/library/python:3.11.4.

libev-libevent-dev/oldstable 1:4.33-1 all
libevent-2.1-7/oldstable,now 2.1.12-stable-1 amd64 [installed,automatic]
libevent-core-2.1-7/oldstable,now 2.1.12-stable-1 amd64 [installed,automatic]

Gearmand is started with "gearmand -t 10 --job-retries 1 --verbose DEBUG -p 4730".

ERROR 2024-02-26 00:01:26.998647 [ 11 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
ERROR 2024-02-26 00:14:33.430740 [ 13 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
ERROR 2024-02-26 00:14:33.430775 [ 14 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221

Has anybody had similar issues, and is there any good solution?

@esabol
Member

esabol commented Mar 8, 2024

@huamxu wrote:

Has anybody had similar issues, and is there any good solution?

Could you try the suggestion in the following comment?

#150 (comment)

You might have to do that both on the Docker host and inside the Docker container.

If that doesn't solve the problem, try changing "-t 10" to "-t 0" in your gearmand arguments. It's not an ideal solution, but it worked for the person in #150 (comment).

You could also try removing "--job-retries 1" from your gearmand arguments.
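
Putting the two command-line suggestions together, the adjusted invocation would look something like this (a sketch based on the command line quoted earlier in this thread):

# before: gearmand -t 10 --job-retries 1 --verbose DEBUG -p 4730
# after: threading disabled and --job-retries removed
gearmand -t 0 --verbose DEBUG -p 4730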

@huamxu

huamxu commented Mar 8, 2024

Thanks, @esabol, I will give it a try.

@huamxu

huamxu commented Apr 3, 2024

@esabol
ERROR 2024-02-26 00:14:33.430775 [ 14 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221

Do you or anybody else know what this "Connection timed out" error is, and how to configure the timeout?
Thanks very much.

@esabol
Member

esabol commented Apr 3, 2024

@huamxu : That's not the problem. Did you try both of the suggestions I (and others before me) made above?

@huamxu

huamxu commented Apr 3, 2024

Yes, we have tried both, but the problem still exists.
Our scenario is:
1: The Jenkins Gearman plugin acts as the Gearman workers.
2: Multiple Jenkins masters with the same configuration connect to the Gearman server.

Gearmand didn't receive anything after the job finished, although the Gearman plugin log shows they sent the WORK_COMPLETE packet.

@SpamapS
Member

SpamapS commented Apr 6, 2024

How long did that job take to execute? Do you have any firewalls or such between gearmand and the Jenkins plugin?

The gearman plugin for Jenkins is just a worker. Gearman wasn't really designed with multi-minute jobs in mind, but the Jenkins plugin makes it work OK. Still, once you get the JOB_ASSIGN packet, no communication goes back and forth until the WORK_COMPLETE. So if your jobs are really long and there's a NAT gateway or stateful firewall at play, then those have timeouts, and they can be exceeded. That's the first thing I'd check. You can keep the connection alive with TCP keepalives. The default time before one is sent is really long though, 2 hours:

$ cat /proc/sys/net/ipv4/tcp_keepalive_time
7200

I think you have a few options:

  1. The gearman plugin is really old and has security vulnerabilities. Figure out a different way to run a pool of Jenkins workers. It might even be built in now? I haven't looked in a while.
  2. Extend the timeouts that are closing the connections.
  3. You can reduce tcp_keepalive_time to something shorter, like 600 seconds, which will make the kernel send keepalives that keep a NAT/firewall happy (see the sketch after this list).
  4. Update the gearman Jenkins plugin to send WORK_STATUS or WORK_DATA packets periodically.
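
A minimal sketch of option 3 (the values are illustrative; apply them on the relevant host and persist them via your usual sysctl configuration if they help):

# shorten the kernel's TCP keepalive timers
sysctl -w net.ipv4.tcp_keepalive_time=600   # first keepalive probe after 10 minutes idle instead of 2 hours
sysctl -w net.ipv4.tcp_keepalive_intvl=60   # seconds between subsequent probes
sysctl -w net.ipv4.tcp_keepalive_probes=5   # unanswered probes before the connection is dropped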

Anyway, the original problem, where threading seems to be causing CLOSE_WAIT buildup, is not what you seem to have.

@huamxu

huamxu commented Apr 7, 2024

@SpamapS Thanks very much
That's good info for local investigation. Jenkins jobs sometimes run for about 2 hours.
Do you know how to extend the timeouts mentioned in the 2nd option?
"""
2. Extend the timeouts that are closing the connections
"""

I will try the 3rd option to reduce tcp_keepalive_time, and also check the gearman Jenkins plugin.

@pythonerdog

@SpamapS @esabol
Thanks for your good support.
After enabling keepalive, the issue hasn't happened again in a long time ^_^

@esabol
Member

esabol commented May 27, 2024

@pythonerdog wrote:

After enabling keepalive, the issue hasn't happened again in a long time ^_^

That's good to know. Thanks!
