
gearmand failed to gracefully close connections under certain conditions #150

Open
hwang2014 opened this issue Dec 13, 2017 · 24 comments

@hwang2014

In my use case, several thousand clients talk to two gearmand servers (v1.1.17).
We periodically run "gearadmin --status" on the servers to check job status. We observe that "gearadmin --status" sometimes gets stuck (reproducible almost every day). When we check connection stats, there are many CLOSE-WAIT connections:

ss -tan | awk '{print $1}' | sort | uniq -c
207364 CLOSE-WAIT
17738 ESTAB
64 LAST-ACK
14 LISTEN
1 State
4 TIME-WAIT
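
For anyone digging into a similar buildup, a small variation of the same ss pipeline shows which peer hosts the CLOSE-WAIT sockets belong to (a sketch, not a command from the original report):

# count CLOSE-WAIT sockets per peer host (strips the trailing :port from the peer column)
ss -tan | awk '$1 == "CLOSE-WAIT" { sub(/:[^:]*$/, "", $5); print $5 }' | sort | uniq -c | sort -rn | head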

The temporary workaround is to restart gearmand to recover.

@evanrich

evanrich commented Dec 13, 2017

I'm working with @hwang2014 on this. For now, we auto-remediate the issue with the following script:

#!/bin/bash
# If "gearadmin --status" hangs for more than 30 seconds, assume gearmand is wedged:
# restart it and log the event.
timeout 30 gearadmin --status
if [ $? -eq 124 ]    # 124 is the exit code timeout(1) returns when the command times out
then
    TIMESTAMP=$(date)
    service gearmand restart
    echo "$TIMESTAMP: restarting gearmand due to 124 status" >> /tmp/gearmand_status.log
fi

However, this is not desirable. The issue seems to be present only when gearmand is dealing with clients that are geographically distant, with a latency of ~40-50ms between them. We do not see it when gearmand and the clients are located in the same datacenter. While 40-50ms shouldn't be a concern, it's a datapoint worth sharing.

We're running gearmand on CentOS 6.6.

@SpamapS
Member

SpamapS commented Dec 14, 2017

Hi! This is definitely a huge problem if you're seeing it. I want to bring up a few things:

  1. CentOS 6 is really old, released 6 years ago. I understand, it's hard to roll out new platforms, but CentOS 7 has been out for 3 years. So I'm hesitant to spend a ton of time chasing any bugs that are related to CentOS 6 versions of software.

  2. What version of libevent are you using?

  3. Can you send logs from your gearmand instances? The higher the logging level, the better.

  4. Do you have workers or clients that log their gearman interactions? Do you see a lot of failed connections from them?

Thanks for your patience and for using Gearman!

@hwang2014
Author

Hi, the libevent RPM is libevent-1.4.13-4.el6.x86_64.

Fairly frequently, we see the following errors in the gearmand log:

ERROR 2017-12-16 00:31:18.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:31:46.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:01.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:05.000000 [ 1 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:15.000000 [ 1 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:21.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:32:44.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:02.000000 [ 3 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:27.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:32.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:33:44.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:00.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:01.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:08.000000 [ 3 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:34:34.000000 [ 1 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-12-16 00:35:02.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218

@esabol
Member

esabol commented Dec 22, 2017

libgearman-server/io.cc:218 is in _connection_read(). It might indicate that the client or worker is closing the connection to gearmand when gearmand doesn't expect it to. Maybe the problem is in the implementation of the clients or workers? What language did you develop them in?

I'm not sure if it will help or not, but you can start gearmand with --verbose DEBUG or maybe --verbose TRACE (I forget which is higher) to get more detailed logs.
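
For reference, a sketch of such an invocation (the log path here is just an example; the flags are the same ones used elsewhere in this thread):

# run gearmand with maximum verbosity, logging to a file
gearmand --verbose DEBUG --log-file=/var/log/gearmand.log --port=4730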

@Strainy

Strainy commented Feb 6, 2018

I'm seeing an issue strikingly similar to this. We've attempted increasing the number of threads and have reduced the logging level. Has anyone come across any potential fixes?

@esabol
Member

esabol commented Feb 6, 2018

I'm seeing an issue strikingly similar to this.

What platform/distribution/kernel? What version of libevent?

@p-alik
Collaborator

p-alik commented Feb 6, 2018

I guess we eliminated an effect of the issue in #102 but not the cause.

@SpamapS
Member

SpamapS commented Feb 6, 2018

If you can avoid using gearadmin status, you should. Increasing threads will not help, because producing the status report locks all of the internal structures. I've long thought we should add statsd gauge support to gearmand so that it can just fire off status updates to an external system and not encourage people to use the admin protocol.

However, this seems to be related to connection handling. I don't have the resources to debug this, but if I had to guess, I'd say that clients and workers are disappearing or closing their connections violently in a way that is confusing libevent.

For anyone experiencing this effect, please send your exact libevent and OS/kernel versions and your client libraries if possible, to help us track down the root cause. Thanks.
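
A quick way to collect that information (a sketch; file locations and package tooling vary by distribution):

# kernel and distribution
uname -r
cat /etc/os-release
# which libevent the gearmand binary is actually linked against
ldd "$(command -v gearmand)" | grep -i libevent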

@p-alik
Collaborator

p-alik commented Feb 6, 2018

I'd say that clients and workers are disappearing or closing their connections violently in a way that is confusing libevent.

Based on my experience, I would confirm this assumption.

I don't think it depends on a particular environment, because the issue has existed for a long time. [1]

Unfortunately I couldn't simulate this behavior of gearmand with my client/worker implementations.

[1] https://groups.google.com/forum/#!topic/gearman/nyvLh0ZhmvA

@Strainy

Strainy commented Feb 11, 2018

We're running gearmand in a Docker container in host network mode. We're using a phusion base image, which effectively means we're running on an Ubuntu 16.04 LTS base. We've installed libevent-2.0-5 as part of the image build.

Here's an example of the Dockerfile:

FROM phusion/baseimage:0.9.20

ARG GEARMAN_REPO=https://github.com/gearman/gearmand
ARG GEARMAN_VERSION=1.1.17

# Install GearmanD Dependencies
RUN apt-get update -qq && \
    apt-get install -y curl gcc make g++ libboost-all-dev gperf libevent-dev \
        uuid-dev libmysqld-dev libpq-dev libsqlite3-dev libhiredis-dev \
        unzip && \
    curl -L --silent -o /tmp/cyassl.zip http://www.yassl.com/cyassl-3.2.0.zip && \
    cd /tmp && \
    unzip -qq cyassl.zip && \
    cd cyassl* && \
    ./configure \
        --prefix=/usr && \
    make && \
    make install && \
    mkdir /tmp/gearman && \
    curl -L --insecure --silent ${GEARMAN_REPO}/releases/download/${GEARMAN_VERSION}/gearmand-${GEARMAN_VERSION}.tar.gz \
        | tar -xz --strip 1 -C /tmp/gearman && \
    cd /tmp/gearman && \
    ./configure \
        --prefix=/usr \
        --sysconfdir=/etc \
        --localstatedir=/var \
        --enable-hiredis && \
    make && \
    make install && \
    adduser --system --no-create-home --disabled-login --disabled-password --group gearman && \
    apt-get purge -y gcc make gperf libevent-dev uuid-dev \
        libmysqld-dev libpq-dev libsqlite3-dev libhiredis-dev unzip && \
    apt-get autoremove -y && \
    apt-get install -y telnet libmysqlclient20 libhiredis0.13 libevent-2.0-5 libpq5 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

EXPOSE 4730

ENTRYPOINT ["gearmand"]

You can run it with the following docker-compose setup:

version: '2.1'

services:

  gearmand:
    build:
      context: .
      dockerfile: Dockerfile.gearmand
    command:
      - --listen=0.0.0.0
      - --round-robin
      - --verbose=DEBUG
      - --log-file=/dev/stdout
      - --port=4730
      - --queue-type=MySQL
      - --mysql-host=mysql
      - --mysql-port=3306
      - --mysql-user=root
      - --mysql-password=password
      - --mysql-db=gearman
      - --mysql-table=gearman_1
    links:
      - mysql
    ports:
      - "4730:4730"
    depends_on:
      mysql:
        condition: service_healthy

  mysql:
    image: mysql:5.6
    healthcheck:
      test: ["CMD", "mysqladmin" ,"ping", "-h", "localhost"]
      interval: 20s
      timeout: 20s
      retries: 3
    environment:
      - MYSQL_ROOT_PASSWORD=password
      - MYSQL_DATABASE=gearman
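
Assuming the Dockerfile above is saved as Dockerfile.gearmand next to this compose file (as the build section expects), bringing the stack up is something like:

docker-compose up --build -d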

@sni

sni commented Oct 9, 2019

Using --threads=0 solves the issue for me. I had tons of CLOSE_WAIT connections after a couple of minutes and couldn't gracefully stop gearmand, but with threads disabled it runs quite smoothly so far.

@maxwbot

maxwbot commented Jun 10, 2021

I had this problem. I decided to set a time of 30s for TCP connections, which solved it for me:
echo 30 > /proc/sys/net/ipv4/tcp_tw_recycle

@esabol
Member

esabol commented Jun 11, 2021

That sounds like a good solution, @maxwbot. Just to be clear, what platform/distribution were you seeing this problem on?

@maxwbot

maxwbot commented Jun 15, 2021

That sounds like a good solution, @maxwbot. Just to be clear, what platform/distribution were you seeing this problem on?

Linux: CentOS 7 / Nagios

@huamxu

huamxu commented Mar 8, 2024

We are using gearmand 1.1.21.0 in a Docker container based on the image docker.io/library/python:3.11.4.

libev-libevent-dev/oldstable 1:4.33-1 all
libevent-2.1-7/oldstable,now 2.1.12-stable-1 amd64 [installed,automatic]
libevent-core-2.1-7/oldstable,now 2.1.12-stable-1 amd64 [installed,automatic]

Gearmand is started with "gearmand -t 10 --job-retries 1 --verbose DEBUG -p 4730".

ERROR 2024-02-26 00:01:26.998647 [ 11 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
ERROR 2024-02-26 00:14:33.430740 [ 13 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
ERROR 2024-02-26 00:14:33.430775 [ 14 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221

Has anybody had similar issues, and is there any good solution?

@esabol
Member

esabol commented Mar 8, 2024

@huamxu wrote:

Has anybody had similar issues, and is there any good solution?

Could you try the suggestion in the following comment?

#150 (comment)

You might have to do that both on the Docker host and inside the Docker container.

If that doesn't solve the problem, try changing "-t 10" to "-t 0" in your gearmand arguments. It's not an ideal solution, but it worked for the person in #150 (comment).

You could also try removing "--job-retries 1" from your gearmand arguments.
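
Putting the two command-line suggestions together, the adjusted invocation would look something like this (a sketch based on the command line quoted earlier in this thread):

# before: gearmand -t 10 --job-retries 1 --verbose DEBUG -p 4730
# after: threading disabled and --job-retries removed
gearmand -t 0 --verbose DEBUG -p 4730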

@huamxu

huamxu commented Mar 8, 2024

Thanks, @esabol, I will give it a try.

@huamxu

huamxu commented Apr 3, 2024

@esabol
ERROR 2024-02-26 00:14:33.430775 [ 14 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221

Do you or anybody else know what this "Connection timed out" error is, and how to configure the timeout?
Thanks very much.

@esabol
Member

esabol commented Apr 3, 2024

@huamxu : That's not the problem. Did you try both of the suggestions I (and others before me) made above?

@huamxu

huamxu commented Apr 3, 2024

Yes, we have tried both, but the problem still exists.
Our scenario is:
1: The Jenkins Gearman plugin acts as the Gearman workers.
2: Multiple Jenkins masters with the same configuration connect to the Gearman server.

Gearmand didn't receive anything after the job finished, although the Gearman plugin log shows they sent the WORK_COMPLETE packet.

@SpamapS
Member

SpamapS commented Apr 6, 2024

How long did that job take to execute? Do you have any firewalls or such between gearmand and the Jenkins plugin?

The gearman plugin for Jenkins is just a worker. Gearman wasn't really designed with multi-minute jobs in mind, but the Jenkins plugin makes it work OK. Still, once you get the JOB_ASSIGN packet, no communication goes back and forth until the WORK_COMPLETE. So if your jobs are really long and there's a NAT gateway or stateful firewall at play, then those have timeouts, and they can be exceeded. That's the first thing I'd check. You can keep the connection alive with TCP keepalives. The default time before one is sent is really long though, 2 hours:

$ cat /proc/sys/net/ipv4/tcp_keepalive_time
7200

I think you have a few options:

  1. The gearman plugin is really old and has security vulnerabilities. Figure out a different way to run a pool of Jenkins workers. It might even be built in now? I haven't looked in a while.
  2. Extend the timeouts that are closing the connections.
  3. You can reduce tcp_keepalive_time to something shorter, like 600 seconds, which will make the kernel send keepalives that keep a NAT/firewall happy (see the sketch after this list).
  4. Update the gearman Jenkins plugin to send WORK_STATUS or WORK_DATA packets periodically.
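
A minimal sketch of option 3 (the values are illustrative; apply them on the relevant host and persist them via your usual sysctl configuration if they help):

# shorten the kernel's TCP keepalive timers
sysctl -w net.ipv4.tcp_keepalive_time=600   # first keepalive probe after 10 minutes idle instead of 2 hours
sysctl -w net.ipv4.tcp_keepalive_intvl=60   # seconds between subsequent probes
sysctl -w net.ipv4.tcp_keepalive_probes=5   # unanswered probes before the connection is dropped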

Anyway, the original problem, where threading seems to be causing CLOSE_WAIT buildup, is not what you seem to have.

@huamxu

huamxu commented Apr 7, 2024

@SpamapS Thanks very much
That's good info for local investigation. Jenkins jobs sometimes run for about 2 hours.
Do you know how to extend the timeouts mentioned in the 2nd option?
"""
2. Extend the timeouts that are closing the connections
"""

I will try the 3rd option to reduce tcp_keepalive_time, and also check the gearman Jenkins plugin.

@pythonerdog

@SpamapS @esabol
Thanks for your good support.
After enabling keepalive, the issue hasn't happened again in a long time ^_^

@esabol
Member

esabol commented May 27, 2024

@pythonerdog wrote:

After enabling keepalive, the issue hasn't happened again in a long time ^_^

That's good to know. Thanks!
