bug: In the case where heartbeat checking is enabled, unhealthy nodes are inexplicably added back and then kicked out again. #10500
Comments
@wodingyang Are there any steps to reproduce? For example, does etcd have a compaction policy enabled, etc.? |
v2.15.3 is very old and not in LTS. I could not reproduce this with the latest version. I recommend you upgrade to the latest release. |
@moonming Indeed, we have found that enabling compaction in etcd can cause this issue. |
@shreemaan-abhishek We have confirmed that this is a hidden bug caused by enabling compaction in etcd. The active check itself is not the problem, but when etcd compaction is enabled, it triggers a reload of the /upstream key that APISIX is watching, causing the original checker to be discarded. Even the latest version of etcd still exhibits this issue. |
The revision of etcd should not change, even if etcd is compacted. So how is this /upstream reload triggered?
If the configuration of the control plane does not change, then even if etcd is compacted, the revision of etcd will not change. In that case the APISIX data plane will not re-sync its configuration, and the cache will not be cleared. To give an example: the etcd revision at 9 a.m. is 100. At 9:10, etcd starts to compact the data while new changes are also being written to etcd, so the latest revision becomes 110. When the compaction is completed, the data for revisions 1 to 109 is gone. The solution can be to not clear all historical revisions when performing etcd compaction, but to retain some of the most recent revisions, such as the most recent 1,000 (for example, via etcd's revision-based auto-compaction). You can try this solution. If you have any questions, you can update here. |
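A minimal sketch of that suggestion, assuming etcd with revision-based auto-compaction available; the retention value of 1000 revisions is only an illustration, not a recommendation from this thread:

```bash
# Keep the most recent 1000 revisions on every auto-compaction run instead of
# discarding all history, so the revision APISIX is watching stays resolvable.
etcd --auto-compaction-mode=revision \
     --auto-compaction-retention=1000
```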
The suggested configuration can only mitigate the issue but may not completely resolve it. We are currently puzzled because the "/apisix/upstreams" key has not changed, while our system automatically reports data to our own custom keys; in theory, the watcher for "/apisix/upstreams" should not be triggered. We plan to reduce the compaction frequency to once a day, but we still hope to identify and resolve this problem at its root. |
@wodingyang I have created a test case to reproduce this issue. Can you help me take a look? If you agree with my test case, can you close this issue? We will track this problem in another issue. |
From the APISIX side, can we cache the last revision to mitigate this issue? |
As long as etcd enables the compaction policy, executes the compaction, and notifies /apisix/upstreams to update the cache, this issue will be reproduced. Moreover, the notification to /apisix/upstreams is not only sent when modifying upstreams; I am also unclear under what circumstances etcd compacts the key and triggers this update. |
OK. I will take over this issue. Please assign it to me. |
It seems like I don't have the permission to assign. |
@shreemaan-abhishek, can you please assign this to me? |
We also suffered from this problem. |
Could you please share your compaction configuration? |
I was able to force-update the revision number during compaction, which led to a complete reload of APISIX resources in the cache. But this did not lead to the behavior described in this issue. Steps to repro:
```bash
# Terminal 1: keep writing keys so the etcd revision keeps advancing
while true
do
  etcdctl put a b
done
```

```bash
# Terminal 2: keep sending requests through the route so the health checker is exercised
while true
do
  sleep 0.5
  curl -i "http://127.0.0.1:9080/hello"
done
```
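One way to actually trigger the compaction while the two loops above are running (this step is my assumption, not part of the original repro notes) is to compact etcd manually; the jq field path may vary slightly between etcdctl versions:

```bash
# Compact up to the current revision; any watcher pinned to an older revision
# will then receive a "required revision has been compacted" error.
rev=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
etcdctl compact "$rev"
```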
|
@wodingyang, even if you do not change the "/apisix/upstreams" key, the etcd revision can be updated by modifying other keys. |
At first, I add the upstream like below (configuration omitted); the 172.20.xxx.xxx node is OK, and the 172.30.xxx.xxx node is bad. Then I create a route which binds the upstream, and I send a request to this route; it returns the result I expect. Then I invoke http://apisix:9090/v1/healthcheck, and it returns the expected health-check status (output omitted). Then I modify the etcd config to auto-compaction-retention=1m. After a while, the apisix log is as follows (log omitted). Then I invoke http://apisix:9090/v1/healthcheck again and the result is an empty table {}. Then I send a request to my test route; it still returns the data I want, but the response codes are 502 and then 200. The author thinks the health check re-added the unhealthy node, but I think the reason is that both the healthy and the unhealthy node were deleted. This scenario seems the same as when you create a route but send no requests: once a request is sent, the health check starts. |
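For clarity, the aggressive compaction setting mentioned above corresponds to something like the following etcd startup flags (a sketch; the exact flag syntax depends on how etcd is launched in your environment):

```bash
# Periodic auto-compaction roughly every minute: after each run, revisions
# older than the retention window are dropped, including the one APISIX watches.
etcd --auto-compaction-mode=periodic \
     --auto-compaction-retention=1m
```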
This issue should be very clear. When etcd enables key compaction, each compaction causes part of the upstream cache to become invalid and be reloaded. Moreover, the active heartbeat check does not run without user requests. When these two issues occur at the same time, problematic nodes are added back to the upstream without being kicked out, leading to abnormal user requests. I would like to ask whether the maintainers will acknowledge and resolve this issue. |
The best-practice way: when compacting etcd data, keep more of the recent historical revisions (such as the last 1k). Then the lua client will not get compaction errors. |
I think we have found the reason and how to reproduce this issue. Reason: in the code shown in the referenced figure (omitted here), when the revision APISIX is watching has been compacted, it will flush all cache related to this piece of content (including the healthcheck of the nodes). Q: Why will it happen (etcd compaction)? Q: Why does the flush affect the checker? (The detailed answers were given in screenshots, omitted here.)
Reproduce steps:
curl "http://127.0.0.1:9180/apisix/admin/routes/6" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '
{
"methods": ["GET"],
"host": "example.com",
"uri": "/anything/*",
"upstream": {
"type": "roundrobin",
"nodes": {
"httpbin.org:80": 1
}
}
}'
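To observe the effect on the health checker, one can dump its state via the control API before and after a compaction; a sketch assuming the default control API address 127.0.0.1:9090 (the same endpoint used earlier in this thread) and the route defined above:

```bash
# Send a request so the health checker for the route is created, then dump its state.
curl -i "http://127.0.0.1:9080/anything/get" -H "Host: example.com"
curl "http://127.0.0.1:9090/v1/healthcheck"

# After a compaction invalidates the watched revision, the same healthcheck call
# returns an empty table until a new request re-creates the checker.
```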
How to resolve this problem:
For now, I think this issue could be closed, cc @shreemaan-abhishek |
This issue can be reproduced in version 3.6, so upgrading to any version above 3.2 will not resolve it. |
This issue cannot be resolved by APISIX itself. The real reason is on the etcd side: compaction removes the revision APISIX is watching and forces a full re-sync. So the good way is: keep the most recent revisions when compacting etcd. |
Anyway, it's a good idea to reach the goal in a roundabout way. 😄 |
Current Behavior
Configure two nodes in the upstream, one healthy and one unhealthy, and enable active heartbeat checking. Then initiate a user request; once the heartbeat health check is running, the unhealthy node is kicked out. After waiting for a while, the unhealthy node is observed to be added back and then evicted again. During this waiting period, the unhealthy node is added back regardless of whether there are user requests. In our testing, we disabled retries. After a continuous series of requests, the unhealthy node was observed to be added back, the testing system received a significant number of 502 errors, and the API gateway logs showed that the 502 requests had been forwarded to the unhealthy node.
Expected Behavior
We attempted to inspect the source code and found in upstream.lua that the first time the heartbeat check is enabled, it calls `healthcheck.new()`, and subsequent requests go through the logic of `if healthcheck_parent.checker and healthcheck_parent.checker_upstream == upstream then`. However, after a certain period of time, this `if` condition no longer holds: a new checker is created and the previous checker is deleted. We are unsure about the reason behind this `if` statement and would appreciate assistance in identifying the issue. The time until unhealthy nodes are added back is highly variable, sometimes around one minute and other times around five minutes. Is there any caching or other mechanism involved in this process?
Error Logs
No response
Steps to Reproduce
Step 1: Configure the upstream and the route (see the example sketch after these steps)
Step 2: Continuously initiate user requests
Step 3: Check the logs and observe the unhealthy node being kicked out (log screenshot omitted)
Step 4: After a certain period of time, observe the same node being added back and kicked out again (log screenshot omitted)
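For reference, a minimal sketch of what Step 1 could look like; the node addresses, health-check endpoint, and thresholds are assumptions, not the reporter's actual values, and on the 2.15.x version reported here the admin API defaults to port 9080 rather than 9180:

```bash
# Hypothetical Step 1: an upstream with one healthy and one unhealthy node,
# active health checking enabled, plus a route bound to it.
curl "http://127.0.0.1:9180/apisix/admin/upstreams/1" \
  -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '
{
  "type": "roundrobin",
  "nodes": {
    "172.20.0.10:80": 1,
    "172.30.0.10:80": 1
  },
  "checks": {
    "active": {
      "type": "http",
      "http_path": "/health",
      "healthy": { "interval": 1, "successes": 2 },
      "unhealthy": { "interval": 1, "http_failures": 2 }
    }
  }
}'

# Route used by the request loop in Step 2 (matches the /hello curl shown earlier).
curl "http://127.0.0.1:9180/apisix/admin/routes/1" \
  -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '
{
  "uri": "/hello",
  "upstream_id": "1"
}'
```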
Environment
- APISIX version (`apisix version`): 2.15.3
- Operating system (`uname -a`): CentOS 7
- OpenResty / Nginx version (`openresty -V` or `nginx -V`): 1.21.4.1