Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: In the case where heartbeat checking is enabled, unhealthy nodes are inexplicably added back and then kicked out again. #10500

Closed
wodingyang opened this issue Nov 16, 2023 · 25 comments
Assignees
Labels
bug Something isn't working discuss

Comments

@wodingyang
Copy link

wodingyang commented Nov 16, 2023

Current Behavior

Configure two nodes in the upstream, one healthy and one unhealthy, and enable active heartbeat checking. Then initiate a user request, and when heartbeat health is enabled, the unhealthy node is kicked out. After waiting for a while, it is observed that the unhealthy node is added back and then re-triggered for eviction.During this waiting period, regardless of whether there are user requests or not, the unhealthy node is added back. In our testing, we disabled retries. After a continuous series of requests, it was observed that the unhealthy node was added back. The testing system received a significant number of 502 errors, and upon checking the logs of API Gateway, it was found that these 502 errors were forwarded to the unhealthy node.

Expected Behavior

We attempted to inspect the source code and found in upstream.lua that...

local healthcheck_parent = upstream.parent
    if healthcheck_parent.checker and healthcheck_parent.checker_upstream == upstream then
        return healthcheck_parent.checker
    end

    local checker, err = healthcheck.new({
        name = get_healthchecker_name(healthcheck_parent),
        shm_name = "upstream-healthcheck",
        checks = upstream.checks,
    })

The first time the heartbeat is enabled, it calls healthcheck.new(), and subsequent requests go through the logic of if healthcheck_parent.checker and healthcheck_parent.checker_upstream == upstream then. However, after a certain period of time, this if statement becomes ineffective and a new check is created, while the previous check is deleted. We are unsure about the reason behind this if statement and would appreciate assistance in identifying the issue.
The time for adding unhealthy nodes back is highly variable, sometimes around one minute and other times around five minutes. Is there any caching or mechanism involved in this process?

Error Logs

No response

Steps to Reproduce

Step 1: Configuring the upstream and route

{
  "nodes": {
    "10.110.3.52:8888": 1,
    "10.110.3.51:8888": 1,
    "10.110.3.50:8888": 1
  },
  "retries": 0,
  "name": "测试心跳",
  "type": "roundrobin",
  "timeout": {
    "connect": 60,
    "read": 60,
    "send": 60
  },
  "pass_host": "pass",
  "scheme": "http",
  "keepalive_pool": {
    "idle_timeout": 60,
    "requests": 1000,
    "size": 320
  },
  "checks": {
    "active": {
      "concurrency": 10,
      "healthy": {
        "http_statuses": [
          200,
          302
        ],
        "interval": 1,
        "successes": 2
      },
      "http_path": "/firstwork/healthcheck",
      "timeout": 5,
      "type": "http",
      "unhealthy": {
        "http_failures": 5,
        "http_statuses": [
          429,
          404,
          500,
          501,
          502,
          503,
          504,
          505
        ],
        "interval": 10,
        "tcp_failures": 2,
        "timeouts": 2
      }
    }
  }
}
{
  "uri": "/*",
  "upstream_id": "00021698712371002437",
  "name": "测试心跳问题",
  "host": "",
  "methods": [
    "GET",
    "HEAD",
    "POST",
    "PUT",
    "DELETE",
    "OPTIONS",
    "PATCH",
    "CONNECT",
    "TRACE"
  ],
  "priority": 0,
  "enable_websocket": false,
  "status": 1
}

Step 2: Continuously initiate user requests
Step 3: Check the logs and receive

2023/11/16 11:19:45 [warn] 30949#30949: *13223 [lua] healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (1/2) for '(10.110.3.50:8888)'
2023/11/16 11:19:46 [warn] 30949#30949: *13631 [lua] healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (2/2) for '(10.110.3.50:8888)'

Step 4: Receive again after a certain period of time

2023/11/16 11:24:46 [warn] 30949#30949: *25421 [lua] healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (1/2) for '(10.110.3.50:8888)'
2023/11/16 11:24:47 [warn] 30949#30949: *25591 [lua] healthcheck.lua:1107: log(): [healthcheck] (upstream#/ding/upstreams/00021698712371002437) unhealthy TCP increment (2/2) for '(10.110.3.50:8888)'

Environment

  • APISIX version (run apisix version):2.15.3
  • Operating system (run uname -a):centos7
  • OpenResty / Nginx version (run openresty -V or nginx -V):1.21.4.1
@shreemaan-abhishek shreemaan-abhishek added the bug Something isn't working label Nov 16, 2023
@moonming
Copy link
Member

@wodingyang Are there any steps to reproduce? For example, etcd starts the compression policy, etc.

@shreemaan-abhishek
Copy link
Contributor

v2.15.3 is very old and not in LTS. I could not reproduce this with the latest version. Recommend you to upgrade to latest.

@wodingyang
Copy link
Author

@moonming Indeed, we have found that enabling compression in etcd can cause this issue.

@wodingyang
Copy link
Author

@shreemaan-abhishek We have confirmed that this is a hidden bug caused by enabling compression in etcd. The active check itself is not the problem, but when etcd compression is enabled, it triggers a reload of the /upstream that APIsix is listening to, causing the original check to fail. Even the latest version of etcd still experiences this issue.

@moonming
Copy link
Member

but when etcd compression is enabled, it triggers a reload of the /upstream that APIsix is listening to, causing the original check to fail

The revision of etcd should not change, even if etcd is compressed. So how is this /upstream reload triggered?

@moonming moonming moved this to 🏗 In progress in Apache APISIX backlog Nov 20, 2023
@wodingyang
Copy link
Author

image
image
Based on the information you provided, it appears that the issue is related to the interaction between the "config_etcd.lua" and "upstream.lua" files in APIsix.

When the line "will read the configuration again via readdir" is executed and outputted, it seems to trigger a condition check in the code: "if healthcheck_parent.checker and healthcheck_parent.checker_upstream == upstream". However, this condition check is evaluating to false.

To further investigate and resolve this issue, it would be helpful to analyze the specific code implementation in "config_etcd.lua" and "upstream.lua" related to this condition check. Additionally, it may be beneficial to review the etcd version 3.4.21 documentation for any potential known issues or updates related to this behavior.

@moonming
Copy link
Member

If the configuration of the control plane does not change, but etcd is compressed, the revision of etcd will not change. At this time, the APISIX data plane will not be configured synchronously, and the cache will not be cleared.
However, if etcd compression and configuration updates are performed at the same time, and the APISIX data plane finds that a certain revision of etcd cannot be obtained, full synchronization will be triggered and the cache will be cleared: https://github.com/apache/apisix/blob/master/apisix/core/etcd.lua#L239-L245

To give an example: the etcd revision at 9 a.m. is 100. At 9:10, etcd starts to compress the data, and new changes are also written to etcd at the same time. The latest revision becomes 110. When the compression is completed, the data in revisions from 1 to 109 are gone.
During this process, the APISIX data plane may obtain revision 105, but revision 105 in etcd has been compressed.
Then APISIX will trigger full synchronization. This may be the problem you are encountering.

The solution can be to not clear all historical revisions when performing etcd compression, but to retain some of the most recent revisions, such as the most recent 1,000. Foe example use auto-compaction-retention = 1000 in etcd https://etcd.io/docs/v3.4/op-guide/maintenance/#auto-compaction

You can try this solution. If you have any questions, you can update here.

@wodingyang
Copy link
Author

The suggested configuration can only mitigate the issue but may not completely resolve it. We are currently puzzled by the fact that the "/apisix/upstreams" key has not changed, and our system automatically reports data for our custom keys. In theory, the listener for "/apisix/upstreams" should not be triggered. We are planning to reduce the compression frequency to once a day, but we still hope to identify and resolve this problem at its root.

@Sn0rt
Copy link
Contributor

Sn0rt commented Dec 18, 2023

@wodingyang I have create a test case for reproduce this issues. can you help me take a look ? if you agress my test case can you close this issues ? we will track this promble by another issue.

@sheharyaar
Copy link
Contributor

From APISIX side, can we cache the last revision to mitigate this issue ?

@wodingyang
Copy link
Author

@wodingyang I have create a test case for reproduce this issues. can you help me take a look ? if you agress my test case can you close this issues ? we will track this promble by another issue.

As long as etcd enables the compression policy, executes the compression, and notifies /apisix/upstreams to update the cache, this issue will be reproduced. Moreover, the notification to /apisix/upstreams is not only sent when modifying upstreams, I am also unclear under what circumstances etcd compresses the key and updates this.

@Sn0rt
Copy link
Contributor

Sn0rt commented Dec 19, 2023

OK. I will take over this issue. pls assign to me

@wodingyang
Copy link
Author

OK. I will take over this issue. pls assign to me

It seems like I don't have the permission to assign.

@sheharyaar
Copy link
Contributor

@shreemaan-abhishek , can you please assign me this ?

@jujiale
Copy link
Contributor

jujiale commented Jan 16, 2024

we also suffered this problem

@shreemaan-abhishek
Copy link
Contributor

could you please share your compation configurations?

@shreemaan-abhishek
Copy link
Contributor

I was able to force update the revision number during compression which led to complete reload of apisix resources in the cache. But this didn't lead to unhealthy TCP increment (n/N)for '(ip-foo:port-bar)'.

Steps to repro:

  • set etcd compression policy to 2 minutes : etcd --auto-compaction-retention=2m
  • start apisix and create upstreams and routes as described in issue description
  • force update the etcd revision:
while true
do
ectl put a b
done
  • keep sending requests to apisix (not sure if this is important):
while true
do
sleep 0.5
curl -i "http://127.0.0.1:9080/hello"
done
  • wait for the race condition to occur, you might wanna use wrk for faster result

@shreemaan-abhishek
Copy link
Contributor

shreemaan-abhishek commented Jan 16, 2024

We are currently puzzled by the fact that the "/apisix/upstreams" key has not changed,

@wodingyang, even if you do not change the "/apisix/upstreams" key, the etcd revision can be updated by modifying other keys.

@jujiale
Copy link
Contributor

jujiale commented Jan 16, 2024

at first, I add the upstream, like below: the 172.20.xxx.xxx is OK,the 172.30.xxx.xxx is bad.
`

{
"nodes": [
  {
    "host": "172.20.xxx.xxx",
    "port": 9113,
    "weight": 1
  },
  {
    "host": "172.30.xxx.xxx",
    "port": 9113,
    "weight": 1
  }
],
"timeout": {
  "connect": 6,
  "send": 6,
  "read": 6
},
"type": "roundrobin",
"checks": {
  "active": {
    "concurrency": 10,
    "healthy": {
      "http_statuses": [
        200,
        302
      ],
      "interval": 1,
      "successes": 2
    },
    "http_path": "/",
    "port": 9113,
    "timeout": 2,
    "type": "tcp",
    "unhealthy": {
      "http_failures": 5,
      "http_statuses": [
        429,
        404,
        500,
        501,
        502,
        503,
        504,
        505
      ],
      "interval": 1,
      "tcp_failures": 2,
      "timeouts": 3
    }
  }
},
"scheme": "http",
"pass_host": "pass",
"name": "test2",
"keepalive_pool": {
  "idle_timeout": 60,
  "requests": 1000,
  "size": 320
}

}

`

then I create a route which bind the upstream. and I send a request to this route, It returns the result as I expect, then I invoke http://apisix:9090/v1/healthcheck, the return is :

`

[{
    "src_type": "upstreams",
    "healthy_nodes": [{
	    "host": "172.20.xxx.xxx",
	    "priority": 0,
	    "port": 9113,
	    "weight": 1
    }],
    "src_id": "496411420712239919",
    "name": "upstream#\/test\/apisix\/upstreams\/496411420712239919",
    "nodes": [{
	    "host": "172.20.xxx.xxx",
	    "priority": 0,
	    "port": 9113,
	    "weight": 1
    }, {
	    "host": "172.30.xxx.xxx",
	    "priority": 0,
	    "port": 9113,
	    "weight": 1
    }]
}]

`

then I modify the etcd config auto-compaction-retention=1m after a while, the apisix log as following:
`

sync_data(): waitdir [/test/apisix/plugin_configs] err: compacted, will read the configuration again via readdir
...

`

then I invoke http://apisix:9090/v1/healthcheck the result returns an empty table {}

then I send a request to my test route, It also return the data I want, but the response code is 502,200

as the author say, he thinks that health check readded the unhealthy one, but I think the reason is that both healthy one and unhealthy one is deleted, this scenario seems like the same as when you create a route, but send no request. then send a request, the health check then start.

@wodingyang
Copy link
Author

This issue should be very clear. When ETCD enables key compression, each compression will cause part of the upstream cache to become invalid and reload. Moreover, the official heartbeat strategy does not trigger without user requests. When these two issues occur at the same time, problematic nodes will be added back to the upstream without being kicked out, leading to abnormal user requests. I would like to ask if the official will acknowledge and resolve this issue?

@membphis
Copy link
Member

  • set etcd compression policy to 2 minutes : etcd --auto-compaction-retention=2m
  • start apisix and create upstreams and routes as described in issue description

the best practice way:

When compressing etcd data, keep more recent historical records (such as 1k records). Then, the lua client will not get compact errors.

@Vacant2333
Copy link
Contributor

I think we have found the reason and how to reproduce this issue:

Reason:

In the code shown in the following figure, when Etcd undergoes a compaction action, it will result in the need_ reload is marked as true.
https://github.com/Vacant2333/apisix/blob/a18573a3b8775f5a92ba92b948e8212affbf7057/apisix/core/config_etcd.lua#L394
image

Then it will flush all cache related to this piece of content (including healthcheck of the nodes).
https://github.com/Vacant2333/apisix/blob/a18573a3b8775f5a92ba92b948e8212affbf7057/apisix/core/config_etcd.lua#L362
image

Q: Why it will happen(etcd compaction)?
A: Perhaps the user has set auto-compaction on ETCD or run etcdctl compaction X.

Q: Why APISIX needs reload when compaction happend?
A: Let me give an example:

  1. Start APISIX:
    The revisions for APISIX and Etcd are both 60, which means they are in a synchronized state.
  2. Added a Route through APISIX:
    At this time, the revision for APISIX is still 60, but the latest revision for Etcd is actually 61.
  3. Before APISIX sync the new revision, manually/automatically execute etcdctl compaction 61:
    At this point, Etcd cannot synchronize the content of revision 61 to APISIX because revisions 1-61 have already been Compact (i.e. snapshot in Raft), and APISIX can only choose full synchronization (which is usually done incrementally by adding revision). Therefore, full synchronization must be done.

Reporduce steps:

  1. Start APISIX and Etcd:
    APISIX 3.0 and below are easier to reproduce.

  2. Create a Route:

curl "http://127.0.0.1:9180/apisix/admin/routes/6" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '
{
  "methods": ["GET"],
  "host": "example.com",
  "uri": "/anything/*",
  "upstream": {
    "type": "roundrobin",
    "nodes": {
      "httpbin.org:80": 1
    }
  }
}'
  1. Get the newly revision and run compaction:
# Get the newly revision id
etcdctl endpoint status --write-out=json | jq .[0].Status.header.revision
# Run compaction
etcdctl compaction {Latest-revision}
  1. Wait 2 minutes:
    image

How to resolve this problem:

  1. Update APISIX to version 3.2 LTS at least.
  2. Avoid clearing the latest revisions (such as the latest 1000).

For now, i think the issue could close now, cc @shreemaan-abhishek

@wodingyang
Copy link
Author

This issue can be reproduced in version 3.6, so upgrading to any version above 3.2 will not resolve it.
As for setting the reserved revisions to 1000, although it can prevent this situation from occurring frequently, it does not mean that it will not happen 100% of the time. If there is no thorough solution to this problem, this method can be used as a workaround.

@membphis
Copy link
Member

As for setting the reserved revisions to 1000, although it can prevent this situation from occurring frequently, it does not mean that it will not happen 100% of the time. If there is no thorough solution to this problem, this method can be used as a workaround.

this issue can not be resolved by APISIX self.

the real reason:etcd latest revision history is compacted, before they has had time to be published to the gateway instance. we need to change the way about how to compact the etcd history record.

so the good way is:

keep the latest revision history for a while(maybe more than 10mins or 1 hour, it is all fine) when we compact etcd data.

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Apache APISIX backlog Jan 19, 2024
@Fijileijun
Copy link

This issue can be reproduced in version 3.6, so upgrading to any version above 3.2 will not resolve it. As for setting the reserved revisions to 1000, although it can prevent this situation from occurring frequently, it does not mean that it will not happen 100% of the time. If there is no thorough solution to this problem, this method can be used as a workaround.

Anyway, it's a good idea to get goal in roundabout way. 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working discuss
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

9 participants