bug: In the case where heartbeat checking is enabled, unhealthy nodes are inexplicably added back and then kicked out again. #10500
Comments
@wodingyang Are there any steps to reproduce? For example, does etcd have a compaction policy enabled, etc.? |
v2.15.3 is very old and not in LTS. I could not reproduce this with the latest version. I recommend you upgrade to the latest release. |
@moonming Indeed, we have found that enabling compaction in etcd can cause this issue. |
@shreemaan-abhishek We have confirmed that this is a hidden bug caused by enabling compaction in etcd. The active check itself is not the problem, but when etcd compaction is enabled, it triggers a reload of the /upstream key that APISIX is watching, causing the original checker to be discarded. Even the latest version of etcd still exhibits this issue. |
The revision of etcd should not change, even if etcd is compacted. So how is this /upstream reload triggered?
If the configuration of the control plane does not change, then even if etcd is compacted, the revision of etcd will not change. In that case the APISIX data plane will not re-sync its configuration, and the cache will not be cleared. To give an example: the etcd revision at 9 a.m. is 100. At 9:10, etcd starts to compact the data while new changes are also being written to etcd, so the latest revision becomes 110. When the compaction is completed, the data for revisions 1 to 109 is gone. The solution can be to not clear all historical revisions when performing etcd compaction, but to retain some of the most recent revisions, such as the most recent 1,000 (for example, via etcd's revision-based auto-compaction). You can try this solution. If you have any questions, you can update here. |
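A minimal sketch of that suggestion, assuming etcd with revision-based auto-compaction available; the retention value of 1000 revisions is only an illustration, not a recommendation from this thread:

```bash
# Keep the most recent 1000 revisions on every auto-compaction run instead of
# discarding all history, so the revision APISIX is watching stays resolvable.
etcd --auto-compaction-mode=revision \
     --auto-compaction-retention=1000
```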
The suggested configuration can only mitigate the issue but may not completely resolve it. We are currently puzzled because the "/apisix/upstreams" key has not changed, while our system automatically reports data to our own custom keys; in theory, the watcher for "/apisix/upstreams" should not be triggered. We plan to reduce the compaction frequency to once a day, but we still hope to identify and resolve this problem at its root. |
@wodingyang I have created a test case to reproduce this issue. Can you help me take a look? If you agree with my test case, can you close this issue? We will track this problem in another issue. |
From the APISIX side, can we cache the last revision to mitigate this issue? |
As long as etcd enables the compaction policy, executes the compaction, and notifies /apisix/upstreams to update the cache, this issue will be reproduced. Moreover, the notification to /apisix/upstreams is not only sent when modifying upstreams; I am also unclear under what circumstances etcd compacts the key and triggers this update. |
OK. I will take over this issue. Please assign it to me. |
It seems like I don't have the permission to assign. |
@shreemaan-abhishek, can you please assign this to me? |
We also suffered from this problem. |
Could you please share your compaction configuration? |
I was able to force-update the revision number during compaction, which led to a complete reload of APISIX resources in the cache. But this did not lead to the behavior described in this issue. Steps to repro:
```bash
# Terminal 1: keep writing keys so the etcd revision keeps advancing
while true
do
  etcdctl put a b
done
```

```bash
# Terminal 2: keep sending requests through the route so the health checker is exercised
while true
do
  sleep 0.5
  curl -i "http://127.0.0.1:9080/hello"
done
```
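One way to actually trigger the compaction while the two loops above are running (this step is my assumption, not part of the original repro notes) is to compact etcd manually; the jq field path may vary slightly between etcdctl versions:

```bash
# Compact up to the current revision; any watcher pinned to an older revision
# will then receive a "required revision has been compacted" error.
rev=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
etcdctl compact "$rev"
```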
|
@wodingyang, even if you do not change the "/apisix/upstreams" key, the etcd revision can be updated by modifying other keys. |
At first, I add the upstream like below (configuration omitted); the 172.20.xxx.xxx node is OK, and the 172.30.xxx.xxx node is bad. Then I create a route which binds the upstream, and I send a request to this route; it returns the result I expect. Then I invoke http://apisix:9090/v1/healthcheck, and it returns the expected health-check status (output omitted). Then I modify the etcd config to auto-compaction-retention=1m. After a while, the apisix log is as follows (log omitted). Then I invoke http://apisix:9090/v1/healthcheck again and the result is an empty table {}. Then I send a request to my test route; it still returns the data I want, but the response codes are 502 and then 200. The author thinks the health check re-added the unhealthy node, but I think the reason is that both the healthy and the unhealthy node were deleted. This scenario seems the same as when you create a route but send no requests: once a request is sent, the health check starts. |
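For clarity, the aggressive compaction setting mentioned above corresponds to something like the following etcd startup flags (a sketch; the exact flag syntax depends on how etcd is launched in your environment):

```bash
# Periodic auto-compaction roughly every minute: after each run, revisions
# older than the retention window are dropped, including the one APISIX watches.
etcd --auto-compaction-mode=periodic \
     --auto-compaction-retention=1m
```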
This issue should be very clear. When etcd enables key compaction, each compaction causes part of the upstream cache to become invalid and be reloaded. Moreover, the active heartbeat check does not run without user requests. When these two issues occur at the same time, problematic nodes are added back to the upstream without being kicked out, leading to abnormal user requests. I would like to ask whether the maintainers will acknowledge and resolve this issue. |
The best-practice way: when compacting etcd data, keep more of the recent historical revisions (such as the last 1k). Then the lua client will not get compaction errors. |
I think we have found the reason and how to reproduce this issue. Reason: in the code shown in the referenced figure (omitted here), when the revision APISIX is watching has been compacted, it will flush all cache related to this piece of content (including the healthcheck of the nodes). Q: Why will it happen (etcd compaction)? Q: Why does the flush affect the checker? (The detailed answers were given in screenshots, omitted here.)
Reproduce steps:
curl "http://127.0.0.1:9180/apisix/admin/routes/6" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '
{
"methods": ["GET"],
"host": "example.com",
"uri": "/anything/*",
"upstream": {
"type": "roundrobin",
"nodes": {
"httpbin.org:80": 1
}
}
}'
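To observe the effect on the health checker, one can dump its state via the control API before and after a compaction; a sketch assuming the default control API address 127.0.0.1:9090 (the same endpoint used earlier in this thread) and the route defined above:

```bash
# Send a request so the health checker for the route is created, then dump its state.
curl -i "http://127.0.0.1:9080/anything/get" -H "Host: example.com"
curl "http://127.0.0.1:9090/v1/healthcheck"

# After a compaction invalidates the watched revision, the same healthcheck call
# returns an empty table until a new request re-creates the checker.
```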
How to resolve this problem:
For now, I think this issue could be closed, cc @shreemaan-abhishek |
This issue can be reproduced in version 3.6, so upgrading to any version above 3.2 will not resolve it. |
This issue cannot be resolved by APISIX itself. The real reason is on the etcd side: compaction removes the revision APISIX is watching and forces a full re-sync. So the good way is: keep the most recent revisions when compacting etcd. |
Anyway, it's a good idea to reach the goal in a roundabout way. 😄 |
Current Behavior
Configure two nodes in the upstream, one healthy and one unhealthy, and enable active heartbeat checking. Then initiate a user request; once the heartbeat health check is running, the unhealthy node is kicked out. After waiting for a while, the unhealthy node is observed to be added back and then evicted again. During this waiting period, the unhealthy node is added back regardless of whether there are user requests. In our testing, we disabled retries. After a continuous series of requests, the unhealthy node was observed to be added back, the testing system received a significant number of 502 errors, and the API gateway logs showed that the 502 requests had been forwarded to the unhealthy node.
Expected Behavior
We attempted to inspect the source code and found in upstream.lua that the first time the heartbeat check is enabled, it calls `healthcheck.new()`, and subsequent requests go through the logic of `if healthcheck_parent.checker and healthcheck_parent.checker_upstream == upstream then`. However, after a certain period of time, this `if` condition no longer holds: a new checker is created and the previous checker is deleted. We are unsure about the reason behind this `if` statement and would appreciate assistance in identifying the issue. The time until unhealthy nodes are added back is highly variable, sometimes around one minute and other times around five minutes. Is there any caching or other mechanism involved in this process?
Error Logs
No response
Steps to Reproduce
Step 1: Configure the upstream and the route (see the example sketch after these steps)
Step 2: Continuously initiate user requests
Step 3: Check the logs and observe the unhealthy node being kicked out (log screenshot omitted)
Step 4: After a certain period of time, observe the same node being added back and kicked out again (log screenshot omitted)
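For reference, a minimal sketch of what Step 1 could look like; the node addresses, health-check endpoint, and thresholds are assumptions, not the reporter's actual values, and on the 2.15.x version reported here the admin API defaults to port 9080 rather than 9180:

```bash
# Hypothetical Step 1: an upstream with one healthy and one unhealthy node,
# active health checking enabled, plus a route bound to it.
curl "http://127.0.0.1:9180/apisix/admin/upstreams/1" \
  -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '
{
  "type": "roundrobin",
  "nodes": {
    "172.20.0.10:80": 1,
    "172.30.0.10:80": 1
  },
  "checks": {
    "active": {
      "type": "http",
      "http_path": "/health",
      "healthy": { "interval": 1, "successes": 2 },
      "unhealthy": { "interval": 1, "http_failures": 2 }
    }
  }
}'

# Route used by the request loop in Step 2 (matches the /hello curl shown earlier).
curl "http://127.0.0.1:9180/apisix/admin/routes/1" \
  -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '
{
  "uri": "/hello",
  "upstream_id": "1"
}'
```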
Environment
- APISIX version (`apisix version`): 2.15.3
- Operating system (`uname -a`): CentOS 7
- OpenResty / Nginx version (`openresty -V` or `nginx -V`): 1.21.4.1