After updating deployment, controller failed to sync pod to kong #1025

Closed
yuyulei opened this issue Feb 3, 2021 · 13 comments

Labels: pending author feedback, stale (Will be closed unless advocated for within 7 days)

yuyulei commented Feb 3, 2021

Summary

SUMMARY_GOES_HERE

Kong Ingress controller version
v1.1.0

Kong or Kong Enterprise version
unknown

Kubernetes version

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:30:33Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.9-aliyun.1", GitCommit:"4f7ea78", GitTreeState:"", BuildDate:"2020-05-08T07:29:59Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}


Environment

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"


What happened

After updating the deployment, the old pod was deleted and a new pod was created, but the targets of the upstream are missing. I found these controller logs:

time="2021-02-03T05:39:18Z" level=info msg="syncing configuration" component=controller
time="2021-02-03T05:39:18Z" level=error msg="failed to update kong configuration: 1 errors occurred:\n\twhile processing event: {Create} failed: Not found\n" component=controller
time="2021-02-03T05:39:18Z" level=error msg="failed to sync: 1 errors occurred:\n\twhile processing event: {Create} failed: Not found\n" component=sync-queue
time="2021-02-03T05:39:18Z" level=warning msg="requeuing sync for 'default/cli-im-net'" component=sync-queue

Then I deleted the pods of the deployment (cli-im-net); new pods were created and the new targets appeared.

I also found some strange logs from Kong, like:

2021/02/01 09:11:47 [warn] 21#0: *3 [lua] ring.lua:246: redistributeIndices(): [upstream:cli-im-net.default.80.svc 13] redistributed indices, size=10000, dropped=10000, assigned=0, left unassigned=10000, context: ngx.timer

Maybe it helps.

Expected behavior

I expect consistency between Kubernetes pod operations (like create/delete) and Kong targets.

Steps To Reproduce

It happens randomly; some update operations can reproduce it.


yuyulei commented Feb 4, 2021

I found a pod that did not appear in Konga, then tried to add it via the admin API and succeeded. This scenario should not happen, because the ingress controller should do it.

(screenshot attached)


rainest commented Feb 19, 2021

What do you see in the proxy container access logs? Can you confirm there's a "POST /upstreams/UPSTREAM_ID/targets HTTP/1.1" 404 request specifically? If so, do you see a successful "PUT /upstreams/UPSTREAM_ID HTTP/1.1" 200 request for the same ID earlier?

The error the controller gets back won't indicate what wasn't found, but 404s in the access log will.
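For reference, one quick way to check this from the cluster (a sketch only; the pod name and the "proxy" container name are placeholders for your install):

kubectl logs <kong-pod> -c proxy | grep 'POST /upstreams/.*/targets HTTP/1.1" 404'
kubectl logs <kong-pod> -c proxy | grep 'PUT /upstreams/.* HTTP/1.1" 200'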

What was the deployment update? Did you update the Kong image version from 2.1.x to 2.2.x?

We received a similar report following a 2.1 to 2.2 upgrade elsewhere, but still aren't sure of the root cause: the apparently successful creation/update of the upstream should mean that the target POSTs afterward succeed.

@Starefossen

We are seeing similar behaviour as reported. One of our Kong Ingress Controller pods fails with failed to update kong configuration from time to time. This results in routes not being updated and some requests failing.

 cat kong-controller.log | grep "failed to update kong configuration"
time="2021-02-25T09:58:12Z" level=error msg="failed to update kong configuration: posting new config to /config: making HTTP request: Post \"https://localhost:8444/config?check_hash=1\": unexpected EOF" component=controller
time="2021-03-12T10:31:48Z" level=error msg="failed to update kong configuration: posting new config to /config: making HTTP request: Post \"https://localhost:8444/config?check_hash=1\": unexpected EOF" component=controller
time="2021-03-16T09:02:51Z" level=error msg="failed to update kong configuration: posting new config to /config: making HTTP request: Post \"https://localhost:8444/config?check_hash=1\": dial tcp 127.0.0.1:8444: connect: connection refused" component=controller
time="2021-03-16T09:02:55Z" level=error msg="failed to update kong configuration: posting new config to /config: making HTTP request: Post \"https://localhost:8444/config?check_hash=1\": dial tcp 127.0.0.1:8444: connect: connection refused" component=controller
time="2021-03-16T09:02:58Z" level=error msg="failed to update kong configuration: posting new config to /config: making HTTP request: Post \"https://localhost:8444/config?check_hash=1\": dial tcp 127.0.0.1:8444: connect: connection refused" component=controller
time="2021-03-16T15:47:10Z" level=error msg="failed to update kong configuration: posting new config to /config: making HTTP request: Post \"https://localhost:8444/config?check_hash=1\": unexpected EOF" component=controller
time="2021-03-17T21:00:05Z" level=error msg="failed to update kong configuration: posting new config to /config: making HTTP request: Post \"https://localhost:8444/config?check_hash=1\": unexpected EOF" component=controller

There are no logged PUT requests in the proxy container.


rainest commented Mar 18, 2021

@Starefossen your issue is something different.

The original issue can only happen in DB-backed mode, where the requests to add upstreams and targets are separate. When the issue occurs, the target request fails, because the upstream is (supposedly) not present, even though the request to add it succeeded.
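For context, in DB-backed mode those are two separate admin API requests per upstream, roughly like the following (a sketch against a local admin API on port 8001; the target address is a placeholder):

curl -s -X PUT http://localhost:8001/upstreams/cli-im-net.default.80.svc
curl -s -X POST http://localhost:8001/upstreams/cli-im-net.default.80.svc/targets -d 'target=10.0.0.1:8080'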

There's no such separation in DB-less mode, where the POST /config requests contain the entire configuration in a single request. Your failure message indicates that the admin API listen isn't sending a proper HTTP response (unexpected EOF) or isn't open at all (connection refused). Other errors around that time in the proxy container logs pointing to something crashing or resource exhaustion would be my initial guesses for that sort of failure.
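A quick way to check for that (again a sketch; the label selector and pod/container names are placeholders for your release) is to look at restart counts, termination reasons, and logs from the previous proxy container instance:

kubectl get pods -l app.kubernetes.io/name=kong           # watch the RESTARTS column
kubectl describe pod <kong-pod> | grep -iA3 'last state'  # e.g. OOMKilled
kubectl logs <kong-pod> -c proxy --previous               # logs from before a crash, if any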


cpcostaneves commented Mar 19, 2021

Updated March 22 with the root causes and how to solve it (at least in our case)

We are having a similar issue quite often. After a redeployment, the ingress controller is unable to post to Kong the target IPs for one specific upstream (different each time). This leads the ingress controller into a loop trying to resync, and sometimes the controller even stops logging and updating IPs for all services/pods, requiring us to restart the Kong pods.
Kong logs show the HTTP POST to upstream/targets returning 404. Ingress controller logs show "{Create} failed: Not found".

Kong Ingress controller version: 1.1
Kong version: 2.2.0
Helm chart: kong-1.14.2 (repo https://charts.konghq.com)
Kubernetes version: 1.17.15
Environment: GKE , Container-Optimized OS with Docker ( cos )
Number of replicas: 2
Database: postgresql 9.6

We also see the same issue in an on-premises environment: Kubernetes 1.19, Rancher RKE, Ubuntu 20.04, Docker 19, PostgreSQL 11.

What happened
Actions triggering a pod IP change (helm upgrade, kubectl rollout restart) may trigger this issue. The ingress controller is not able to post to Kong the new target IPs for one specific upstream and enters a loop trying to resync everything. Kong logs show the HTTP POST to upstream/targets returning 404. Ingress controller logs show "{Create} failed: Not found". Sometimes the controller stops logging anything and stops updating IPs for all services/pods.
On the first occurrences, after a restart of the Kong pods, the ingress controller started normally and resynced all targets. The first subsequent updates were successful, but after a while the issue happened again and another Kong restart was required.
But at one point the issue was happening at ingress controller startup, even after multiple restarts. Every time it entered a resync loop and hung (no more logs even after hours).

Causes and solution (at least in our case)

  • On Kong 2.1, when a target was deleted, Kong created another target entry in Postgres with weight=0 (meaning disabled) while still leaving the original target there. A new target with the same IP:port as the disabled one could be added successfully.
  • Kong 2.2 changed this to actually delete the target from Postgres, and it no longer accepts posting a target with an IP:port that is already in the database.
  • The migration job from 2.1 to 2.2 did not remove all targets with weight=0. It removed the original disabled targets and some targets with weight=0, but some were still left in the database.
  • When the ingress controller tried to post a target with the same IP:port already in the database (with weight=0), Kong returned the 404 Not Found.
  • The solution is to delete all old targets with weight=0 from Postgres.

This is the cause and solution for the '{Create} failed' / POST 404 issue. But it is still unknown why the ingress controller, when kept for a long time in the resync loop (failing with the POST 404), sometimes stopped logging and resyncing everything, requiring a Kong restart.
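To confirm whether leftover zero-weight targets are what you are hitting, a minimal check directly in the Kong database (assuming psql access to whatever database Kong is configured against) is:

select id, target, weight, tags from targets where weight = 0;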

Notes

  • Kong pods were healthy all the time (used capacity was fine). Even when the ingress controller stopped logging, I was able to bash into the container and see the process was running.
  • When the ingress controller was in the resync loop, it was possible to manually post an IP to the specific upstream using the admin API (it returned 201 Created).
  • When the ingress controller hung we couldn't see any relevant log (usually the last log was msg="syncing configuration" component=controller).
  • In Dec 2020 we upgraded Kong from 2.1 to 2.2 and the ingress controller from 0.10 to 1.0. Afterwards, we started to see in the ingress controller logs an error when creating/updating the upstream itself (a formatting error):
    level=error msg="failed to update kong configuration: 1 errors occurred:\n\twhile processing event: {Update} failed: making HTTP request: Put \"https://localhost:8444/upstreams/903c420f-ae6d-4f04-8656-89650ee70d9a\": unexpected EOF\n" component=controller
    level=error msg="failed to sync: 1 errors occurred:\n\twhile processing event: {Update} failed: making HTTP request: Put \"https://localhost:8444/upstreams/903c420f-ae6d-4f04-8656-89650ee70d9a\": unexpected EOF\n" component=sync-queue

History

  • No issues with Kong and the ingress controller.
  • Dec 2020: upgraded Kong from 2.1 to 2.2 and the ingress controller from 0.10 to 1.0.
  • Started to see "{Update} failed: making HTTP request..." in the logs, but the ingress controller was always able to resync.
  • Feb 8, 2021: the Kong chart was upgraded from 1.12.0 to 1.14.2 and the controller from 1.0 to 1.1.
  • We continued to see "{Update} failed: ...", but the ingress controller was always able to resync.
  • Mar 8, 2021: added 2 more ingresses (a Node.js frontend and Konga) and configured the Prometheus plugin using a CRD (previously it was added manually).
  • Mar 15, 2021: the issue was seen for the first time.
  • The issue became frequent in the following days, affecting 2 upstreams.

Logs (using debug level and --v=3)
https://gist.github.com/cpcostaneves/9e84c5fbbaf5b994ce91ff901ce99799

Steps To Reproduce

Lots of ingresses configured (13); deploy Kong 2.1, then upgrade to 2.2. We created a bash script with a for loop performing rollout restarts for 3 services, 1 operation at a time, waiting for each to complete before proceeding to the next (see the sketch below).
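A sketch of that restart loop (the deployment names and iteration count below are placeholders; each restart waits for the rollout to finish before moving on):

for i in $(seq 1 100); do
  for svc in service-a service-b service-c; do
    kubectl rollout restart deployment "$svc"
    kubectl rollout status deployment "$svc" --timeout=300s
  done
done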


rainest commented Mar 23, 2021

Interesting. While I can see that we didn't remove those targets during the 2.2 upgrade, I'm still unsure exactly when these 404s happen. It doesn't appear that simply having zero-weight target(s) causes this; there must be some other condition necessary to generate the 404s.

The workaround to delete the offending targets is reasonable, as they're not bound to anything else in the database (only to runtime health status data), so, in Postgres:

delete from targets where weight=0;

will clear them.
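If the database runs in-cluster, one way to run that (a sketch only; the Postgres pod name, user, and database name are placeholders, and you may need to supply credentials depending on your setup) is:

kubectl exec -it kong-postgresql-0 -- psql -U kong -d kong -c "select count(*) from targets where weight = 0;"
kubectl exec -it kong-postgresql-0 -- psql -U kong -d kong -c "delete from targets where weight = 0;"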

To try and artificially replicate this, I tried:

  • Adding a zero-weight target and then a nonzero-weight target. Both succeeded.
  • Adding a zero-weight target and then a nonzero-weight target with the same IP. This resulted in a 409 for the second, not a 404.
  • Adding a larger number of zero-weight targets (200) to try and catch some sort of pagination issue before adding a nonzero-weight target.

There is some legacy logic to update targets via POST that I thought might be the culprit, but I'm unsure when that's actually used or whether/why it would 404: my attempt to use the same IP was what I thought would use that, but as it didn't trigger a 404, there's still something missing if that's related.

@cpcostaneves what was the replica count of your largest affected upstream Deployment, and how many times did you restart it on 2.1, if you recall? My choice of 200 was arbitrary; this may require a larger number of zero-weight targets if their number is related.


cpcostaneves commented Mar 23, 2021

@rainest the ingress controller will not be aware of the zero-weight targets, since their tag is also null, and it will use a different id when trying to post (this is actually the real issue).

So after adding a zero-weight target, try to add it again with the same IP:port but a different id.

You can try the following (replace the upstream id with one you have, and replace the Kong admin API address):

curl -X "POST" -H "Content-Type: application/json" "http://127.0.0.1:8001/upstreams/ba79200c-b599-417c-8017-b5d9ea2d5e8b/targets" -d '{"id":"fb3d3a23-2ffb-4343-bf05-d2a74c387c57","target":"10.42.74.219:8080","upstream":{"id":"ba79200c-b599-417c-8017-b5d9ea2d5e8b"},"weight":0,"tags":null}'
curl -X "POST" -H "Content-Type: application/json" "http://127.0.0.1:8001/upstreams/ba79200c-b599-417c-8017-b5d9ea2d5e8b/targets" -d '{"id":"fb3d3a23-2ffb-4343-bf05-d2a74c387c58","target":"10.42.74.219:8080","upstream":{"id":"ba79200c-b599-417c-8017-b5d9ea2d5e8b"},"weight":100,"tags":["managed-by-ingress-controller"]}'

Note that I've changed the target id (last digit).

The response I get on Kong 2.2 for the second command is a 404 with:
{"message":"could not find the entity with primary key '{id=\"fb3d3a23-2ffb-4343-bf05-d2a74c387c57\"}'","name":"not found","fields":{"id":"fb3d3a23-2ffb-4343-bf05-d2a74c387c57"},"code":6}

One suggestion for the ingress controller is to handle all stored targets, not filtering by tag, since it is the owner of the upstream and should handle all targets for that upstream. It could then delete offending or nonsensical targets before posting the new ones. This would prevent issues in edge cases, such as a user manually adding a target id.

The largest upstream I had was 3 replicas, and we redeployed around 200 times with Kong 2.1 (good estimation :) ).


rainest commented Mar 24, 2021

Ah, okay, I think I see what I did now: the 409s are apparently a weird (and, for the controller, irrelevant) corner case that only appears if you attempt to provide a target without a port, which I'd done accidentally when trying to replicate this without the controller. I'd attempted to POST a target like:

{"upstream":{"id":"e85f61df-32f9-4634-a9ef-f99b067ec4f1"},"target":"127.0.0.3","weight":100}

Kong fills in a default port (8000) if none is provided, so that effectively tries to add "target":"127.0.0.3:8000". That apparently short-circuits the existing target lookup, so if there is a {"target":"127.0.0.3:8000","weight":0} already, the lookup won't find it, it's considered a new target, the insert fails because of the unique constraint, and you get a 409.

I hadn't realized the controller insert logic included the target ID in the request body. That appears to be key to the 404 case, though I'm not exactly sure where we generate the ID after an initial review of the KIC and decK target code. Edit: the controller does not set any target IDs, but decK generates them randomly if it cannot find an existing target. It could theoretically get away from this, but we can't use the IP+port combo alone--it'd need to search by upstream ID also.

Requests that do not provide an ID are accepted, and update the existing zero-weight target:

$ curl -sX "POST" -H "Content-Type: application/json" "http://127.0.0.1:8001/upstreams/37dd4d64-a268-4efa-ad43-b7118a420703/targets" -d '{"target":"10.42.74.215:8030","upstream":{"id":"37dd4d64-a268-4efa-ad43-b7118a420703"},"weight":0,"tags":null}' 
{"created_at":1616619816.172,"id":"0752eae0-d74b-4ea8-9513-21a2caa4cb66","tags":null,"weight":0,"target":"10.42.74.215:8030","upstream":{"id":"37dd4d64-a268-4efa-ad43-b7118a420703"}}

$ curl -sX "POST" -H "Content-Type: application/json" "http://127.0.0.1:8001/upstreams/37dd4d64-a268-4efa-ad43-b7118a420703/targets" -d '{"target":"10.42.74.215:8030","upstream":{"id":"37dd4d64-a268-4efa-ad43-b7118a420703"},"weight":100,"tags":null}' 
{"created_at":1616619816.172,"id":"0752eae0-d74b-4ea8-9513-21a2caa4cb66","tags":null,"weight":100,"target":"10.42.74.215:8030","upstream":{"id":"37dd4d64-a268-4efa-ad43-b7118a420703"}}%            

If you do provide a target ID, then you get a 404 as shown in the previous comment. Intuitively, it seems like the legacy logic is searching for that IP+port combination, finding the existing zero-weight target, but then (if you provided an ID) falling back on some of the "zero-weight targets are invisible" logic. Yeesh.


stale bot commented Feb 18, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Feb 18, 2022
stale bot closed this as completed on Feb 26, 2022
@adapasuresh

I am having similar issues.

@adapasuresh

My dev team has written a Java-based microservice which returns 200 and 40X based on certain header values.
I have used this as the introspection_endpoint and am deploying the KongPlugin.

When this KongPlugin is used in another microservice, it does not report anything, and after some time it just gives me a 504 - Bad Gateway response.

=====
My env is Kong with a PostgreSQL DB in an AWS EKS cluster.

sureshadapa@localhost kongplugins % kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.14-eks-18ef993", GitCommit:"ac73613dfd25370c18cbbbc6bfc65449397b35c7", GitTreeState:"clean", BuildDate:"2022-07-06T18:06:50Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

helm.sh/chart: kong-3.5.0
image: docker.io/bitnami/kong:2.3.3-debian-10-r43
helm.sh/chart: postgresql-10.3.18
image: bitnami/postgresql:11.11.0-debian-10-r71
imageID: docker-pullable://bitnami/postgresql@sha256:e1e915f72f648f55356b53cc5a4bc2f0af34b27bb202cd61eb94801c20eaf883


I shall also try deleting the records with weight=0:
delete from targets where weight=0;

@adapasuresh

The weight is 100 for all rows in my "targets" table, though. I am blocked and unable to proceed further.


xkkker commented Apr 21, 2023

Same problem. The ingress controller didn't update the upstream after the deployment rollout.
