KIC looping forever on Context Deadline exceeded #2440

Closed
aroundthecode opened this issue Apr 26, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@aroundthecode

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I have a Kong 2.7.1 + KIC 2.2.1 setup on GKE.
I have ~50 namespaces with 2 Ingress resources each, resulting in ~30 rules per namespace, so a total of ~1500 rules.
Some services also have plugins configured.

Once deployed, KIC starts to populate Kong, but it periodically fails with errors such as:
{"error":"2 errors occurred: failed to sync all entities: context deadline exceeded while processing event: {Update} upstream *******.8080.svc failed: making HTTP request: Put "http://kong-kong-admin.kong-system:8001/upstreams/1e7b3b64-fdbf-4f51-9c9e-2cb43172d92b": context deadline exceeded ", "level":"error", "msg":"could not update kong admin", "subsystem":"proxy-cache-resolver"}

On the Kong admin side, the matching error is:
[error] 1097#0: *54395 [lua] events.lua:364: post_local(): worker-events: dropping event; waiting for event data timed out

I've split my deployment into 3 standalone Kong proxy/admin pods + 3 KIC pods, using the Kong Services to spread load across the 3 admin nodes.

I've also tried scaling the Kong pods up to 5 and reducing KIC concurrency to 1, but the error still persists (the number in the error log is usually the concurrency number + 1).

Once KIC hits this error, it keeps looping over different service names, flooding the admin API with requests.

The Kong pods have no resource limits and each Kong pod is allocated on a different VM; neither CPU nor RAM is saturated.

Is there any other configuration tuning I can apply to avoid this loop?

Expected Behavior

KIC should be able to send the full configuration to Kong and stop looping.

Steps To Reproduce

No response

Kong Ingress Controller version

No response

Kubernetes version

No response

Anything else?

Both Kong and KIC are deployed via Helm with the following setup (concurrency has since been lowered to 1):

ingressController:
  installCRDs: false
  enabled: true
  env:
    kong_admin_tls_skip_verify: true
    kong_admin_url: "http://kong-kong-admin.kong-system:8001"
    publish_service: kong-system/kong-kong-proxy
    kong_admin_concurrency: 3
    log_format: json
    log_level: warn

and the following custom settings in the nginx template:

proxy_buffer_size       128k;
proxy_buffers           4 256k;
proxy_busy_buffers_size 256k;

This setup also suffers from #2422; I'm waiting for the next release to upgrade to 2.8.

@aroundthecode aroundthecode added the bug Something isn't working label Apr 26, 2022
@aroundthecode
Author

Adding some detail: KIC seems to be looping because of a computed diff on upstreams involving default values (which I never configured myself).

Note the minus signs in the trace below:

updating upstream ***********-8e2f7.8080.svc  {
   "algorithm": "round-robin",
   "hash_fallback": "none",
   "hash_on": "none",
   "hash_on_cookie_path": "/",
   "healthchecks": {
     "active": {
       "concurrency": 10,
       "healthy": {
         "http_statuses": [
           200,
           302
         ],
         "interval": 0,
         "successes": 0
       },
       "http_path": "/",
-      "https_verify_certificate": true,
       "timeout": 1,
       "type": "http",
       "unhealthy": {
         "http_failures": 0,
         "http_statuses": [
           429,
           404,
           500,
           501,
           502,
           503,
           504,
           505
         ],
         "interval": 0,
         "tcp_failures": 0,
         "timeouts": 0
       }
     },
     "passive": {
       "healthy": {
         "http_statuses": [
           200,
           201,
           202,
           203,
           204,
           205,
           206,
           207,
           208,
           226,
           300,
           301,
           302,
           303,
           304,
           305,
           306,
           307,
           308
         ],
         "successes": 0
       },
-      "type": "http",
       "unhealthy": {
         "http_failures": 0,
         "http_statuses": [
           429,
           500,
           503
         ],
         "tcp_failures": 0,
         "timeouts": 0
       }
     },
-    "threshold": 0
   },
   "id": "1bdd4aa9-ab7c-45ff-a423-c626a00eb29f",
   "name": "console-webapp.pzucchett-8e2f7.8080.svc",
   "slots": 10000,
   "tags": [
     "managed-by-ingress-controller"
   ]
 }

Once I added a KongIngress to set the "missing" values, the loop on the upstreams ended and KIC started looping on Service items with the same "diff" logic (a sketch of such a KongIngress follows the diff below):

updating service **********.pnum-8080  {
   "connect_timeout": 60000,
-  "enabled": true,
   "host": "**********-fa3dd.8080.svc",
   "id": "413fe3bf-dd7f-4e6f-b315-5110184c0f4d",
   "name": "**********.pnum-8080",
   "path": "/api",
   "port": 80,
   "protocol": "http",
   "read_timeout": 60000,
   "retries": 5,
   "tags": [
     "managed-by-ingress-controller"
   ],
   "write_timeout": 60000
 }
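
Something along these lines is what I mean by adding a KongIngress to pin the defaults (a sketch only: the resource name and namespace are placeholders, and which fields the upstream section accepts may vary by KIC version):

apiVersion: configuration.konghq.com/v1
kind: KongIngress
metadata:
  name: pin-upstream-defaults   # placeholder name
  namespace: my-namespace       # placeholder namespace
upstream:
  # Explicitly set the values that kept appearing with a minus sign in the diff,
  # so the desired state matches what Kong reports back.
  healthchecks:
    threshold: 0
    active:
      https_verify_certificate: true
    passive:
      type: http

attached to each Service with the konghq.com/override: pin-upstream-defaults annotation.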

@aroundthecode
Author

Hi there, I managed to update to Kong 2.8.1 + KIC 2.3.1 and all the loops/traces completely disappeared.
No changes were made to the Ingress or Service configuration.

After a week of struggling, I really think Kong 2.7.* + KIC 2.2.* should be considered an unstable release and removed from the available downloads!

@rainest
Contributor

rainest commented Apr 29, 2022

Increasing CONTROLLER_PROXY_TIMEOUT_SECONDS to a value higher than the default (10) can avoid that error if it's happening under normal circumstances. Best guess is that the database backing this instance wasn't able to handle the amount of load generated by the upstream thrashing, which is fixed in 2.3.
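
With the ingressController.env layout from the values above (where keys are rendered with the CONTROLLER_ prefix, as with kong_admin_url), that would look roughly like this; 30 is only an example value, tune it for your environment:

ingressController:
  env:
    # Rendered as CONTROLLER_PROXY_TIMEOUT_SECONDS on the controller container;
    # the default is 10 seconds.
    proxy_timeout_seconds: "30"   # example value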

We do not remove older images because doing so would break instances that use them unexpectedly if, say, the scheduler tried to start a replica on a worker without a cached copy of that image. Not all users are equally affected by all issues. Many did run fine despite the unnecessary updates; those were a fairly long-standing issue that we only recently acquired the means to fix effectively.

@rainest rainest closed this as completed Apr 29, 2022