Race condition(?) when running clustered in Kubernetes #2125

Closed
DonMartin76 opened this issue Feb 24, 2017 · 15 comments

Comments

@DonMartin76

Summary

When running several (3+) Kong instances at once in a clustered environment inside Kubernetes, and I kill all associated containers/pods (for one reason or another, such as draining nodes), I sometimes end up in a "split brain" kind of cluster state: one Kong instance thinks it is alone in a cluster, while the other two think they make up a cluster of two (when checking inside the containers using the kong CLI).

It's not always reproducible, but once in a while this happens, and the effect is that some changes made via the admin API do not propagate to the other "side", and calls to the configured APIs are met with a 503 ("No API defined for...") instead of the real response.

The workaround for this is to not let Kubernetes start more than one Kong instance at once; if I scale up "slowly" (one step roughly every 5 seconds), this problem does not occur.
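
In practice that "slow" scale-up can be done step by step with kubectl; a minimal sketch, assuming the Kong Deployment is simply named kong (not my actual manifest names):

kubectl scale deployment kong --replicas=1
sleep 5
kubectl scale deployment kong --replicas=2
sleep 5
kubectl scale deployment kong --replicas=3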

If I don't "restart" the entire Kong cluster at once, this does not occur; it seems to be some kind of race condition when two Kong instances both claim to be the "first" instance in the cluster. In the cases where this does not happen, everything works perfectly.

Is there something I might be doing wrong, something I can check, and/or something I can do to make this not happen?

Additional Details & Logs

  • Kong 0.9.9
  • Operating System: Official Docker image
@shashiranjan84
Contributor

@DonMartin76 Can you check which cluster_listening_address is saved in the DB? You may have to log into the backing DB pod and query the nodes table.
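
A quick way to do that (a sketch; the pod name postgres-0 and the kong database/user names are assumptions, adjust them to your setup):

kubectl exec -it postgres-0 -- psql -U kong -d kong \
  -c "SELECT name, cluster_listening_address, created_at FROM nodes;"

Each live Kong node should appear exactly once, with a routable pod IP as its cluster_listening_address.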

Also, please try explicitly setting KONG_CLUSTER_LISTEN with <pod_ip:serf_port>, as in the following template:

apiVersion: v1
kind: Service
metadata:
  name: kong-proxy
spec:
  type: LoadBalancer
  loadBalancerSourceRanges:
  - 0.0.0.0/0
  ports:
  - name: kong-proxy
    port: 8000
    targetPort: 8000
    protocol: TCP
  - name: kong-proxy-ssl
    port: 8443
    targetPort: 8443
    protocol: TCP
  selector:
    app: kong

---
apiVersion: v1
kind: Service
metadata:
  name: kong-admin
spec:
  type: LoadBalancer
  loadBalancerSourceRanges:
  - 0.0.0.0/0
  ports:
  - name: kong-admin
    port: 8001
    targetPort: 8001
    protocol: TCP
  selector:
    app: kong

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kong-rc
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: kong-rc
        app: kong
    spec:
      containers:
      - name: kong
        image: mashape/kong:0.10.0rc3
        env:
          - name: KONG_PG_PASSWORD
            value: kong
          - name: KONG_PG_HOST
            value: postgres.default.svc.cluster.local
          - name: KONG_DNS_RESOLVER
            value: 10.100.0.10
          - name: KONG_HOST_IP
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: status.podIP
        command: [ "/bin/sh", "-c", "KONG_CLUSTER_ADVERTISE=$(KONG_HOST_IP):7946 KONG_NGINX_DAEMON='off' kong start && env" ]
        ports:
        - name: admin
          containerPort: 8001
          protocol: TCP
        - name: proxy
          containerPort: 8000
          protocol: TCP
        - name: proxy-ssl
          containerPort: 8443
          protocol: TCP
        - name: serf-tcp
          containerPort: 7946
          protocol: TCP
        - name: serf-udp
          containerPort: 7946
          protocol: UDP

@thibaultcha thibaultcha added the task/needs-investigation Requires investigation and reproduction before classifying it as a bug or not. label Feb 28, 2017
@edwardjrp

edwardjrp commented Mar 2, 2017

I'm experiencing a similar issue when launching multiple replicas of Kong: sometimes it works, other times it doesn't. It seems to be a race condition, possibly related to the migrations being run by every pod. My clustering configuration is KONG_CLUSTER_LISTEN: 0.0.0.0:7946, which tells Kong to pick the first non-loopback IP available to the pod, and KONG_CLUSTER_LISTEN_RPC: 127.0.0.1:7373.
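
For reference, a sketch of how those two settings look as env vars in the pod spec (same KONG_-prefixed convention as the template above; values as just described):

env:
- name: KONG_CLUSTER_LISTEN
  value: "0.0.0.0:7946"
- name: KONG_CLUSTER_LISTEN_RPC
  value: "127.0.0.1:7373"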

The identifiable pattern I see is that when it works and the cluster finds everyone, it is because one of the pods ran the migrations to completion and the rest just went through the usual startup.

Any thoughts?

@thibaultcha
Member

@edwardjrp Migrations are not supposed to be run concurrently from multiple Kong instances. They should be run from a single node, ideally via the kong migrations command. Once they are completed, and only then, should you start your Kong nodes.
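
A minimal sketch of that sequence (assuming the datastore settings, e.g. KONG_PG_HOST, are already set in the environment):

# run once, from a single node or a one-off container
kong migrations up

# only after that has completed, start each Kong node
kong start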

@edwardjrp

Thank you @thibaultcha, I figured something was off due to migrations. Any pattern for this worth sharing? Thanks again.

@cknowles

cknowles commented Mar 5, 2017

I hit something similar while trying an upgrade from 0.9.9 to 0.10 RC4. In my case I only have one replica but the way a deploy works in k8s means there can be two for a short period. For some reason the new instance did not respond to any admin or proxy requests until I added KONG_CLUSTER_LISTEN: 0.0.0.0:7946. However, after the migrations ran I took that out again, updated the deployment and it's still. Sorry, I don't have many other details right now other than that.

I suppose how this is handled mainly depends on whether the DB migrations are always backwards compatible for at least one release, i.e. whether they can be rolled out while some connected Kong servers are still up. If they are not, the only way to handle this seems to be an entirely new deploy + DB with a data export. If they are, is there a way to prevent kong start from running migrations? I can't spot a way to disable this in the config or the code. I understand why it's there by default, but maybe under this scenario it would be good to have more external control over the rollout?

@cknowles

cknowles commented Mar 5, 2017

Actually, I've just found that not having KONG_CLUSTER_LISTEN: 0.0.0.0:7946 on 0.10 RC4 in k8s makes the entire deployment not work, not just the migrations. So that part is likely unrelated to this issue.

@edwardjrp

edwardjrp commented Mar 7, 2017

I managed a workaround for the migrations race condition on Kubernetes: run the migrations from a k8s Job, and force the Kong pods to wait until the Job finishes by using init-container logic in the Kong pod definition. Hope this approach is useful for someone else out there.
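
Roughly, the shape is a migrations Job plus an init container in the Kong Deployment that blocks until the Job succeeds. A sketch, not my exact manifests (image tags, names and the kubectl-based wait are illustrative, and the pod needs permission to read Jobs):

apiVersion: batch/v1
kind: Job
metadata:
  name: kong-migrations
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: kong-migrations
        image: mashape/kong:0.10.0rc3
        env:
        - name: KONG_PG_HOST
          value: postgres.default.svc.cluster.local
        - name: KONG_PG_PASSWORD
          value: kong
        command: [ "/bin/sh", "-c", "kong migrations up" ]

# fragment to add under the Kong Deployment's spec.template.spec, so the
# Kong container only starts once the migrations Job has succeeded:
initContainers:
- name: wait-for-migrations
  image: lachlanevenson/k8s-kubectl   # any image that ships kubectl
  command: [ "/bin/sh", "-c",
    "until kubectl get job kong-migrations -o jsonpath='{.status.succeeded}' | grep -q 1; do echo waiting for migrations; sleep 5; done" ]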

@DonMartin76
Author

@edwardjrp Mind sharing your configuration? My problem may also be related to migrations.

@shashiranjan84
Contributor

@edwardjrp Regarding the Kong clustering issue, please try setting KONG_CLUSTER_ADVERTISE to the pod IP; please refer to the updated deployment file:
https://github.com/Mashape/kong-dist-kubernetes/blob/master/kong_postgres.yaml
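
If you would rather keep it in the env section than in the start command, a sketch of the same idea using the Downward API together with Kubernetes' dependent env var expansion (names as in the template earlier in this thread):

env:
- name: KONG_HOST_IP
  valueFrom:
    fieldRef:
      fieldPath: status.podIP
- name: KONG_CLUSTER_ADVERTISE
  value: "$(KONG_HOST_IP):7946"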

@cknowles

@edwardjrp if you wish to share your method to avoid this race in k8s, I'd love to work together to get it added into the Helm Chart I've submitted a PR for.

@edwardjrp

@c-knowles The minute I get a chance I'll submit my solution, though it's not the cleanest since it uses the internal k8s API to check the Job status and forces the use of init containers to wait until the migration has run.

@cknowles

@edwardjrp great, no rush at all. We can work to tidy it up in the incubator perhaps.

@thibaultcha thibaultcha added support and removed task/needs-investigation Requires investigation and reproduction before classifying it as a bug or not. labels Apr 28, 2017
@endeepak

@shashiranjan84 @thibaultcha we are also facing the split-brain issue with Kong inside Docker Swarm. We are running kong:0.9.9 with 3 replicas. I've checked that port 7946 (TCP and UDP) is reachable from each node to the other 2 nodes.

The nodes table has the following entries:

# select * from nodes limit 50;
                             name                             | cluster_listening_address |     created_at
--------------------------------------------------------------+---------------------------+---------------------
 e40cb51550a8_10.0.2.21:7946_639f2b5ee5eb44399d22e2dcc9001c14 | 10.0.2.21:7946            | 2017-08-24 09:46:24
 cfcff4f43698_10.0.2.20:7946_467d6529018c4d0ea0517151c94a37bd | 10.0.2.20:7946            | 2017-08-24 09:46:25
 af40fed6042e_10.0.2.19:7946_be33c1d0e7c94ce297af182385c70e49 | 10.0.2.19:7946            | 2017-08-24 09:46:27
(3 rows)

The cluster API returns the following output:

$ curl kong:8001/cluster
{
	"data":[
		{"address":"10.0.2.20:7946","name":"cfcff4f43698_10.0.2.20:7946_467d6529018c4d0ea0517151c94a37bd","status":"alive"},
		{"address":"10.0.2.21:7946","name":"e40cb51550a8_10.0.2.21:7946_639f2b5ee5eb44399d22e2dcc9001c14","status":"alive"}
	],
	"total": 2
}
$ curl kong:8001/cluster
{
	"data":[
		{"address":"10.0.2.19:7946","name":"af40fed6042e_10.0.2.19:7946_be33c1d0e7c94ce297af182385c70e49","status":"alive"}
	],
	"total":1
}

There are no errors in serf.log
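
One thing that can make the split visible from each node's own point of view is the kong CLI inside each container; Kong 0.9.x still ships the Serf-based cluster subcommands:

# run inside each Kong container
kong cluster members        # the Serf member list as this node sees it
kong cluster reachability   # check whether this node can reach the others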

@edwardjrp

@endeepak I encourage you to move to Kong 0.11.0. It has pretty cool new things, bug fixes, and nice enhancements, including better clustering without using Serf.
Here is how to upgrade; it's pretty straightforward and headache-free:
https://github.com/Mashape/kong/blob/master/UPGRADE.md#upgrade-to-011x

@thibaultcha
Member

I would strongly advise following @edwardjrp's advice as well... The clustering support in 0.10 was indeed subject to race conditions. Our recommended approach to a better clustering experience is an upgrade to Kong 0.11. Kong 0.11 also strongly enforces a manual (or at least, isolated) migration process, which is much safer than Kong 0.10's "automatic" migration behavior with kong start...
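
Condensed, the isolated migration flow looks roughly like this; the exact, authoritative steps are in the UPGRADE.md linked above:

# run the migrations once, in isolation (not implicitly from each node's kong start)
kong migrations up

# once they have completed, start the Kong nodes
kong start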

I will be closing this now, sorry for the delay on our side!
