deadlock when enabled cfn-signal on etcd nodes? #525

redbaron · 2017-04-10T21:47:47Z

It seems that there is no support for cfn-signal on etcd nodes, or maybe I am doing something wrong?

The deadlock goes as follows:

etcd-member is of type notify, so it only becomes active once the etcd server has joined the cluster
cfn-signal waits for etcd-member to become active (is-active), which does not happen until etcd-member reports readiness to systemd
CloudFormation waits for cfn-signal to fire before moving on to the next etcd server

etcd-member can't join the cluster because it is the first node and seems to wait for the others to come up, but that can't happen because CloudFormation won't start the next etcd node until the first one reports success.
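
The circular wait described above can be sketched as a small wait-for graph. This is only an illustration: the node labels and the `find_cycle` helper are hypothetical and are not part of kube-aws, systemd, or CloudFormation.

```python
# Hypothetical wait-for graph for the first etcd node.
# Each key is blocked until its value has happened.
WAITS_FOR = {
    "cfn-signal (node 1)": "etcd-member active (node 1)",
    "etcd-member active (node 1)": "etcd joins cluster (node 1)",
    "etcd joins cluster (node 1)": "CloudFormation starts node 2",
    "CloudFormation starts node 2": "cfn-signal (node 1)",
}


def find_cycle(waits_for, start):
    """Follow waits-for edges from start; return the cycle as a list, or None."""
    path = []
    position = {}  # node -> index in path where it was first visited
    node = start
    while node is not None:
        if node in position:
            # We came back to a node already on the path: that is the cycle.
            return path[position[node]:]
        position[node] = len(path)
        path.append(node)
        node = waits_for.get(node)
    return None


cycle = find_cycle(WAITS_FOR, "cfn-signal (node 1)")
print(" -> ".join(cycle + [cycle[0]]))
```

Removing any single edge (for example, letting etcd-member count as active before it has joined the cluster) makes `find_cycle` return `None`, which is the general shape of a fix for this kind of deadlock.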

camilb · 2017-04-10T22:18:35Z

@redbaron Does it assign an EIP to your first node? I had something similar today, but it was because I forgot to add some parameters to the VPC config. Do you get this error in journald?

"Apr 10 13:05:30 ip-10-0-1-108.ec2.internal bash[1059]: run: discovery failed"

mumoshu · 2017-04-11T01:53:58Z

Good catch! It does support cfn-signal. etcdadm-reconfigure.service sets etcd-member's Type to simple or notify according to the number of remaining nodes to be set up. If it fails to signal, maybe there's a bug in etcdadm.
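
To make the difference between the two unit types concrete, here is a minimal sketch of how systemd decides that a service has finished starting. It models only the documented semantics of `Type=simple` and `Type=notify`; the function name and its parameters are hypothetical, and it does not reproduce the logic of etcdadm-reconfigure.service itself.

```python
def is_active(unit_type, process_started, ready_notified):
    """Approximate when systemd treats a service as started.

    simple: as soon as the main process has been started.
    notify: only after the process has sent READY=1 via sd_notify.
    """
    if unit_type == "simple":
        return process_started
    if unit_type == "notify":
        return process_started and ready_notified
    raise ValueError(f"unsupported unit type: {unit_type}")


# First etcd node of a multi-node cluster: the process is running,
# but it has not reported readiness yet.
print(is_active("simple", process_started=True, ready_notified=False))  # True
print(is_active("notify", process_started=True, ready_notified=False))  # False
```

With `simple`, a check such as `systemctl is-active etcd-member` can succeed on the first node before the other members exist, so cfn-signal is free to fire.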

redbaron · 2017-04-11T07:01:29Z

@mumoshu, etcd.disasterRecovery.automated isn't enabled by default, and therefore etcdadm-reconfigure.service is disabled. I am porting our config to the new version, and we don't have this feature in the existing one. Does this mean that this flag isn't really a flag and should always be enabled?

mumoshu · 2017-04-12T00:40:31Z

@redbaron As you've seen, etcd.disasterRecovery.automated is disabled by default, but etcdadm-reconfigure.service is not. etcdadm-reconfigure.service is enabled as long as the specified etcd version is 3+, which is the default.

redbaron · 2017-04-12T10:50:25Z

@mumoshu, hmm, true. I'll investigate further.

redbaron · 2017-04-12T10:53:40Z

@mumoshu, it seems that there is this line https://github.com/kubernetes-incubator/kube-aws/blob/dd345a7bacf74076883f4f8c21df90d18e6e1ab9/core/controlplane/config/templates/cloud-config-etcd#L154 which prevents etcdadm-reconfigure.service from starting unless automation is enabled

mumoshu · 2017-04-12T12:49:29Z

@redbaron Wow!!! You are correct. It must be {{if .Etcd.DisasterRecovery.SupportsEtcdVersion .Etcd.Version -}}, too...
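
The change in the template condition can be sketched as two boolean gates. This is an illustration only: the function names are hypothetical, the "before" gate is inferred from the description in this thread rather than copied from the template, and the version check assumes that "supported" means major version 3 or later, as stated earlier in the discussion.

```python
def supports_etcd_version(version):
    """Assumed rule: etcd 3+ is supported (e.g. "3.1.5" or "v3.1.5")."""
    major = int(version.lstrip("v").split(".")[0])
    return major >= 3


def reconfigure_enabled_before_fix(automated, version):
    # As described in the thread: the unit was only started when
    # automated disaster recovery was turned on.
    return automated


def reconfigure_enabled_after_fix(automated, version):
    # Intended behaviour: the unit starts whenever the etcd version
    # is supported, regardless of the automation flag.
    return supports_etcd_version(version)


# Default configuration: automation off, etcd 3.x (version string is an example).
print(reconfigure_enabled_before_fix(False, "3.1.5"))  # False
print(reconfigure_enabled_after_fix(False, "3.1.5"))   # True
```

Under these assumptions, the default configuration left etcdadm-reconfigure.service unstarted before the fix, so nothing adjusted etcd-member's Type on the first node.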

mumoshu · 2017-04-12T14:26:47Z

@redbaron It is now fixed in master! Would you mind confirming if it works for you?

mumoshu · 2017-04-12T14:30:35Z

I probably didn't notice the issue because my test setup was insufficient: it contained only one etcd node. To reproduce the issue, I believe I needed an odd number of etcd nodes, at least 3.
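
A short piece of arithmetic shows why a single-node setup hides the problem. The sketch below uses the usual majority quorum rule and assumes that an etcd member only reports readiness once the cluster has a quorum; the helper names are hypothetical.

```python
def quorum(cluster_size):
    """Majority quorum: more than half of the members."""
    return cluster_size // 2 + 1


def first_node_ready_alone(cluster_size):
    """True if the first member can form a quorum by itself."""
    return quorum(cluster_size) <= 1


for size in (1, 3, 5):
    print(size, quorum(size), first_node_ready_alone(size))
# prints:
# 1 1 True
# 3 2 False
# 5 3 False
```

With one node, the first member is its own quorum and can signal readiness immediately. With three or more nodes, it needs peers that CloudFormation has not started yet.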

mumoshu added a commit to mumoshu/kube-aws that referenced this issue Apr 12, 2017

Fix the dead-lock while bootstrapping etcd cluster when wait signal is enabled. Resolves kubernetes-retired#525 (cf27e9a)

mumoshu mentioned this issue Apr 12, 2017

Fix the dead-lock while bootstrapping etcd cluster #531

Merged

mumoshu closed this as completed in #531 Apr 12, 2017

kylehodgetts pushed a commit to HotelsDotCom/kube-aws that referenced this issue Mar 27, 2018

Fix the dead-lock while bootstrapping etcd cluster when wait signal is enabled. Resolves kubernetes-retired#525 (bf5fe5f)
