This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

deadlock when enabled cfn-signal on etcd nodes? #525

Closed
redbaron opened this issue Apr 10, 2017 · 9 comments

@redbaron
Contributor

It seems that cfn-signal isn't supported on etcd nodes, or maybe I am doing something wrong?

The deadlock goes as follows:

  • etcd-member is of type notify, and the notification fires only when the etcd server joins the cluster
  • cfn-signal waits for etcd-member to become active (is-active), which never happens until etcd-member reports readiness to systemd
  • CloudFormation waits for cfn-signal to fire before moving on to the next etcd server

etcd-member can't join the cluster because it is the first one and seems to wait for the others to come up, but that can't happen because CloudFormation won't start the next etcd node until the first one reports success.
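
For readers following along, the wait-for cycle described above can be sketched as a toy simulation. This is only an illustration, not kube-aws code: the function name `bootstrap` is hypothetical, and it assumes that an etcd member reports ready only once a majority of the configured members are running.

```python
from typing import Tuple


def bootstrap(total_nodes: int) -> Tuple[bool, int]:
    """Simulate a serial, signal-gated rollout of etcd nodes.

    Returns (completed, nodes_launched).
    """
    # Assumption: a member can only join (and report ready) once a
    # majority of the configured members are running.
    quorum = total_nodes // 2 + 1
    launched = 1  # CloudFormation starts the first etcd node

    while True:
        # etcd-member (type notify) reports readiness only after joining.
        ready = launched >= quorum
        if not ready:
            # No readiness -> cfn-signal never fires -> the next node is
            # never launched. This is the deadlock.
            return False, launched
        if launched == total_nodes:
            return True, launched
        # cfn-signal fired, so CloudFormation moves on to the next node.
        launched += 1


print(bootstrap(1))  # single-node setup completes
print(bootstrap(3))  # three-node setup stalls on the first node
```

Under these assumptions a single-node cluster bootstraps fine, while a three-node cluster stalls with only the first node launched.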

@camilb
Contributor

camilb commented Apr 10, 2017

@redbaron Does it assign an EIP to your first node? I had something similar today, but it was because I forgot to add some parameters to the VPC config. Do you get this error in journald?

"Apr 10 13:05:30 ip-10-0-1-108.ec2.internal bash[1059]: run: discovery failed"

@mumoshu
Contributor

mumoshu commented Apr 11, 2017 via email

@redbaron
Contributor Author

@mumoshu, etcd.disasterRecovery.automated isn't enabled by default, which disables etcdadm-reconfigure.service. I am porting our config to the new version, and we don't have this feature in the existing one. Does this mean the flag isn't really a flag and should always be enabled?

@mumoshu
Contributor

mumoshu commented Apr 12, 2017

@redbaron As you've seen, etcd.disasterRecovery.automated is disabled by default, but etcdadm-reconfigure.service isn't. etcdadm-reconfigure.service is enabled as long as the specified etcd version is 3+, which is the default.

@redbaron
Contributor Author

@mumoshu, hmm, true. I'll investigate further.

@redbaron
Contributor Author

@mumoshu, it seems that this line, https://github.com/kubernetes-incubator/kube-aws/blob/dd345a7bacf74076883f4f8c21df90d18e6e1ab9/core/controlplane/config/templates/cloud-config-etcd#L154, prevents etcdadm-reconfigure.service from starting unless automation is enabled.

@mumoshu
Contributor

mumoshu commented Apr 12, 2017

@redbaron Wow!!! You are correct. It must be {{if .Etcd.DisasterRecovery.SupportsEtcdVersion .Etcd.Version -}}, too...
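
The change in gating discussed above can be sketched roughly as follows. This is a hypothetical model, not the actual template or Go code: the function names are made up for illustration, and the version check assumes "3+" simply means a major version of 3 or higher.

```python
def supports_etcd_version(version: str) -> bool:
    # Assumption: "3+" means the major version is at least 3.
    major = int(version.lstrip("v").split(".")[0])
    return major >= 3


def reconfigure_enabled_before(automated: bool, version: str) -> bool:
    # Behaviour described in the thread: the unit does not start
    # unless automated disaster recovery is enabled.
    return automated


def reconfigure_enabled_after(automated: bool, version: str) -> bool:
    # Proposed behaviour: gate on etcd version support instead,
    # regardless of the automated flag.
    return supports_etcd_version(version)


# With the default (automated disabled) and a hypothetical 3.x version:
print(reconfigure_enabled_before(False, "3.1.5"))
print(reconfigure_enabled_after(False, "3.1.5"))
```

The point of the sketch is only that, with the flag left at its default, the unit was skipped before the fix and is started after it.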

mumoshu added a commit to mumoshu/kube-aws that referenced this issue Apr 12, 2017
@mumoshu
Contributor

mumoshu commented Apr 12, 2017

@redbaron It is now fixed in master! Would you mind confirming if it works for you?

@mumoshu
Contributor

mumoshu commented Apr 12, 2017

Perhaps I didn't notice the issue because of my insufficient test setup, which contained only one etcd node. To reproduce the issue, I believe I needed an odd number of etcd nodes, at least 3.

tyrannasaurusbanks pushed a commit to tyrannasaurusbanks/kube-aws that referenced this issue Apr 19, 2017
…update-to-latest-kube-aws-master to hcom-flavour

* commit '175217133f75b3c251536bc0d51ccafd2b1a5de4':
  Fix the dead-lock while bootstrapping etcd cluster when wait signal is enabled. Resolves kubernetes-retired#525
  Fix elasticFileSystemId to be propagated to node pools Resolves kubernetes-retired#487
  'Cluster-dump' feature to export Kubernetes Resources to S3
  Follow-up for the multi API endpoints support This fixes the issue which prevented a k8s cluster from being properly configured when multiple API endpoints are defined in cluster.yaml.
  Fix incorrect validations on apiEndpoints Ref kubernetes-retired#520 (comment)
  Wait until kube-system becomes ready Resolves kubernetes-retired#467
camilb added a commit to camilb/kube-aws that referenced this issue Apr 21, 2017
* kubernetes-incubator/master:
  Don't mount /var/lib/rkt into kubelet to avoid shared bind-mounts propagation
  Fix to calico configuration file etcd endpoints
  Fix hyperlink to restore script in Readme.md. Reference 'autosave' rather than 'export' in comments of cluster.yaml.
  'Restore' feature to restore Kubernetes Resources from S3 backup
  Add missing '/' when constructing the Autosave S3 put path
  Shared Persistent Volume (kubernetes-retired#471)
  Fix an incorrect variable name in the e2e/run script
  Add documentation for administrating etcd cluster Resolves kubernetes-retired#491
  use gzip base64 encoding for customFiles
  New options: customFiles and customSystemdUnits
  Add cluster.yaml details for apiEndpointName
  Fix the dead-lock while bootstrapping etcd cluster when wait signal is enabled. Resolves kubernetes-retired#525
  Fix elasticFileSystemId to be propagated to node pools Resolves kubernetes-retired#487
  Minor fixup for etcd unit files
  Fix up apiEndpoints.loadBalancer config
kylehodgetts pushed a commit to HotelsDotCom/kube-aws that referenced this issue Mar 27, 2018