etcd version 3 does not send cfn-signal #497

jeremyd · 2017-04-04T19:46:02Z

<EDITED: this also happens with etcd2> I tried the new etcd support and specified `version: 3.1.5` in cluster.yaml. etcd starts, but it is version 3.0.x, and cfn-signal never fires. I tried to debug why it does not fire, since etcd3 started up fine, and I'm seeing a weird issue: it does not look like cfn-signal was ever triggered, and when I manually run `systemctl start cfn-signal` it hangs forever. When I look at the process list, for some reason it shows systemd-tty-ask-password-agent running... Using latest master.

root      3710  0.0  0.0  33420  3084 pts/0    S+   18:25   0:00                      \_ systemctl start cfn-signal
root      3711  0.0  0.0  33288  3188 pts/0    S+   18:25   0:00                          \_ /usr/bin/systemd-tty-ask-password-agent --watch

I also just tried etcd2, and the signal was not received there either. That problem is slightly different, though: cfn-signal did start, but CloudFormation never saw the signal come in.
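For anyone hitting the same hang, here is a minimal sketch of how the stuck unit could be inspected. `systemctl start` blocks until the start job finishes, so a hang usually means the job is queued behind something else; the commands below list pending jobs, the unit's dependencies, its status, and its recent journal. The `run` and `diagnose_unit` helpers and the `DRY_RUN` variable are hypothetical names for this example and are not part of kube-aws; only the `systemctl` and `journalctl` invocations are standard.

```shell
#!/bin/sh
# Hypothetical helper for inspecting a unit whose `systemctl start` hangs.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

diagnose_unit() {
  unit="$1"
  # Jobs currently queued or running; a hung start usually shows up here.
  run systemctl list-jobs --no-pager
  # Units this one requires or wants, to spot one that never became active.
  run systemctl list-dependencies "$unit" --no-pager
  # Current state of the unit itself.
  run systemctl status "$unit" --no-pager
  # Last 50 journal lines logged for the unit.
  run journalctl -u "$unit" -n 50 --no-pager
}

# Example (on the etcd node): diagnose_unit cfn-signal
```

Running `DRY_RUN=1 diagnose_unit cfn-signal` first prints the four commands without touching the system, which is handy for checking them before running on a node.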


camilb · 2017-04-04T21:51:09Z

I'm still investigating this issue. It's weird because it worked once: 3 etcd nodes, and all of them sent cfn-signal. I even got a snapshot in an S3 bucket. Now I can't get even one "success" through. Tomorrow I'm planning to increase the wait time for the signal and debug each failed service on the etcd nodes.

camilb · 2017-04-05T16:15:48Z

@jeremyd maybe #504 fixes this issue, in case you had terminated instances from previous deployments, as @mumoshu pointed out in #501.

* Bump to Kubernetes v1.6.1

  This change was just the result of running the following commands:

  ```
  $ contrib/bump-version v1.6.1_coreos.0
  Updating contrib/bump-version
  Updating core/controlplane/config/config.go
  Updating core/controlplane/config/templates/cluster.yaml
  Updating e2e/kubernetes/Dockerfile
  Updating e2e/kubernetes/Makefile
  Updating vendor/github.com/aws/aws-sdk-go/CHANGELOG.md
  $ git checkout -p -- vendor
  ```

  As etcd3 support is already introduced via #417, after this change is introduced, it was ideally a matter of running E2E against a newly created kube-aws cluster with k8s 1.6.1, which turned out not to be true, hence the subsequent changes.

* Use etcd3 by default

  etcd2 support will be dropped soon, as the etcd3 storage driver is already the default since k8s v.1.6.0.

* Bump to calico-cni v1.6.2, which is an even newer release than the one included in the latest calico v2.1.2, to deal with kubernetes/kubernetes#43488

* Set up /etc/kubernetes/cni/net.d not using calico-cni but by our own to deal with kubernetes/kubernetes#43014

* Set up /opt/cni/bin using docker rather than a k8s static pod to prevent temporary "failed to find plugin * in path" errors from cni

  They were emitted when pods are scheduled but /opt/cni/bin is not yet populated

  ```
  Error syncing pod, skipping: failed to "CreatePodSandbox" for "kube-dns-3816048056-cwx62_kube-system(12c3204f-1a54-11e7-bfb0-06751e989ae7)" with CreatePodSandboxError: "CreatePodSandbox for pod \"kube-dns-3816048056-cwx62_kube-system(12c3204f-1a54-11e7-bfb0-06751e989ae7)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"kube-dns-3816048056-cwx62_kube-system\" network: failed to find plugin \"loopback\" in path [/opt/loopback/bin /opt/cni/bin]"
  ```

* Fix a bug that resulted etcd-member.service to use the default version number 3.0.x regardless of what is specified via `etcd.version` in cluster.yaml.

  The bug was reported in #497 (comment)

* Simplify EtcdVersion func

  According to the review comment #492 (review)

* Fix permanent errors like "failed to find plugin * in path" from cni which was breaking cni + flannel/calico in k8s 1.6, by specifying the `--cni-bin-dir=/opt/cni/bin` flag for kubelets

  The default dir had been accidentally changed at least in k8s 1.6.0 and 1.6.1.

  Resolves #494
  Resolves #495

E2E against a cluster with flannel passed after this change:

```
$ ETCD_VERSION=3 ETCD_SNAPSHOT_AUTOMATED=1 ETCD_DISASTER_RECOVERY_AUTOMATED=1 ETCD_COUNT=3 KUBE_AWS_CLUSTER_NAME=kubeaws2 ./run all
*snip*
Ran 151 of 588 Specs in 3492.050 seconds
SUCCESS! -- 151 Passed | 0 Failed | 0 Pending | 437 Skipped PASS

Ginkgo ran 1 suite in 58m12.359210255s
Test Suite Passed
2017/04/04 09:35:29 util.go:127: Step './hack/ginkgo-e2e.sh --ginkgo.focus=\[Conformance\]' finished in 58m12.683100213s
2017/04/04 09:35:29 e2e.go:80: Done
```

Also passed against a cluster with calico:

```
Ran 151 of 588 Specs in 3381.108 seconds
SUCCESS! -- 151 Passed | 0 Failed | 0 Pending | 437 Skipped PASS

Ginkgo ran 1 suite in 56m21.415087252s
Test Suite Passed
2017/04/06 03:58:20 util.go:131: Step './hack/ginkgo-e2e.sh --ginkgo.focus=\[Conformance\]' finished in 56m21.76726736s
2017/04/06 03:58:20 e2e.go:80: Done
```

mumoshu · 2017-04-06T05:01:53Z

Thanks to both of you 🙇
#504 is merged, and kube-aws v0.9.6-rc.1, which includes the fix, is released.
@jeremyd Could you please confirm whether the new release works for you now?

jeremyd · 2017-04-06T06:29:15Z

This works now, thanks!


jeremyd closed this as completed Apr 6, 2017
