This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

etcd version 3 does not send cfn-signal #497

Closed
jeremyd opened this issue Apr 4, 2017 · 4 comments

Comments

@jeremyd
Contributor

jeremyd commented Apr 4, 2017

(Edited: this also happens with etcd2.) I tried the new etcd support and specified `version: 3.1.5` in cluster.yaml. etcd starts, but it is version 3.0.x, and cfn-signal never fires. I tried to debug why it does not fire, since etcd3 itself started up fine, and I'm seeing a weird issue: it does not look like cfn-signal was ever triggered to run, and when I manually run `systemctl start cfn-signal` it hangs forever. When I look at the process list, for some reason it shows systemd-tty-ask-password-agent running (see below). I'm using the latest master.

root      3710  0.0  0.0  33420  3084 pts/0    S+   18:25   0:00                      \_ systemctl start cfn-signal
root      3711  0.0  0.0  33288  3188 pts/0    S+   18:25   0:00                          \_ /usr/bin/systemd-tty-ask-password-agent --watch

I also just tried etcd2, and the signal was not received there either. That problem is slightly different, though: cfn-signal did start up, but CloudFormation never saw the signal come in.
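
For anyone debugging a similar hang: as far as I know, `systemctl` itself spawns `systemd-tty-ask-password-agent --watch` while it waits for a job on a terminal, so seeing that process is not necessarily a sign of trouble. One hypothesis (an assumption, not something confirmed in this thread) is that the `cfn-signal` start job is queued behind another unit that has not finished starting. Below is a minimal Python sketch that parses the output of `systemctl list-jobs` to show which jobs are waiting. The helper names `parse_list_jobs` and `pending_jobs` are made up for illustration.

```python
import subprocess
from typing import Dict, List


def parse_list_jobs(output: str) -> List[Dict[str, str]]:
    """Parse the text printed by `systemctl list-jobs` into a list of dicts.

    Each job row has four whitespace-separated columns: JOB UNIT TYPE STATE.
    Header, footer, and blank lines are skipped.
    """
    jobs = []
    for line in output.splitlines():
        fields = line.split()
        # Job rows start with a numeric job id and have exactly four columns.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        job_id, unit, job_type, state = fields
        jobs.append({"id": job_id, "unit": unit, "type": job_type, "state": state})
    return jobs


def pending_jobs() -> List[Dict[str, str]]:
    """Run `systemctl list-jobs` on the current host and parse the result."""
    result = subprocess.run(
        ["systemctl", "list-jobs", "--no-legend"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_list_jobs(result.stdout)
```

Calling `pending_jobs()` on an etcd node and looking for entries whose `state` is `waiting` would show whether `cfn-signal.service` is stuck behind another unit. `journalctl -u cfn-signal` is the other obvious place to look.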

@camilb
Contributor

camilb commented Apr 4, 2017

I'm still investigating this issue. It's weird because it worked once: with 3 etcd nodes, all of them sent cfn-signal, and I even got a snapshot in an S3 bucket. Now I can't get even one "success" through. Tomorrow I'm planning to increase the waiting time for the signal and to debug each failed service on the etcd nodes.

@camilb
Contributor

camilb commented Apr 5, 2017

@jeremyd maybe #504 fixes this issue, in case you had terminated instances left over from previous deployments. @mumoshu pointed out this bug in #501.

mumoshu added a commit that referenced this issue Apr 6, 2017
* Bump to Kubernetes v1.6.1

This change was just the result of running the following commands:

```
$ contrib/bump-version v1.6.1_coreos.0
Updating contrib/bump-version
Updating core/controlplane/config/config.go
Updating core/controlplane/config/templates/cluster.yaml
Updating e2e/kubernetes/Dockerfile
Updating e2e/kubernetes/Makefile
Updating vendor/github.com/aws/aws-sdk-go/CHANGELOG.md
$ git checkout -p -- vendor
```

As etcd3 support was already introduced via #417, this should ideally have been just a matter of running E2E against a newly created kube-aws cluster with k8s 1.6.1. That turned out not to be the case, hence the subsequent changes.

* Use etcd3 by default

etcd2 support will be dropped soon, as the etcd3 storage driver has been the default since k8s v1.6.0.

* Bump to calico-cni v1.6.2, which is an even newer release than the one included in the latest calico v2.1.2, to deal with kubernetes/kubernetes#43488

* Set up /etc/kubernetes/cni/net.d ourselves rather than via calico-cni, to deal with kubernetes/kubernetes#43014

* Set up /opt/cni/bin using docker rather than a k8s static pod to prevent temporary "failed to find plugin * in path" errors from cni

These errors were emitted when pods were scheduled before /opt/cni/bin had been populated:

```
Error syncing pod, skipping: failed to "CreatePodSandbox" for "kube-dns-3816048056-cwx62_kube-system(12c3204f-1a54-11e7-bfb0-06751e989ae7)" with CreatePodSandboxError: "CreatePodSandbox for pod \"kube-dns-3816048056-cwx62_kube-system(12c3204f-1a54-11e7-bfb0-06751e989ae7)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"kube-dns-3816048056-cwx62_kube-system\" network: failed to find plugin \"loopback\" in path [/opt/loopback/bin /opt/cni/bin]"
```

* Fix a bug that caused etcd-member.service to use the default version number 3.0.x regardless of what is specified via `etcd.version` in cluster.yaml. The bug was reported in #497 (comment)

* Simplify EtcdVersion func

According to the review comment #492 (review)

* Fix permanent errors like "failed to find plugin * in path" from cni which was breaking cni + flannel/calico in k8s 1.6, by specifying the `--cni-bin-dir=/opt/cni/bin` flag for kubelets

The default directory was accidentally changed in at least k8s 1.6.0 and 1.6.1.

Resolves #494
Resolves #495

E2E against a cluster with flannel passed after this change:

```
$ ETCD_VERSION=3 ETCD_SNAPSHOT_AUTOMATED=1 ETCD_DISASTER_RECOVERY_AUTOMATED=1 ETCD_COUNT=3 KUBE_AWS_CLUSTER_NAME=kubeaws2 ./run all
*snip*
Ran 151 of 588 Specs in 3492.050 seconds
SUCCESS! -- 151 Passed | 0 Failed | 0 Pending | 437 Skipped PASS

Ginkgo ran 1 suite in 58m12.359210255s
Test Suite Passed
2017/04/04 09:35:29 util.go:127: Step './hack/ginkgo-e2e.sh --ginkgo.focus=\[Conformance\]' finished in 58m12.683100213s
2017/04/04 09:35:29 e2e.go:80: Done
```

Also passed against a cluster with calico:

```
Ran 151 of 588 Specs in 3381.108 seconds
SUCCESS! -- 151 Passed | 0 Failed | 0 Pending | 437 Skipped PASS

Ginkgo ran 1 suite in 56m21.415087252s
Test Suite Passed
2017/04/06 03:58:20 util.go:131: Step './hack/ginkgo-e2e.sh --ginkgo.focus=\[Conformance\]' finished in 56m21.76726736s
2017/04/06 03:58:20 e2e.go:80: Done
```
@mumoshu
Contributor

mumoshu commented Apr 6, 2017

Thanks to both of you 🙇
#504 is merged and kube-aws v0.9.6-rc.1 which includes the fix is released.
@jeremyd Could you please confirm whether the new release works for you now?
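
One way to confirm the version part of the fix is to compare the version the running etcd reports with the one set in cluster.yaml. This is a rough sketch, not part of kube-aws: it assumes you have already fetched the JSON body of etcd's `/version` endpoint from a node (how you reach it, including any TLS client certificates, depends on your cluster), and the function names are hypothetical.

```python
import json


def etcd_server_version(version_body: str) -> str:
    """Extract the server version from the JSON body of etcd's /version endpoint.

    The body looks like {"etcdserver": "...", "etcdcluster": "..."}.
    """
    return json.loads(version_body)["etcdserver"]


def matches_configured(version_body: str, configured: str) -> bool:
    """Return True if the running etcd reports exactly the configured version."""
    return etcd_server_version(version_body) == configured
```

For example, with `version: 3.1.5` in cluster.yaml, `matches_configured(body, "3.1.5")` should be true on each etcd node after the fix. Before the fix it would be false, since the nodes were running the 3.0.x default.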

@jeremyd
Contributor Author

jeremyd commented Apr 6, 2017

This works now, thanks!

@jeremyd jeremyd closed this as completed Apr 6, 2017
kylehodgetts pushed a commit to HotelsDotCom/kube-aws that referenced this issue Mar 27, 2018