
Timeout when validating admission webhook unreachable #896

Closed
msvechla opened this issue May 21, 2019 · 11 comments
@msvechla

Bug Report

What did you do?

Follow your Quickstart guide: https://www.elastic.co/guide/en/cloud-on-k8s/current/index.html

During the second step when applying the Elasticsearch resource definition, a timeout occurs and the resource is never created:

Error from server (Timeout): error when creating "STDIN": Timeout: request did not complete within requested timeout 30s 

What did you expect to see?

Creation of the Elasticsearch resource

What did you see instead? Under which circumstances?

Timeout, always.

Environment

  • Version information:

eck 0.8.0

  • Kubernetes information:

Running on EKS

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.6", GitCommit:"ab91afd7062d4240e95e51ac00a18bd58fddd365", GitTreeState:"clean", BuildDate:"2019-02-26T12:59:46Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.6-eks-d69f1b", GitCommit:"d69f1bf3669bf00b7f4a758e978e0e7a1e3a68f7", GitTreeState:"clean", BuildDate:"2019-02-28T20:26:10Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
  • Resource definition:
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.1.0
  nodes:
  - nodeCount: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
  • Logs:
    No relevant logs in operator, last message was:
{"level":"info","ts":1558423031.9141233,"logger":"kubebuilder.webhook","msg":"starting the webhook server."}
@barkbay
Contributor

barkbay commented May 21, 2019

Hi,

I would like to be sure that the problem does not come from the validation webhook, which is supposed to perform some sanity checks on the request.

Please could you:

  1. Back up the configuration of the validating webhook:
$ kubectl get ValidatingWebhookConfiguration -o yaml > ValidatingWebhookConfiguration.yaml
  2. Delete the ValidatingWebhookConfiguration:
$ kubectl delete ValidatingWebhookConfiguration validating-webhook-configuration

and then try again?

Thank you

@msvechla
Author

After deleting it, the resource was created successfully. Here is the yaml, maybe it can help with debugging:

apiVersion: v1
items:
- apiVersion: admissionregistration.k8s.io/v1beta1
  kind: ValidatingWebhookConfiguration
  metadata:
    creationTimestamp: 2019-05-21T06:29:29Z
    generation: 1
    name: validating-webhook-configuration
    resourceVersion: "416167"
    selfLink: /apis/admissionregistration.k8s.io/v1beta1/validatingwebhookconfigurations/validating-webhook-configuration
    uid: c9db79f4-7b91-11e9-8da9-0271e0db6b8e
  webhooks:
  - clientConfig:
      caBundle: SENSITIVE
      service:
        name: elastic-webhook-service
        namespace: elastic-system
        path: /validate-elasticsearches
    failurePolicy: Fail
    name: validation.elasticsearch.elastic.co
    namespaceSelector:
      matchExpressions:
      - key: control-plane
        operator: DoesNotExist
    rules:
    - apiGroups:
      - elasticsearch.k8s.elastic.co
      apiVersions:
      - v1alpha1
      operations:
      - CREATE
      - UPDATE
      resources:
      - elasticsearches
    sideEffects: Unknown
  - clientConfig:
      caBundle: SENSITIVE
      service:
        name: elastic-webhook-service
        namespace: elastic-system
        path: /validate-enterpriselicenses
    failurePolicy: Fail
    name: validation.license.elastic.co
    namespaceSelector:
      matchExpressions:
      - key: control-plane
        operator: DoesNotExist
    rules:
    - apiGroups:
      - elasticsearch.k8s.elastic.co
      apiVersions:
      - v1alpha1
      operations:
      - CREATE
      - UPDATE
      resources:
      - enterpriselicenses
    sideEffects: Unknown
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

@msvechla
Author

FYI, I just tried this again on a new cluster, same issue. Here are the startup logs:

{"level":"info","ts":1558472733.2504811,"logger":"manager","msg":"Starting the Cmd."}
{"level":"info","ts":1558472733.3509061,"logger":"kubebuilder.webhook","msg":"installing webhook configuration in cluster"}
{"level":"info","ts":1558472733.351051,"logger":"kubebuilder.admission.cert.writer","msg":"cert is invalid or expiring, regenerating a new one"}
{"level":"info","ts":1558472733.3607376,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"elasticsearch-controller"}
{"level":"info","ts":1558472733.3608296,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"apmserver-controller"}
{"level":"info","ts":1558472733.3608878,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"apm-es-association-controller"}
{"level":"info","ts":1558472733.3609493,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"kibana-association-controller"}
{"level":"info","ts":1558472733.3610291,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"license-controller"}
{"level":"info","ts":1558472733.3611038,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"kibana-controller"}
{"level":"info","ts":1558472733.3611677,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"trial-controller"}
{"level":"info","ts":1558472733.3612182,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"remotecluster-controller"}
{"level":"info","ts":1558472733.4609866,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"apmserver-controller","worker count":1}
{"level":"info","ts":1558472733.4612098,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"kibana-controller","worker count":1}
{"level":"info","ts":1558472733.4613671,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"elasticsearch-controller","worker count":1}
{"level":"info","ts":1558472733.4614935,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"apm-es-association-controller","worker count":1}
{"level":"info","ts":1558472733.4616113,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"kibana-association-controller","worker count":1}
{"level":"info","ts":1558472733.4617238,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"license-controller","worker count":1}
{"level":"info","ts":1558472733.4618585,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"remotecluster-controller","worker count":1}
{"level":"info","ts":1558472733.4619713,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"trial-controller","worker count":1}
{"level":"info","ts":1558472733.9756606,"logger":"kubebuilder.webhook","msg":"starting the webhook server."}
{"level":"error","ts":1558472733.9759276,"logger":"kubebuilder.webhook","msg":"server returns an unexpected error","error":"open /tmp/cert/cert.pem: no such file or directory","stacktrace":"github.com/elastic/cloud-on-k8s/operators/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/elastic/cloud-on-k8s/operators/vendor/sigs.k8s.io/controller-runtime/pkg/webhook.(*Server).run\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/server.go:261\ngithub.com/elastic/cloud-on-k8s/operators/vendor/sigs.k8s.io/controller-runtime/pkg/webhook.(*Server).Start\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/vendor/sigs.k8s.io/controller-runtime/pkg/webhook/server.go:216\ngithub.com/elastic/cloud-on-k8s/operators/vendor/sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).start.func2\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:257"}
{"level":"error","ts":1558472733.9761424,"logger":"manager","msg":"unable to run the manager","error":"open /tmp/cert/cert.pem: no such file or directory","stacktrace":"github.com/elastic/cloud-on-k8s/operators/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/elastic/cloud-on-k8s/operators/cmd/manager.execute\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/cmd/manager/main.go:232\ngithub.com/elastic/cloud-on-k8s/operators/cmd/manager.glob..func1\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/cmd/manager/main.go:56\ngithub.com/elastic/cloud-on-k8s/operators/vendor/github.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/vendor/github.com/spf13/cobra/command.go:766\ngithub.com/elastic/cloud-on-k8s/operators/vendor/github.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/vendor/github.com/spf13/cobra/command.go:852\ngithub.com/elastic/cloud-on-k8s/operators/vendor/github.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/vendor/github.com/spf13/cobra/command.go:800\nmain.main\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/cmd/main.go:28\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:201"} 

Deleting the validating-webhook-configuration solved the issue again.

@barkbay
Contributor

barkbay commented May 22, 2019

Thanks for the update.

I'm investigating an issue that seems to be specific to the Amazon environment.
In the meantime be aware that the webhook is (re)created automatically when the operator is (re)started.

@pebrc
Collaborator

pebrc commented May 22, 2019

The webhook can be disabled permanently by specifying --operator-roles=global,namespace in the operator StatefulSet spec instead of all.
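For illustration, here is a sketch of where that flag would go, assuming the default all-in-one manifest layout (the StatefulSet, namespace, and container names below are assumptions and may differ in your installation):

```yaml
# Sketch only: replace names with those from your actual operator manifest.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elastic-operator
  namespace: elastic-system
spec:
  template:
    spec:
      containers:
      - name: manager
        args:
        - manager
        # "global,namespace" instead of "all" skips installing the webhook
        - --operator-roles=global,namespace
```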

@thomasriley

I've also seen this issue running on GKE. Removing the ValidatingWebhookConfiguration allowed me to apply the demo Elasticsearch YAML.

Happy to help if you need any diagnostics.

trotro added a commit to trotro/cloud-on-k8s that referenced this issue Jun 21, 2019
Replacing `all` with `global,namespace` in the operator stateful set spec solves the issue elastic#896
@barkbay
Contributor

barkbay commented Jul 1, 2019

Sorry for the late answer.

I did some successful tests on Amazon EKS.
My guess is that a rule is missing in the security group of your nodes. Could you please check that the rule allowing communication from the control plane to port 443 on the nodes is present:

[Screenshot: EKS security group inbound rule allowing HTTPS (port 443) from the control plane security group]

Amazon describes it as "Recommended inbound traffic", and its absence could cause this kind of issue by preventing communication from the control plane to the HTTPS server which implements the validating webhook.
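For reference, such a rule can also be added with the AWS CLI; this is a sketch, where sg-nodes and sg-controlplane are placeholders for your actual node and control-plane security group IDs:

```shell
# Allow the EKS control plane to reach the webhook server on the nodes over HTTPS.
# sg-nodes and sg-controlplane are placeholders; substitute your security group IDs.
aws ec2 authorize-security-group-ingress \
  --group-id sg-nodes \
  --protocol tcp \
  --port 443 \
  --source-group sg-controlplane
```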

@MiLk

MiLk commented Jul 5, 2019

I had the same issue, and opening the port 443 from the control plane to the worker nodes solved it.

sebgl added a commit to sebgl/cloud-on-k8s that referenced this issue Jul 8, 2019
EKS users must explicitly enable communication from the k8s control
plane and nodes port 443 in order for the control plane to reach the
validating webhook.

Should help with elastic#896.
sebgl added a commit that referenced this issue Jul 9, 2019
EKS users must explicitly enable communication from the k8s control
plane and nodes port 443 in order for the control plane to reach the
validating webhook.

Should help with #896.
@thbkrkr
Contributor

thbkrkr commented Jul 12, 2019

Closing this as a solution has been identified. Please reopen if needed.

@thbkrkr thbkrkr closed this as completed Jul 12, 2019
sebgl added a commit that referenced this issue Jul 12, 2019
* Support for APM server configuration (#1181)

* Add a config section to the APM server configuration

* APM: Add support for keystore

* Factorize ElasticsearchAuthSettings

* Update dev setup doc + fix GKE bootstrap script (#1203)

* Update dev setup doc + fix GKE bootstrap script

* Update wording of container registry authentication

* Ensure disks removal after removing cluster in GKE (#1163)

* Update gke-cluster.sh

* Implement cleanup for unused disks in GCP

* Update Makefile

* Update CI jobs to do proper cleanup

* Normalize the raw config when creating canonical configs (#1208)

This aims at counteracting the difference between JSON-centric serialization and the use of YAML as the serialization format in canonical config. Without normalization, numeric values
like 1 will differ when comparing configs, as JSON deserializes integer numbers to float64 and YAML to uint64.

* Homogenize logs (#1168)

* Don't run tests if only docs are changed (#1216)

* Update Jenkinsfile

* Simplify notOnlyDocs()

* Update Jenkinsfile

* Push snapshot ECK release on successful PR build (#1184)

* Update makefile's to support snapshots

* Add snapshot releases to Jenkins pipelines

* Cleanup

* Rename RELEASE to USE_ELASTIC_DOCKER_REGISTRY

* Update Jenkinsfile

* Add a note on EKS inbound traffic & validating webhook (#1211)

EKS users must explicitly enable communication from the k8s control
plane and nodes port 443 in order for the control plane to reach the
validating webhook.

Should help with #896.

* Update PodSpec with Hostname from PVC when re-using (#1204)

* Bind the Debug HTTP server to localhost by default (#1220)

* Run e2e tests against custom Docker image (#1135)

* Add implementation

* Update makefile's

* Update Makefile

* Rename Jenkisnfile

* Fix review comments

* Update e2e-custom.yml

* Update e2e-custom.yml

* Return deploy-all-in-one to normal

* Delete GKE cluster only if changes not in docs (#1223)

* Add operator version to resources (#1224)

* Warn if unsupported distribution (#1228)

The operator only works with the official ES distributions to enable the security
available with the basic (free), gold and platinum licenses in order to ensure that
all clusters launched are secured by default.

A check is done in the prepare-fs script by looking at the existence of the
Elastic License. If not present, the script exits with a custom exit code.

Then the ES reconciliation loop sends an event of type warning if it detects that
a prepare-fs init container terminated with this exit code.

* Document Elasticsearch update strategy change budget & groups (#1210)

Add documentation for the `updateStrategy` section of the Elasticsearch
spec.

It documents how (and why) `changeBudget` and `groups` are used by ECK,
and how both settings can be specified by the user.
@tadgh

tadgh commented Jul 17, 2019

Just chiming in to say I experienced this only when attempting to add automated snapshots, i.e. when adding secureSettings to the Elasticsearch resource. The error was:

Internal error occurred: failed calling admission webhook "validation.elasticsearch.elastic.co": Post https://elastic-webhook-service.elastic-system.svc:443/validate-elasticsearches?timeout=30s: service "elastic-webhook-service" not found

Following the above advice of deleting the validating-webhook-configuration and trying again worked for me. Leaving this here as I'm pretty sure it's the same issue.

@PaulGrandperrin

PaulGrandperrin commented Jul 31, 2019

Hi, I saw that this issue is referenced in the documentation, and I just want to write up how and why I ran into it and how we solved it, to help other people.

  • On elastic-operator 0.8.x, everything was working fine.
  • After upgrading to 0.9.0, all resource creation (kubectl apply -f quickstart.yaml) timed out without any helpful message.
  • After digging through many comments on many old issues, I tried deleting the validatingwebhookconfigurations (by the way, this name sounds generic, but I suspect it is in fact very specific to the elastic-operator, am I right?).
  • Then everything worked perfectly (without resource validation, of course), but this is just a hack, not the solution.
  • After digging deeper, I came to understand why it had worked before 0.9.0:
    • We are using private clusters on GKE, which means the Kubernetes masters cannot communicate with arbitrary ports on the pods. By default, only ports 443 and 10250 (kubelet) are open from the masters to the pods.
    • In 0.8.x, the elastic webhook was listening on port 443, but this was changed to 9443 in 0.9.0 to avoid needing the cap_net_bind_service capability: 7d778e8
    • Port 9443 is not whitelisted by default, so this commit broke all GKE installations on private clusters.
  • We added a firewall rule to allow traffic on 9443; here is our Terraform snippet:
// Certmanager deployment (webhook access)
resource "google_compute_firewall" "elastic_operator_webhook_ingress_cluster_2" {
  name      = "${var.project_suffix}-elastic-operator-webhook-ingress-cluster-2"
  network   = google_compute_network.net.name
  direction = "INGRESS"

  allow {
    protocol = "tcp"
    ports    = ["9443"]
  }

  source_ranges = [
    var.k8s_cluster_2_master_ipv4_cidr_block,
  ]

  target_tags = [
    "${var.k8s_cluster_2_name}-node",
  ]
}
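For those not using Terraform, an equivalent rule can be created with gcloud; this is a sketch, where NETWORK, MASTER_CIDR, and NODE_TAG are placeholders for your cluster's network name, master IPv4 CIDR block, and node network tag:

```shell
# Allow the GKE masters to reach the validating webhook on port 9443.
# NETWORK, MASTER_CIDR, and NODE_TAG are placeholders; substitute your values.
gcloud compute firewall-rules create elastic-operator-webhook \
  --network NETWORK \
  --direction INGRESS \
  --allow tcp:9443 \
  --source-ranges MASTER_CIDR \
  --target-tags NODE_TAG
```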

You can find more info here:

@thbkrkr thbkrkr changed the title timeout when applying elasticsearch resource Timeout when validating admission webhook unreachable Jul 31, 2019
@barkbay barkbay mentioned this issue Sep 2, 2019

8 participants