
Reconciliation is blocked when a Pod can't be created #2266

Closed
barkbay opened this issue Dec 13, 2019 · 5 comments · Fixed by #3020
Labels: >bug Something isn't working

barkbay (Contributor) commented Dec 13, 2019

I ran into a situation where:

  1. A Pod could not be created because of invalid values in its spec.
  2. As a result, ok, err := d.expectationsSatisfied() always returns false.
  3. It is impossible for the user to recover from that situation.

The following manifest helps to recreate the issue:

apiVersion: elasticsearch.k8s.elastic.co/v1beta1
kind: Elasticsearch
metadata:
  name: blocked
spec:
  version: 7.5.0
  nodeSets:
    - name: default
      count: 1
      config:
        node.store.allow_mmap: false
      podTemplate:
        spec:
          volumes:
            - name: BaDVolumeName
              emptyDir: {}

The problem is that we assume a Pod is always at least in the Pending phase, which is not the case here: the Pod creation is rejected by the API server, so the Pod never exists at all.
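Roughly, the blocking check boils down to something like this (simplified sketch, not the actual ECK implementation; assuming the usual appsv1/corev1 imports):

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// expectationsSatisfied is a simplified stand-in for the real check: it only
// compares the StatefulSet's desired replicas with the Pods that actually
// exist. A Pod rejected by the API server at creation time never shows up in
// any phase, so this can never become true and reconciliation keeps waiting.
func expectationsSatisfied(sset appsv1.StatefulSet, actualPods []corev1.Pod) bool {
	var desired int32 = 1
	if sset.Spec.Replicas != nil {
		desired = *sset.Spec.Replicas
	}
	return int32(len(actualPods)) >= desired
}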

barkbay added the >bug label on Dec 13, 2019
sebgl (Contributor) commented Dec 13, 2019

Oh right... good catch. We don't create Pods ourselves anymore so we don't get a Pod creation error :(
And the error itself only appears in the StatefulSet events. I don't think we want to monitor those?

A way out of the above is for the user to manually delete the StatefulSet... but that is not a good idea if some healthy Pods already exist for it.
We could also set a timeout on our expectations (that's what the k8s Deployment controller's expectations do, just in case), so we eventually reconcile anyway when that happens?
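Something along these lines (simplified sketch; the names and the TTL are illustrative, not actual ECK or Deployment controller code):

import "time"

// Illustrative TTL: an expectation that is not satisfied within this window
// stops blocking reconciliation, so a Pod that can never be created does not
// wedge the operator forever. The 5-minute value is made up for the example.
const expectationsTTL = 5 * time.Minute

type expectation struct {
	createdAt time.Time
	satisfied func() bool
}

// blockReconciliation reports whether we should still wait for this
// expectation; expired expectations no longer block anything.
func (e expectation) blockReconciliation() bool {
	if time.Since(e.createdAt) > expectationsTTL {
		return false
	}
	return !e.satisfied()
}

The hard part would be choosing the TTL: too short and we reconcile against a stale cache, too long and the user still waits several minutes after fixing the manifest.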

I don't have any other idea so far.

sebgl (Contributor) commented Apr 10, 2020

Happened with #2854. If you set invalid resource requirements (e.g. CPU requests > CPU limits), the corresponding Pod can never be created, and ECK waits forever for that Pod to appear in expectations, even though you may have since fixed the requirements in the Elasticsearch manifest.

barkbay (Contributor, Author) commented Apr 16, 2020

I hit this one again while playing with topologySpreadConstraints on 1.1.0-rc3 and setting maxSkew to 0:

default     6m3s        Warning   FailedCreate             statefulset/elasticsearch-sample-es-default   create Pod elasticsearch-sample-es-default-0 in StatefulSet elasticsearch-sample-es-default failed error: Pod "elasticsearch-sample-es-default-0" is invalid: spec.topologySpreadConstraints[0].maxSkew: Invalid value: 0: must be greater than zero

I was wondering if we could use the "dry run" feature. It is not available before K8s 1.13, but older API servers return an HTTP 400 error when it is not supported, so we can detect and skip that case:

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/rand"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/elastic/cloud-on-k8s/pkg/utils/k8s" // assumed path of ECK's client wrapper
)

// validatePod returns a function that asks the API server to dry-run the
// creation of a dummy Pod built from the StatefulSet's Pod template, so that
// an invalid Pod spec is caught before the StatefulSet itself is applied.
func validatePod(c k8s.Client, expected appsv1.StatefulSet) func() error {
	return func() error {
		// Build a dummy Pod from the Pod template of the expected StatefulSet.
		dummyPod := &v1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				Namespace:   expected.Namespace,
				Name:        expected.Name + "-dummy-" + rand.String(5),
				Labels:      expected.Spec.Template.Labels,
				Annotations: expected.Spec.Template.Annotations,
			},
			Spec: expected.Spec.Template.Spec,
		}
		// Dry run is beta and available since Kubernetes 1.13.
		if err := c.Create(dummyPod, client.DryRunAll); err != nil {
			// OpenShift 3.11 and K8s 1.12 don't support dry run but return
			// "400 - Bad Request" in that case; treat it as "cannot validate".
			if errors.ReasonForError(err) == metav1.StatusReasonBadRequest {
				return nil
			}
			// If the Pod spec is invalid the expected error is 422 - Unprocessable Entity.
			// On K8s >= 1.13 a valid spec should not return any error here.
			return fmt.Errorf("error while validating Pod for %s/%s: %v", expected.Namespace, expected.Name, err)
		}
		return nil
	}
}

This could be done in the PreCreate or PreUpdate step of the StatefulSet reconciliation (i.e. not during every reconcile loop) and would avoid falling into this situation.
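Roughly like this, building on validatePod above (reconcileStatefulSet and its placement are only illustrative, not the actual ECK reconciler API):

// Hypothetical wiring: run the dry-run validation once, right before the
// StatefulSet is created or updated, and fail fast with an explicit error
// instead of waiting on expectations that can never be satisfied.
func reconcileStatefulSet(c k8s.Client, expected appsv1.StatefulSet) error {
	if err := validatePod(c, expected)(); err != nil {
		return err
	}
	// ... create or update the StatefulSet as usual ...
	return nil
}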

sebgl (Contributor) commented Apr 16, 2020

😲 That's a great idea @barkbay!

barkbay (Contributor, Author) commented Apr 30, 2020

A few thoughts while I'm working on this one:

  • We can update the Elasticsearch Phase to Invalid, as is done when one of the existing validations fails:
NAME                                                                 HEALTH   NODES   VERSION   PHASE     AGE
elasticsearch.elasticsearch.k8s.elastic.co/elasticsearch-sample-ko   green    3       7.6.0     Invalid   4m50s

And of course generate an event (see the sketch after this list):

50s         Warning   ReconciliationError     elasticsearch/elasticsearch-sample-ko
Failed to apply spec change: Validation of PodTemplate for Elasticsearch default/elasticsearch-sample-ko failed for the following reasons:
[{FieldValueInvalid Invalid value: "BaDVolumeName": a DNS-1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?') spec.volumes[0].name} 
{FieldValueInvalid Invalid value: "4Gi": must be less than or equal to memory limit spec.containers[0].resources.requests}]
  • I think there is no need to do the same for the other resources (Kibana, APM, Enterprise Search):
    • Validation is already done when the Deployment is created/updated
    • There is no Phase in the Status
    • There are no "expectations" in these cases, so I don't think it is really useful
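For the first bullet, a rough sketch of what that could look like (the Phase value, reason string and import paths are assumptions based on the output above, not the final implementation):

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	"sigs.k8s.io/controller-runtime/pkg/client"

	esv1 "github.com/elastic/cloud-on-k8s/pkg/apis/elasticsearch/v1" // assumed import path
)

// reportInvalidPodTemplate surfaces a Pod template validation failure to the
// user: emit a warning event and flip the Elasticsearch Phase to "Invalid",
// mirroring what is already done when the existing validations fail.
func reportInvalidPodTemplate(ctx context.Context, c client.Client, recorder record.EventRecorder, es *esv1.Elasticsearch, validationErr error) error {
	recorder.Event(es, corev1.EventTypeWarning, "ReconciliationError",
		fmt.Sprintf("Failed to apply spec change: %v", validationErr))
	es.Status.Phase = "Invalid"
	return c.Status().Update(ctx, es)
}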
