
Reconciliation is blocked when a Pod can't be created #2266

Closed
barkbay opened this issue Dec 13, 2019 · 5 comments · Fixed by #3020
Labels: >bug Something isn't working

barkbay (Contributor) commented Dec 13, 2019

I ran into a situation where:

  1. A Pod could not be created because of invalid values in its spec.
  2. As a result, ok, err := d.expectationsSatisfied() always returns false.
  3. It is impossible for the user to recover from that situation.

The following manifest helps to recreate the issue:

apiVersion: elasticsearch.k8s.elastic.co/v1beta1
kind: Elasticsearch
metadata:
  name: blocked
spec:
  version: 7.5.0
  nodeSets:
    - name: default
      count: 1
      config:
        node.store.allow_mmap: false
      podTemplate:
        spec:
          volumes:
            - name: BaDVolumeName
              emptyDir: {}

The problem is that we assume a Pod is always at least in the Pending phase, which is not the case here: the Pod creation is rejected by the API server, so the Pod never exists at all.
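Roughly, the blocking check boils down to something like this (simplified sketch, not the actual ECK implementation; assuming the usual appsv1/corev1 imports):

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// expectationsSatisfied is a simplified stand-in for the real check: it only
// compares the StatefulSet's desired replicas with the Pods that actually
// exist. A Pod rejected by the API server at creation time never shows up in
// any phase, so this can never become true and reconciliation keeps waiting.
func expectationsSatisfied(sset appsv1.StatefulSet, actualPods []corev1.Pod) bool {
	var desired int32 = 1
	if sset.Spec.Replicas != nil {
		desired = *sset.Spec.Replicas
	}
	return int32(len(actualPods)) >= desired
}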

barkbay added the >bug label on Dec 13, 2019
sebgl (Contributor) commented Dec 13, 2019

Oh right... good catch. We don't create Pods ourselves anymore so we don't get a Pod creation error :(
And the error itself only appears in the StatefulSet events. I don't think we want to monitor those?

A way out of the above is for the user to manually delete the StatefulSet... but that is not a good idea if some healthy Pods already exist for it.
We could also set a timeout on our expectations (that's what the k8s Deployment controller's expectations do, just in case), so we eventually reconcile anyway when that happens?
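Something along these lines (simplified sketch; the names and the TTL are illustrative, not actual ECK or Deployment controller code):

import "time"

// Illustrative TTL: an expectation that is not satisfied within this window
// stops blocking reconciliation, so a Pod that can never be created does not
// wedge the operator forever. The 5-minute value is made up for the example.
const expectationsTTL = 5 * time.Minute

type expectation struct {
	createdAt time.Time
	satisfied func() bool
}

// blockReconciliation reports whether we should still wait for this
// expectation; expired expectations no longer block anything.
func (e expectation) blockReconciliation() bool {
	if time.Since(e.createdAt) > expectationsTTL {
		return false
	}
	return !e.satisfied()
}

The hard part would be choosing the TTL: too short and we reconcile against a stale cache, too long and the user still waits several minutes after fixing the manifest.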

I don't have any other idea so far.

sebgl (Contributor) commented Apr 10, 2020

Happened with #2854. If you set invalid resource requirements (e.g. CPU requests > CPU limits), the corresponding Pod can never be created, and ECK waits forever for that Pod to appear in expectations, even though you may have since fixed the requirements in the Elasticsearch manifest.

barkbay (Contributor, Author) commented Apr 16, 2020

I hit this one again while playing with topologySpreadConstraints on 1.1.0-rc3 and setting maxSkew to 0:

default     6m3s        Warning   FailedCreate             statefulset/elasticsearch-sample-es-default   create Pod elasticsearch-sample-es-default-0 in StatefulSet elasticsearch-sample-es-default failed error: Pod "elasticsearch-sample-es-default-0" is invalid: spec.topologySpreadConstraints[0].maxSkew: Invalid value: 0: must be greater than zero

I was wondering if we could use the "dry run" feature. It is not available before K8s 1.13, but older API servers return an HTTP 400 error when it is not supported, so we can detect and skip that case:

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/rand"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/elastic/cloud-on-k8s/pkg/utils/k8s" // assumed path of ECK's client wrapper
)

// validatePod returns a function that asks the API server to dry-run the
// creation of a dummy Pod built from the StatefulSet's Pod template, so that
// an invalid Pod spec is caught before the StatefulSet itself is applied.
func validatePod(c k8s.Client, expected appsv1.StatefulSet) func() error {
	return func() error {
		// Build a dummy Pod from the Pod template of the expected StatefulSet.
		dummyPod := &v1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				Namespace:   expected.Namespace,
				Name:        expected.Name + "-dummy-" + rand.String(5),
				Labels:      expected.Spec.Template.Labels,
				Annotations: expected.Spec.Template.Annotations,
			},
			Spec: expected.Spec.Template.Spec,
		}
		// Dry run is beta and available since Kubernetes 1.13.
		if err := c.Create(dummyPod, client.DryRunAll); err != nil {
			// OpenShift 3.11 and K8s 1.12 don't support dry run but return
			// "400 - Bad Request" in that case; treat it as "cannot validate".
			if errors.ReasonForError(err) == metav1.StatusReasonBadRequest {
				return nil
			}
			// If the Pod spec is invalid the expected error is 422 - Unprocessable Entity.
			// On K8s >= 1.13 a valid spec should not return any error here.
			return fmt.Errorf("error while validating Pod for %s/%s: %v", expected.Namespace, expected.Name, err)
		}
		return nil
	}
}

This could be done in the PreCreate or PreUpdate step of the StatefulSet reconciliation (i.e. not during every reconcile loop) and would avoid falling into this situation.
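Roughly like this, building on validatePod above (reconcileStatefulSet and its placement are only illustrative, not the actual ECK reconciler API):

// Hypothetical wiring: run the dry-run validation once, right before the
// StatefulSet is created or updated, and fail fast with an explicit error
// instead of waiting on expectations that can never be satisfied.
func reconcileStatefulSet(c k8s.Client, expected appsv1.StatefulSet) error {
	if err := validatePod(c, expected)(); err != nil {
		return err
	}
	// ... create or update the StatefulSet as usual ...
	return nil
}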

sebgl (Contributor) commented Apr 16, 2020

😲 That's a great idea @barkbay!

barkbay (Contributor, Author) commented Apr 30, 2020

A few thoughts while I'm working on this one:

  • We can update the Elasticsearch Phase to Invalid, as is done when one of the existing validations fails:
NAME                                                                 HEALTH   NODES   VERSION   PHASE     AGE
elasticsearch.elasticsearch.k8s.elastic.co/elasticsearch-sample-ko   green    3       7.6.0     Invalid   4m50s

And of course generate an event (see the sketch after this list):

50s         Warning   ReconciliationError     elasticsearch/elasticsearch-sample-ko
Failed to apply spec change: Validation of PodTemplate for Elasticsearch default/elasticsearch-sample-ko failed for the following reasons:
[{FieldValueInvalid Invalid value: "BaDVolumeName": a DNS-1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?') spec.volumes[0].name} 
{FieldValueInvalid Invalid value: "4Gi": must be less than or equal to memory limit spec.containers[0].resources.requests}]
  • I think there is no need to do the same for the other resources (Kibana, APM, Enterprise Search):
    • Validation is already done when the Deployment is created/updated
    • There is no Phase in the Status
    • There are no "expectations" in these cases, so I don't think it is really useful
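For the first bullet, a rough sketch of what that could look like (the Phase value, reason string and import paths are assumptions based on the output above, not the final implementation):

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	"sigs.k8s.io/controller-runtime/pkg/client"

	esv1 "github.com/elastic/cloud-on-k8s/pkg/apis/elasticsearch/v1" // assumed import path
)

// reportInvalidPodTemplate surfaces a Pod template validation failure to the
// user: emit a warning event and flip the Elasticsearch Phase to "Invalid",
// mirroring what is already done when the existing validations fail.
func reportInvalidPodTemplate(ctx context.Context, c client.Client, recorder record.EventRecorder, es *esv1.Elasticsearch, validationErr error) error {
	recorder.Event(es, corev1.EventTypeWarning, "ReconciliationError",
		fmt.Sprintf("Failed to apply spec change: %v", validationErr))
	es.Status.Phase = "Invalid"
	return c.Status().Update(ctx, es)
}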
