Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(controller): Improve resilience to transient errors. Fixes #3791 and #3217 #3800

Closed
wants to merge 104 commits into from

Conversation

alexec
Copy link
Contributor

@alexec alexec commented Aug 17, 2020

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed the CLA.
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

Fixes #3791
Fixes #3217
Fixes #3645
See #1913
See #3812
See #3745
Depends on #3846

Changes

@sarabala1979 sarabala1979 self-assigned this Aug 17, 2020

// we must check to see if the pod exists rather than just optimistically creating the pod and see if we get
// an `AlreadyExists` error because we won't get that error if there is not enough resources
obj, exists, err := woc.controller.podInformer.GetStore().Get(cache.ExplicitKey(pod.Namespace + "/" + pod.Name))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This solution depends on the informer being correct when queries here.

If the informer is incorrect, i.e believe that a pod has not been created when it in fact has, then the workflow will error.

An alternative solution is to do podInterface.Get(pod.Name) and see what error it returns (i.e. is it an apierr.IsNotFound(..)?).

@@ -56,6 +56,7 @@ AUTH_MODE := client
endif
K3D := $(shell if [ "`which kubectl`" != '' ] && [ "`kubectl config current-context`" = "k3s-default" ]; then echo true; else echo false; fi)
LOG_LEVEL := debug
UPPERIO_DB_DEBUG := 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disabled the SQL logging by default now (I found it confusing)

@@ -2,6 +2,6 @@ apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- minio-pod.yaml
- minio-deployment.yaml
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changing to a deployment allows us to disable mysql by scaling it down

@@ -0,0 +1,9 @@
apiVersion: v1
kind: PersistentVolumeClaim
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need a PVC for data to survive scaled down/up

@@ -5,7 +5,7 @@ metadata:
data:
containerRuntimeExecutor: pns
executor: |
imagePullPolicy: Never
imagePullPolicy: IfNotPresent
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change the default for tests

// * Fails the `reapplyUpdate` pre-condition - it can never recover.
// * It will double the number of Kubernetes API requests.
if woc.orig.ResourceVersion != woc.wf.ResourceVersion {
panic("cannot re-apply update with unequal original and modified resource versions")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of panic. Can we re-queue the workflow and return? otherwise, log monitor like (Splunk) will keep reporting the panic

@alexec
Copy link
Contributor Author

alexec commented Aug 23, 2020

Broken up into smaller PRs that will be easier to review and to revert if/when needed.

@alexec alexec deleted the fix-3791 branch December 5, 2020 00:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants