This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

Heapster keeps restarting every minute in newly deployed k8s cluster #275

Closed
chreichert opened this issue Oct 28, 2018 · 4 comments · Fixed by #295

Comments

@chreichert

Is this a request for help?:
yes

Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE

What version of acs-engine?:
0.24.1

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes 1.10.9

What happened:
In a newly deployed k8s cluster, the heapster component is being redeployed every minute or so.

What you expected to happen:
Heapster running without problems.

How to reproduce it (as minimally and precisely as possible):
Initial setup of Cluster with acs-engine 0.24.1:

  • Kubernetes 1.10.9
  • Private cluster
  • RBAC enabled
  • Calico network policy
  • Disabled addons:
    • blobfuse-flexvolume
    • smb-flexvolume
  • 1 Master
  • 3 Agentpools as vmss

Anything else we need to know:
I found a question on Stack Overflow (https://stackoverflow.com/questions/46110856/kubernetes-monitoring-service-heapster-keeps-restarting) which describes very similar behaviour. The suggested fix there was to modify the heapster deployment so that it carries the label "addonmanager.kubernetes.io/mode: EnsureExists".

I edited the deployment manifest for heapster on all master nodes and changed that label on the heapster deployment from "Reconcile" to "EnsureExists", which fixed the problem for me.
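For reference, here is a minimal sketch of where that label lives in the addon manifest. The label key and values are as discussed in this thread; the apiVersion, deployment name, and namespace are typical for the heapster addon but may differ in a given cluster, and all unrelated fields are elided:

```yaml
# Sketch of the change in the heapster addon Deployment manifest
# (field values other than the label are illustrative assumptions).
apiVersion: extensions/v1beta1   # may differ depending on Kubernetes version
kind: Deployment
metadata:
  name: heapster
  namespace: kube-system
  labels:
    k8s-app: heapster
    # was: addonmanager.kubernetes.io/mode: Reconcile
    addonmanager.kubernetes.io/mode: EnsureExists
```

With EnsureExists, the addon manager only creates the resource if it is missing and leaves it alone otherwise, rather than continuously reconciling it back to the manifest.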

Doing some research, I found the following PR Azure/acs-engine#3918, which changed that label from "EnsureExists" to "Reconcile", so you might check this PR again? See also PR Azure/acs-engine#1133, which introduced that label and was referenced in that above recommended fix.

@PacoGuy

PacoGuy commented Oct 28, 2018

Same issue with acs-engine 0.24.2 and Kubernetes 1.10.8

Is there any way that we can please get better regression testing before changes are made to the system?
It is becoming nearly impossible to deploy new clusters between this heapster issue, kube-dns failures (7 restarts upon deployment before it stabilizes), and random failures to deploy cse-agents.

@CecileRobertMichon
Contributor

@mboersma what was the reason for removing "EnsureExists" from the heapster deployment in the 1.12 PR?

@mboersma
Member

mboersma commented Nov 8, 2018

My reasoning was that (as you pointed out elsewhere) we found that addons weren't being updated during upgrades unless the addon-manager mode label was Reconcile.

Heapster had both the Reconcile and EnsureExists labels, although they represent mutually exclusive policies as I read it. (I assume having duplicate label keys means the latter wins, so EnsureExists was overriding Reconcile.)

In any case, this was a purposeful change to fix acs-engine upgrade, but we should retest that assumption and see if it's possible to change the policy.

@shaikatz

We were also facing this issue with Kubernetes 1.11.4 bootstrapped with acs-engine 0.25.3.

Elaborating a bit on @chreichert's solution, which did the trick for us as well:

SSH into each of the masters and modify /etc/kubernetes/addons/kube-heapster-deployment.yaml: in the Deployment section, change addonmanager.kubernetes.io/mode from Reconcile to EnsureExists.
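The steps above can be sketched as a short script to run on each master. This is a hedged sketch, not an official procedure: it assumes GNU sed, sudo access, and the manifest path given above, and it keeps a backup so the change can be reverted:

```shell
# Run on each master node (sketch; assumes GNU sed and sudo access).
MANIFEST=/etc/kubernetes/addons/kube-heapster-deployment.yaml

# Keep a backup of the original manifest so the change can be reverted.
sudo cp "$MANIFEST" "$MANIFEST.bak"

# Flip the addon-manager policy from Reconcile to EnsureExists so the
# addon manager stops forcing the heapster deployment back to the manifest.
sudo sed -i \
  's|addonmanager.kubernetes.io/mode: Reconcile|addonmanager.kubernetes.io/mode: EnsureExists|' \
  "$MANIFEST"
```

After the edit, the addon manager should stop restarting heapster; verify with `kubectl -n kube-system get pods` that the heapster pod's restart count stops climbing.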
