
Prometheus and Alertmanager pods stuck in Pending state with Rook Ceph #433

Closed
surajssd opened this issue May 15, 2020 · 11 comments
Labels
area/components (Items related to components), area/monitoring (Monitoring), bug (Something isn't working), priority/P1 (High priority)

Comments

@surajssd
Member

If you are using Rook Ceph as a storage provider and you install the Prometheus Operator before Rook Ceph, the Prometheus and Alertmanager pods can get stuck in the Pending state. For example, with a configuration like this:

component "prometheus-operator" {
  storage_class = "rook-ceph-block"
}

component "rook" {
  enable_monitoring = true
}

component "rook-ceph" {
  monitor_count = 3

  storage_class {
    enable = true
  }
}

Why would you install Prometheus Operator before Rook?

If you want to enable monitoring for Rook Ceph (enable_monitoring = true), then you should install the Prometheus Operator first. If you don't do that and try to install Rook, it will fail saying that there is no resource of kind ServiceMonitor or PrometheusRule. These CRDs are installed by the Prometheus Operator.

Problem

When you install the Prometheus Operator before Rook Ceph, the Prometheus and Alertmanager pods come up and wait for their respective PVCs to be bound. However, Rook Ceph can take ~15 minutes to become functional.

Meanwhile, the PVCs can give up trying to bind to a PV. In such cases, even after Rook is functional, the pods remain in the Pending state.

Workaround

kubectl -n monitoring delete pvc --all
kubectl -n monitoring delete sts --all

This will trigger the Prometheus Operator to recreate the Prometheus and Alertmanager StatefulSets, and Rook will then satisfy the requests made by the new PVCs.

Permanent solution

Find a way to increase the timeout or the retry count for the controller which makes sure PVCs and PVs get bound.

@surajssd surajssd added the area/components, area/monitoring and bug labels May 15, 2020
@invidian
Member

Note: this issue does occur for OpenEBS as well.

@surajssd surajssd added the proposed/next-sprint label May 20, 2020
@iaguis iaguis added the priority/P1 label and removed the proposed/next-sprint label May 20, 2020
@surajssd
Member Author

We can include the Prometheus Operator CRDs in the cluster installation step itself. That way we can install storage with monitoring enabled, and only then install the prometheus-operator component.

The only problem is that we would have to split the prometheus-operator config in two and maintain it in two places.

@invidian
Member

The only problem is that we would have to split the prometheus-operator config in two and maintain it in two places.

I think with Helm 3 the CRDs can be installed multiple times. We could do something like this:

// pseudo Go code
for _, name := range componentsToInstall {
  if name == "prometheus-operator" {
    componentsToInstall = append(componentsToInstall, "prometheus-operator-crds")
    break
  }
}

Not ideal, but it seems quite simple. Ideally, we should have some kind of dependency management implemented.

@invidian invidian self-assigned this Jun 2, 2020
@invidian
Member

invidian commented Jun 2, 2020

To directly address the case mentioned in the issue, should we perform a check before installing the component to verify that its dependencies have been satisfied? In this case, we could check whether a default storage class is present in the cluster and fail otherwise. Unfortunately, we currently have no mechanism for doing that. The easiest to implement would probably be to introduce an interface like:

type ComponentPreInstall interface {
  PreInstallHook(kubeclient *restclient.Config) error
}

Then, before installing the component, we would try to cast the Component interface to ComponentPreInstall and, if the cast succeeds, run the PreInstallHook() function.

Then, of course, we would make the prometheus-operator component implement that interface.
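
For illustration only, here is a minimal sketch of what such a hook could look like for the prometheus-operator component, using client-go to look for a default StorageClass. The package and component type names, the error messages, and the annotation-based check are assumptions made for this sketch, not existing Lokomotive code:

// Hypothetical sketch of a PreInstallHook for the prometheus-operator
// component: fail early when the cluster has no default StorageClass.
package prometheusoperator // placeholder package name

import (
  "context"
  "fmt"

  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  "k8s.io/client-go/kubernetes"
  restclient "k8s.io/client-go/rest"
)

// component stands in for the existing prometheus-operator component type.
type component struct{}

func (c *component) PreInstallHook(config *restclient.Config) error {
  cs, err := kubernetes.NewForConfig(config)
  if err != nil {
    return fmt.Errorf("creating clientset: %w", err)
  }

  scs, err := cs.StorageV1().StorageClasses().List(context.TODO(), metav1.ListOptions{})
  if err != nil {
    return fmt.Errorf("listing storage classes: %w", err)
  }

  for _, sc := range scs.Items {
    // Standard annotation marking a StorageClass as the cluster default.
    if sc.Annotations["storageclass.kubernetes.io/is-default-class"] == "true" {
      return nil
    }
  }

  return fmt.Errorf("no default StorageClass found, install a storage provider component first")
}

On the caller side, the type assertion described above could then look roughly like this (c, kubeconfig and name are placeholders for whatever the install code already has in scope, with c being a value of the Component interface type):

// Hypothetical call site: run the hook only for components that implement it.
if pre, ok := c.(ComponentPreInstall); ok {
  if err := pre.PreInstallHook(kubeconfig); err != nil {
    return fmt.Errorf("pre-install check for component %q failed: %w", name, err)
  }
}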

@invidian
Member

invidian commented Jun 2, 2020

It seems to me that we are talking about 2 different problems here:

  • the dependency of components on the Prometheus Operator CRDs: if prometheus-operator is not installed, enabling monitoring for other components will fail
  • the dependency of prometheus-operator on a default storage class: the pods will be stuck in Pending if storage is not provided before the component is installed

@invidian
Member

invidian commented Jun 2, 2020

The following configuration works just fine; it just means waiting ~15 minutes for Ceph to get ready.

component "rook" {}

component "rook-ceph" {
  monitor_count = 3

  storage_class {
    enable  = true
    default = true
  }
}

component "prometheus-operator" {}

@iaguis
Contributor

iaguis commented Jun 4, 2020

I'm confused: does the configuration mentioned in #433 (comment) work fine?

If that's so, I think we can document that this ordering is needed for the components to work, and this can be closed, right?

@invidian
Member

invidian commented Jun 4, 2020

If that's so, I think we can document that this ordering is needed for the components to work, and this can be closed, right?

Yes, this configuration works fine. If we agree that the documentation is sufficient "implementation", then yes, we just need to document and move on.

However, a better solution is to solve the 2 problems I mentioned above, which I plan to create separate issues for:

It seems to me that we are talking about 2 different problems here:

  • the dependency of components on the Prometheus Operator CRDs: if prometheus-operator is not installed, enabling monitoring for other components will fail
  • the dependency of prometheus-operator on a default storage class: the pods will be stuck in Pending if storage is not provided before the component is installed

@invidian
Member

invidian commented Jun 4, 2020

Created #557 and #559 for those 2 problems.

@iaguis
Contributor

iaguis commented Jun 4, 2020

Let's close this now that we have:

  1. A way to make this work, even if not ideal.
  2. Issues created to fix this in a proper way.

@iaguis iaguis closed this as completed Jun 4, 2020