Prometheus and Alertmanager pods stuck in Pending state with Rook Ceph #433
Comments
Note: this issue does occur for OpenEBS as well. |
We can include the Prometheus CRDs in the cluster installation step itself. This will make sure storage is installed with monitoring enabled, and then we install the Prometheus Operator. The only problem is that we would have to split the Prometheus Operator configuration in two and maintain it in two places. |
I think with Helm 3, the CRDs can be installed multiple times. I think we could do something like this:

```go
// Pseudo Go code: "contains" is a hypothetical helper checking list membership.
if contains(componentsToInstall, "prometheus-operator") {
	componentsToInstall = append(componentsToInstall, "prometheus-operator-crds")
}
```

Not ideal, but seems quite simple. Ideally we should have some kind of dependency management implemented; a rough sketch of what that could look like is below.
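A minimal sketch of such dependency management, assuming a hypothetical componentDependencies map; none of these names exist in Lokomotive today:

```go
package main

import "fmt"

// componentDependencies maps a component to the components that must be
// installed before it. The entries here are assumptions for illustration.
var componentDependencies = map[string][]string{
	"prometheus-operator": {"prometheus-operator-crds"},
	"rook-ceph":           {"rook"},
}

// expandWithDependencies expands the requested components so that every
// (transitive) dependency appears exactly once and before its dependents.
func expandWithDependencies(requested []string) []string {
	seen := map[string]bool{}
	var ordered []string

	var visit func(name string)
	visit = func(name string) {
		if seen[name] {
			return
		}
		seen[name] = true
		for _, dep := range componentDependencies[name] {
			visit(dep)
		}
		ordered = append(ordered, name)
	}

	for _, name := range requested {
		visit(name)
	}
	return ordered
}

func main() {
	fmt.Println(expandWithDependencies([]string{"rook-ceph", "prometheus-operator"}))
	// Output: [rook rook-ceph prometheus-operator-crds prometheus-operator]
}
```

This would keep the ordering concern in one place instead of only in the documentation. |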
To directly address the case mentioned in the issue, should we perform a check before installing a component, verifying that its dependencies have been satisfied? In this case, we could check if the default storage class is present in the cluster and fail otherwise. Unfortunately, we currently have no mechanism for doing that. The easiest to implement would probably be to introduce an interface like:
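A rough sketch of what that interface and check could look like; the names, the method signature, and the use of client-go here are assumptions for illustration, not actual Lokomotive code:

```go
package components

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// withDependencies would be implemented by components that need certain
// cluster resources (e.g. a default StorageClass) to exist before install.
type withDependencies interface {
	checkDependencies(clientset *kubernetes.Clientset) error
}

// checkDefaultStorageClass fails when no StorageClass is marked as default,
// which is exactly the situation that leaves the Prometheus and Alertmanager
// PVCs Pending.
func checkDefaultStorageClass(clientset *kubernetes.Clientset) error {
	scs, err := clientset.StorageV1().StorageClasses().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return fmt.Errorf("listing StorageClasses: %w", err)
	}
	for _, sc := range scs.Items {
		if sc.Annotations["storageclass.kubernetes.io/is-default-class"] == "true" {
			return nil
		}
	}
	return fmt.Errorf("no default StorageClass found in the cluster")
}
```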
Then, before installing the component, we would try to cast it to this interface and, if the cast succeeds, run the check and fail early with a clear error. Then of course we would make the prometheus-operator component implement it. |
It seems to me that we are talking about 2 different problems here:
|
The following configuration works just fine, meaning it's fine if we need to wait ~15 minutes for Ceph to get ready:

```hcl
component "rook" {}

component "rook-ceph" {
  monitor_count = 3

  storage_class {
    enable  = true
    default = true
  }
}

component "prometheus-operator" {}
```
|
I'm confused: does the configuration mentioned in #433 (comment) work fine? If so, I think we can document that this ordering is needed for the components to work, and this can be closed, right? |
Yes, this configuration works fine. If we agree that the documentation is a sufficient "implementation", then yes, we just need to document it and move on. However, a better solution is to solve the 2 problems I mentioned above, which I plan to create separate issues for:
|
Let's close this now that we have
|
We even sort of have it documented: https://github.com/kinvolk/lokomotive/blob/master/docs/configuration-reference/components/prometheus-operator.md#prerequisites 😄 |
If you are using Rook Ceph as a storage provider and you install the Prometheus Operator before Rook Ceph, it is possible that the Prometheus and Alertmanager pods get stuck in the Pending state.

Why would you install Prometheus Operator before Rook?
If you want to enable monitoring on Rook Ceph (enable_monitoring = true), then you should install the Prometheus Operator first. If you don't do that and try to install Rook, it will fail saying that there is no resource of kind ServiceMonitor or PrometheusRules. These CRDs are installed by the Prometheus Operator.

Problem
When you install the Prometheus Operator before Rook Ceph, the pods for Prometheus and Alertmanager come up and wait for their respective PVCs to be bound. Rook Ceph can take ~15 minutes to become functional, and in the meantime the PVCs can give up trying to bind to a PV. In such cases, even after Rook is functional, the pods remain in the Pending state.
Workaround
Delete the Prometheus and Alertmanager StatefulSets together with their pending PVCs. This will trigger the Prometheus Operator to recreate the Prometheus and Alertmanager StatefulSets, and Rook will then satisfy the requests made by the freshly created PVCs.
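In practice this is a couple of kubectl delete commands; purely as an illustration, here is a minimal client-go sketch of the PVC-deletion part, assuming the monitoring components live in a namespace called monitoring and that a kubeconfig is available locally (both are assumptions):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from a local kubeconfig (the path is an assumption).
	config, err := clientcmd.BuildConfigFromFlags("", "./kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	ns := "monitoring" // assumed namespace of the prometheus-operator component
	ctx := context.TODO()

	// Find and delete the PVCs that are still Pending, so that the recreated
	// StatefulSets request fresh claims which Rook can now bind.
	pvcs, err := clientset.CoreV1().PersistentVolumeClaims(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pvc := range pvcs.Items {
		if pvc.Status.Phase != corev1.ClaimPending {
			continue
		}
		fmt.Println("deleting pending PVC", pvc.Name)
		if err := clientset.CoreV1().PersistentVolumeClaims(ns).Delete(ctx, pvc.Name, metav1.DeleteOptions{}); err != nil {
			panic(err)
		}
	}
}
```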
Permanent solution
Find a way to increase the timeout or retry count for the controller that ensures PVCs and PVs get bound.