Raise a metric whenever CAPI cannot see a remote cluster client #5510
Comments
/area health

Praise for the clear definition of the required metric.

/assign
All credit to @chrischdi for the following info 🙂 We have a metric, exposed by client-go through controller-runtime, which reports error responses in the client. It's called `rest_client_requests_total`.

The host IP here is the ControlPlaneEndpoint IP, i.e. from the Cluster object. It is currently exposed by CAPI's metrics endpoint (by default `:8080/metrics`). You can see it yourself (with your kubecontext set to the management cluster) using:

```
kubectl port-forward -n capi-system deployments/capi-controller-manager 8080:8080 &
curl localhost:8080/metrics | grep -i rest_client_requests_total
```

For now we don't have an automated way to link the Cluster IP to the Cluster in Prometheus. Once #6404 is added to the repo we can add a metric that links these two pieces of information together, giving remote client errors by cluster name/namespace. So this metric should be used to understand when the remote cluster is uncontactable. Does this suit your use case @perithompson ?
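As a rough sketch (not from the thread), an error-rate query over this metric could look like the following; the `code` label values used here (HTTP 5xx plus the `<error>` value client-go records for transport-level failures) are assumptions worth verifying against your own `/metrics` output:

```promql
# Per-host rate of failed client-go requests from the CAPI controllers.
# Assumes the usual client-go labels (host, code, method); "<error>" is
# the code recorded when the request never got an HTTP response.
sum by (host) (
  rate(rest_client_requests_total{code=~"5..|<error>"}[5m])
)
```

The `host` label here corresponds to the ControlPlaneEndpoint IP described above, so each series maps to one workload cluster endpoint.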
@killianmuldoon @chrischdi I wonder if it would be possible to extend this metric (and similar ones) with an additional label for the cluster. I think this would be a very nice improvement, as it makes the metrics easier to use and avoids joins in PromQL (to use those metrics you basically always have to join with another metric which has the IP).

In theory this would be possible by:

However, for us it would only be possible to implement this in three ways, which I think is too much effort or has too many cons:

Maybe 1. or 3. is an option for the future. The upside of investing the effort in controller-runtime is usually that a lot of folks (including other providers) can benefit from it.
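For illustration, the join described above might look like this in PromQL, assuming a hypothetical info-style metric `capi_cluster_info{name, namespace, host}` (not an existing CAPI metric) that maps each ControlPlaneEndpoint host to a cluster:

```promql
# Hypothetical join: attach cluster name/namespace to client-go error
# rates by matching on the host (ControlPlaneEndpoint) label.
sum by (name, namespace) (
  rate(rest_client_requests_total{code=~"5..|<error>"}[5m])
    * on (host) group_left (name, namespace)
  capi_cluster_info
)
```

This is exactly the kind of join a cluster label on the metric itself would make unnecessary.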
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale

Work is still ongoing - tracked in #6458 to make the UX for this better for CAPI

/triage accepted
@fabriziopandini: Guidelines: Please ensure that the issue body includes answers to the following questions:

For more details on the requirements of such an issue, please see here and ensure that they are met. If this request no longer meets these requirements, the label can be removed.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lifecycle stale
/lifecycle frozen
This should meanwhile be possible via PromQL queries with the custom resource metrics configuration. Example:

(This graph showed the error response rate during cluster creation for this cluster, per provider/controller.)

Based on this information it should be possible to create alerts :-)
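As a sketch of what such an alert could look like (the metric selector and the threshold here are assumptions, not taken from the thread):

```yaml
# Hypothetical Prometheus alerting rule; adjust the selector and
# threshold to match the metrics your setup actually exposes.
groups:
  - name: capi-remote-cluster
    rules:
      - alert: CAPIRemoteClusterUnreachable
        expr: |
          sum by (host) (
            rate(rest_client_requests_total{code=~"5..|<error>"}[5m])
          ) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CAPI cannot reach workload cluster at {{ $labels.host }}"
```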
Very nice!!

/priority important-longterm

/close

We already have that now, see Christian's example above.
@sbueringer: Closing this issue. In response to this:
User Story
As an operator, I would like CAPI to raise a metric whenever it cannot reach a remote cluster client, so that I can see a continuous rise in errors contacting the workload cluster and potentially raise an alert.
Detailed Description
Similar to kubernetes-sigs/cluster-api-provider-vsphere#1281, it would be useful to be able to spot when clusters are not contactable from the management cluster, so that we can monitor and alert, and pause reconciliation until the remote cluster is contactable again.
Anything else you would like to add:
This also relates to #5394, in that it asks for more information on the annotations to be used when cluster communication is interrupted.
/kind feature