-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Metricbeat] [Kubernetes] Fix metrics and dashboards from scheduler, proxy and controller manager #34161
[Metricbeat] [Kubernetes] Fix metrics and dashboards from scheduler, proxy and controller manager #34161
Conversation
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
I can't add comments to linter issues if the line is not part of a diff!!!! so I'll add my comment here Some issues:
I have googled this error but I can't find an easy way to fix it. @tetianakravchenko (since you added this line), any idea how to fix it? I tried to replicate what might happen if we remove the if condition. A for look with I also found this article where a |
…' into fix-kubernetes-deprecated-metrics
This reverts commit 4016a64.
There is something off with the |
I have some questions about some dashboards:
|
@constanca-m I assume this PR was raised to get up to date with latest release of Kubernetes. do you mind providing a bit more context in this PR (or link to an issue with the following details) about:
This is just so that we know in the future where those changes come from |
I called it Rules since both the visualizations relate to proxy rules:
Latency metrics include more visualizations.
No, it means that between time1 and time2, the CPU time increased one more second, not 100%. In the screenshot, having 0 and 1 is just a coincidence, but the increase is in seconds, not in %. Possible follow up question: why in s and not in %? There is not really an answer on which is better than the other, as they would end up saying the same thing: the increase on CPU time spent. If you meant the % of CPU being used, then there is no metric for that, that I know of. Metric being used for that visualization is
Yes, but I think variation is a more clear word for that visualization.
It means exactly that: the attempts counter rate split by result. If you check the metric being used,
Not in seconds. It is a unit. As in, a queue at 10am can have 10 people, and at 11am, can have 12 people. Then the variation is 2/h people, (12-10)/h.
Per definition: how many seconds of work has done that is in progress and hasn't been considered in the longest running processor.
Yes.
The answer to these questions is in the README file inside the controller manager/scheduler/proxy folder. |
thanks for the explanation. I am wondering if there is a way to add the above bit somehow to the dashboard. Or even just a link to docs with that explanation might be ok.
Ok, but it's still a bit confusing that a CPU usage is measured in seconds. Let me see if I understand correctly, let's say in the 10s between two different mesurements, that CPU spent 1s doing its job. From the name
Overall I'm trying to reduce the number of SDH that might come our way because a customer doesn't fully grasp the meaning of a visualisation. Please let me know if this makes sense. |
That is correct. The value can only increase, it is a counter.
This makes a lot of sense! I agree completely that the dashboards are not enough without more explanation, but the dashboards are also not the place to have them. So the solution would be, like you said, to create more documentation. I have talked about that with Miguel yesterday and the compromise is: I will create a draft for one of the components and then I will share it so we can see if it makes sense. The document will be added to our documentation (obs-docs) and it will be focused on the metrics we are using and what problems we can detect with them. For this, we will be using visualizations (like the counter rate for client error responses), and not the whole dashboard to save us the trouble of regular maintenance. It is not worth it to do that for all the components right now, without agreeing on a specific structure (it saves more time). I will try to do that in the upcoming days and share it with the team. Edit: I create an issue for this elastic/observability-docs#2474 |
@@ -106,6 +106,7 @@ https://github.com/elastic/beats/compare/v8.2.0\...main[Check the HEAD diff] | |||
- Fix kafka dashboard field names {pull}33555[33555] | |||
- Add tags to events based on parsed identifier. {pull}33472[33472] | |||
- Support Oracle-specific connection strings in SQL module {issue}32089[32089] {pull}32293[32293] | |||
- Remove deprecated metrics from controller manager, scheduler and proxy {pull}34161[34161] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you also add a new metrics, isn't it?
@@ -32,8 +32,8 @@ func TestEventMapping(t *testing.T) { | |||
ptest.TestMetricSet(t, "kubernetes", "controllermanager", | |||
ptest.TestCases{ | |||
{ | |||
MetricsFile: "./_meta/test/metrics.1.20", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dropping 1.20 since this k8s version is not supported anymore https://endoflife.date/kubernetes
@@ -34,20 +34,24 @@ var mapping = &prometheus.MetricsMapping{ | |||
"process_open_fds": prometheus.Metric("process.fds.open.count"), | |||
"process_max_fds": prometheus.Metric("process.fds.max.count"), | |||
"process_start_time_seconds": prometheus.Metric("process.started.sec"), | |||
// rest_client_request_duration_seconds buckets declared in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this comment not valid anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, I moved it to the README to be in the same file as the other resources
@@ -56,7 +60,6 @@ var mapping = &prometheus.MetricsMapping{ | |||
"host": prometheus.KeyLabel("host"), | |||
"name": prometheus.KeyLabel("name"), | |||
"zone": prometheus.KeyLabel("zone"), | |||
"url": prometheus.KeyLabel("url"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this fields was completely removed? from my understanding this fields was just recently introduced - https://github.com/elastic/beats/pull/32037/files#diff-dff4473d065113acb05a9f69b5854880be7761fe5a0267e99f869cb1503e9bde
is this field refer to rest-client metrics? kubernetes/component-base@fd4965b at some point url label was renamed to host
, that we already use above, so we drop it in favor of using host
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be wrong, but I checked the supported versions. The oldest release date for a supported version is 07 Dec 2021 for 1.23. The commit that replace url
by host
was done on the 19th of November of 2021. I made the assumption that 1.23 version didn't support url
anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from my understanding - 1.23 is still using url
- https://github.com/kubernetes/component-base/blob/release-1.23/metrics/prometheus/restclient/metrics.go
@gizas any thought on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the difference it's just a single field, I think it's more important to have back compatibility with 1.23 so I'll prefer to keep that field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems to be right @tetianakravchenko, but how do we make sure of that? Do we compare all commits between the release date and today?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems to be right @tetianakravchenko, but how do we make sure of that? Do we compare all commits between the release date and today?
@constanca-m I've selected the release-1.23
branch to see if this release still has url
or host
https://github.com/Mergifyio backport 8.6 |
✅ Backports have been created
|
…proxy and controller manager (#34161) * Update deprecated metrics from controller manager, scheduler and proxy * Update dashboards
What does this PR do?
This PR updates the metrics collected by controller manager, proxy and scheduler kubernetes endpoint. Dashboards are also updated.
Why is it important?
Scheduler metrics:
process_max_fds
added - interesting to relate to the already capturedprocess_open_fds
rest_client_request_duration_seconds
replaces deprecated metrichttp_request_duration_microseconds
rest_client_request_size_bytes
replaces deprecated metrichttp_request_size_bytes
rest_client_response_size_bytes
replaces deprecated metrichttp_requests_total
rest_client_requests_total
replaces deprecated metrichttp_requests_total
workqueue_longest_running_processor_seconds
addedworkqueue_unfinished_work_seconds
addedworkqueue_adds_total
addedworkqueue_depth
addedworkqueue_retries_total
addedscheduler_pending_pods
addedscheduler_preemption_victims
replaces deprecatedscheduler_pod_preemption_victims
scheduler_preemption_attempts_total
- interesting to relate toscheduler_preemption_victims
scheduler_scheduling_attempt_duration_seconds
replaces deprecated metricscheduler_e2e_scheduling_duration_seconds
scheduler_schedule_attempts_total
removed, can be similarly obtained throughscheduler_scheduling_attempt_duration_seconds
leader_election_master_status
addedqueue
,event
andprofile
fromscheduler_*
metrics addedname
fromworkqueue_*
andleader_*
metrics addedhandler
removed, not availableController manager metrics:
rest_client_response_size_bytes
rest_client_request_size_bytes
node_collector_evictions_total
replacesnode_collector_evictions_number
url
removed, not availableProxy metrics:
process_max_fds
- interesting to relate to the already capturedprocess_open_fds
rest_client_request_duration_seconds
replaces deprecated metrichttp_request_duration_microseconds
,rest_client_request_size_bytes
replaces deprecated metrichttp_request_size_bytes
rest_client_response_size_bytes
replaces deprecated metrichttp_response_size_bytes
handler
removed, not availableverb
added fromrest_client_*
(exceptrest_client_requests_total
) captured metricsscheduler_scheduling_attempt_duration_seconds
andrest_client_request_duration_seconds
metrics converted to microseconds for reason mentioned in #32486.Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Author's Checklist
How to test this PR locally
Use, for example, kind to have access to control plane components endpoint and set up the new dashboards.
Related issues
Screenshots
Proxy dashboard:
It is not great to have requests, client error and server error counter rates in different visualizations, but I couldn't find a way to put them together without the visualization becoming too confusing to read (following the assumption that there will be more than one host selected, which is seems more likely than not).
Scheduler dashboard:
Unlike proxy dashboard, requests and responses counter rates are put together as it is likely that there will be nodes that don't have control plane components, which makes it more readable.
Controller manager dashboard:
Followed by Workqueue, Process and HTTP Requests similar to scheduler dashboard.
Limitations
Limitation 1: If the user selects a period different than 10s (the default period as of today), the results will likely be incorrect.
Example: metricset.period 5s and minimum interval of 10s.
Data from
rest_client_requests_total
(counter) example:With an interval of 5s, the total of requests will be 11, 18 and 47 respectively.
With an interval of 10s, the total of requests will be 11 and 65 (18+47), because there is no way to do
last_value(sum...)
using a time series.Even in visualizations that don't have that problem, higher intervals might not reflect any spikes.
Limitation 2: Metrics that make use of average aren't enough to identify problems, but it is not always possible to use it with median since some metrics split values between buckets and there is no way to know their real values.