Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Metricbeat] [Kubernetes] Fix metrics and dashboards from scheduler, proxy and controller manager #34161

Merged
merged 19 commits into from
Jan 5, 2023
Merged

[Metricbeat] [Kubernetes] Fix metrics and dashboards from scheduler, proxy and controller manager #34161

merged 19 commits into from
Jan 5, 2023

Conversation

constanca-m
Copy link
Contributor

@constanca-m constanca-m commented Jan 3, 2023

What does this PR do?

This PR updates the metrics collected by controller manager, proxy and scheduler kubernetes endpoint. Dashboards are also updated.

Why is it important?

Scheduler metrics:

  • process_max_fds added - interesting to relate to the already captured process_open_fds
  • rest_client_request_duration_seconds replaces deprecated metric http_request_duration_microseconds
  • rest_client_request_size_bytes replaces deprecated metric http_request_size_bytes
  • rest_client_response_size_bytes replaces deprecated metric http_requests_total
  • rest_client_requests_total replaces deprecated metric http_requests_total
  • workqueue_longest_running_processor_seconds added
  • workqueue_unfinished_work_seconds added
  • workqueue_adds_total added
  • workqueue_depth added
  • workqueue_retries_total added
  • scheduler_pending_pods added
  • scheduler_preemption_victims replaces deprecated scheduler_pod_preemption_victims
  • scheduler_preemption_attempts_total - interesting to relate to scheduler_preemption_victims
  • scheduler_scheduling_attempt_duration_seconds replaces deprecated metric scheduler_e2e_scheduling_duration_seconds
  • scheduler_schedule_attempts_total removed, can be similarly obtained through scheduler_scheduling_attempt_duration_seconds
  • leader_election_master_status added
  • queue, event and profile from scheduler_* metrics added
  • name from workqueue_* and leader_* metrics added
  • handler removed, not available

Controller manager metrics:

  • rest_client_response_size_bytes
  • rest_client_request_size_bytes
  • node_collector_evictions_total replaces node_collector_evictions_number
  • url removed, not available

Proxy metrics:

  • process_max_fds - interesting to relate to the already captured process_open_fds
  • rest_client_request_duration_seconds replaces deprecated metric http_request_duration_microseconds,
  • rest_client_request_size_bytes replaces deprecated metric http_request_size_bytes
  • rest_client_response_size_bytes replaces deprecated metric http_response_size_bytes
  • handler removed, not available
  • verb added from rest_client_* (except rest_client_requests_total) captured metrics

scheduler_scheduling_attempt_duration_seconds and rest_client_request_duration_seconds metrics converted to microseconds for reason mentioned in #32486.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Metrics are relevant
  • The purpose of the dashboard is clear
  • The dashboards highlight possible problems

How to test this PR locally

  1. Build the metricbeat image using this PR branch.
  2. Deploy metricbeat with it and proxy, scheduler and controller manager metricsets enabled:
    - module: kubernetes
      metricsets:
        - proxy
      period: 10s
      host: ${NODE_NAME}
      hosts: ["localhost:10249"]
    
    - module: kubernetes
      metricsets:
        - controllermanager
      period: 10s
      host: ${NODE_NAME}
      hosts: ['https://0.0.0.0:10257']
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      ssl.verification_mode: "none"
    
    - module: kubernetes
      metricsets:
        - scheduler
      period: 10s
      host: ${NODE_NAME}
      hosts: ['https://0.0.0.0:10259']
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      ssl.verification_mode: "none"

Use, for example, kind to have access to control plane components endpoint and set up the new dashboards.

Related issues

Screenshots

Proxy dashboard:
image
image
image

It is not great to have requests, client error and server error counter rates in different visualizations, but I couldn't find a way to put them together without the visualization becoming too confusing to read (following the assumption that there will be more than one host selected, which is seems more likely than not).

Scheduler dashboard:
image
image
image
image

Unlike proxy dashboard, requests and responses counter rates are put together as it is likely that there will be nodes that don't have control plane components, which makes it more readable.

Controller manager dashboard:
image
Followed by Workqueue, Process and HTTP Requests similar to scheduler dashboard.

Limitations

Limitation 1: If the user selects a period different than 10s (the default period as of today), the results will likely be incorrect.

Example: metricset.period 5s and minimum interval of 10s.
Data from rest_client_requests_total (counter) example:

Request data 01:00:00h 01:00:05h 01:00:10h
code="200",host="172.18.0.5:6443",method="GET" 10 15 40
code="201",host="172.18.0.5:6443",method="POST" 1 3 7
Total 11 18 47

With an interval of 5s, the total of requests will be 11, 18 and 47 respectively.
With an interval of 10s, the total of requests will be 11 and 65 (18+47), because there is no way to do last_value(sum...) using a time series.
Even in visualizations that don't have that problem, higher intervals might not reflect any spikes.

Limitation 2: Metrics that make use of average aren't enough to identify problems, but it is not always possible to use it with median since some metrics split values between buckets and there is no way to know their real values.

@constanca-m constanca-m added bug Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team labels Jan 3, 2023
@constanca-m constanca-m self-assigned this Jan 3, 2023
@constanca-m constanca-m requested a review from a team as a code owner January 3, 2023 14:59
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jan 3, 2023
@mergify
Copy link
Contributor

mergify bot commented Jan 3, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @constanca-m? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@constanca-m constanca-m requested review from a team, gsantoro and devamanv and removed request for a team January 3, 2023 15:22
@constanca-m constanca-m added the backport-v8.6.0 Automated backport with mergify label Jan 3, 2023
@elasticmachine
Copy link
Collaborator

elasticmachine commented Jan 3, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-01-04T16:54:28.976+0000

  • Duration: 59 min 47 sec

Test stats 🧪

Test Results
Failed 0
Passed 3930
Skipped 876
Total 4806

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@gsantoro
Copy link
Contributor

gsantoro commented Jan 4, 2023

I can't add comments to linter issues if the line is not part of a diff!!!! so I'll add my comment here

Some issues:

  1. I can't replicate a linter issue invalid operation: cannot compare m.clusterMeta != nil (operator != not defined on untyped nil) (typecheck)

I have googled this error but I can't find an easy way to fix it.

@tetianakravchenko (since you added this line), any idea how to fix it?

I tried to replicate what might happen if we remove the if condition. A for look with range over an empty map (code inside DeepUpdate) should just work. Meaning, that it won't error and it won't execute any code (as expected).

I also found this article where a Better solution propose to use reflection to check for null in case the type is a map.
any suggestions?

@constanca-m constanca-m requested a review from a team as a code owner January 4, 2023 13:26
@constanca-m constanca-m requested a review from a team as a code owner January 4, 2023 15:39
@constanca-m constanca-m requested review from cmacknz and faec and removed request for a team January 4, 2023 15:39
@gsantoro
Copy link
Contributor

gsantoro commented Jan 4, 2023

There is something off with the darwin linter since all those declared but not used (typecheck) at metricbeat/module/kubernetes/util/kubernetes.go don't make sense. many of those complain about a variable that is not used while the variable is actually used.

@cmacknz cmacknz removed request for cmacknz and faec January 4, 2023 19:11
@gsantoro
Copy link
Contributor

gsantoro commented Jan 5, 2023

I have some questions about some dashboards:

  • proxy dashboard
    • why is it called Rules for latency metrics?
    • process
      • cpu usage increase over time in s? is it between 0 and 1 as in 0-100%?
      • resident memory variation in KB/s? is that a rate
  • scheduler dashboard
    • scheduling
      • attempts counter rate per result?? what does it mean?
    • workqueue
      • workqueue depth variation in s?
      • current unfinished work in s?
    • process
      • same questions as proxy dashboard
    • http requests
      • requests and responses is that a rate?

@gsantoro
Copy link
Contributor

gsantoro commented Jan 5, 2023

@constanca-m I assume this PR was raised to get up to date with latest release of Kubernetes. do you mind providing a bit more context in this PR (or link to an issue with the following details) about:

  • which K8s release we are trying to get up to date
  • where those metrics changes come from? I suppose the new release might have a list of added/removed metrics.

This is just so that we know in the future where those changes come from

@constanca-m
Copy link
Contributor Author

  * why is it called `Rules` for latency metrics?

I called it Rules since both the visualizations relate to proxy rules:

  1. SyncProxyRulesLatency is the latency of one round of kube-proxy syncing proxy rules.
  2. NetworkProgrammingLatency is defined as the time it took to program the network - from the time the service or pod has changed to the time the change was propagated and the proper kube-proxy rules were synced.

Latency metrics include more visualizations.

  * process       
    * `cpu usage increase over time` in s? is it between 0 and 1 as in 0-100%?

No, it means that between time1 and time2, the CPU time increased one more second, not 100%. In the screenshot, having 0 and 1 is just a coincidence, but the increase is in seconds, not in %. Possible follow up question: why in s and not in %? There is not really an answer on which is better than the other, as they would end up saying the same thing: the increase on CPU time spent. If you meant the % of CPU being used, then there is no metric for that, that I know of.

Metric being used for that visualization is process_cpu_seconds_total: total user and system CPU time spent in seconds.

    * `resident memory variation` in KB/s? is that a rate

Yes, but I think variation is a more clear word for that visualization.

* scheduler dashboard  
  * scheduling      
    * `attempts counter rate per result`?? what does it mean?

It means exactly that: the attempts counter rate split by result. If you check the metric being used, scheduler_scheduling_attempt_duration_seconds, you can see that is broken down by profile and result (code here). Possible results are scheduled, unschedulable and error (see here).

  * workqueue        
    * `workqueue depth variation` in s?

Not in seconds. It is a unit. As in, a queue at 10am can have 10 people, and at 11am, can have 12 people. Then the variation is 2/h people, (12-10)/h.

    * `current unfinished work` in s?

Per definition: how many seconds of work has done that is in progress and hasn't been considered in the longest running processor.

  * http requests
    * `requests and responses` is that a rate?

Yes.

which K8s release we are trying to get up to date
where those metrics changes come from? I suppose the new release might have a list of added/removed metrics.

The answer to these questions is in the README file inside the controller manager/scheduler/proxy folder.

@gsantoro

@gsantoro
Copy link
Contributor

gsantoro commented Jan 5, 2023

@constanca-m

I called it Rules since both the visualizations relate to proxy rules:
SyncProxyRulesLatency is the latency of one round of kube-proxy syncing proxy rules.
NetworkProgrammingLatency is defined as the time it took to program the network - from the time the service or pod has changed to the time the change was propagated and the proper kube-proxy rules were synced.

thanks for the explanation. I am wondering if there is a way to add the above bit somehow to the dashboard. Or even just a link to docs with that explanation might be ok.

No, it means that between time1 and time2, the CPU time increased one more second, not 100%. In the screenshot, having 0 and 1 is just a coincidence, but the increase is in seconds, not in %. Possible follow up question: why in s and not in %? There is not really an answer on which is better than the other, as they would end up saying the same thing: the increase on CPU time spent. If you meant the % of CPU being used, then there is no metric for that, that I know of.

Ok, but it's still a bit confusing that a CPU usage is measured in seconds. Let me see if I understand correctly, let's say in the 10s between two different mesurements, that CPU spent 1s doing its job. From the name process_cpu_seconds_total it's a bit clearer now that the time is the total time in seconds but it still doesn't clarify what's the time1 and time2 time range between two measurements. Is the time range up to us and the frequency of our Elastic probe? Do we do any extra processing on top to get from a seconds_total to a increase over time? Increase over time seems to suggest to me that it can only increase compared to the previous period. Maybe I am being pedantic but without any extra doc/links it is a bit unclear still.

....
for the rest of the answers, thanks for the explanation. it makes much more sense now. But I am still wondering if our visualisation names are enough to convey the meaning of a visualisation or if we should link them to docs with names of raw metrics used and their docs.

Overall I'm trying to reduce the number of SDH that might come our way because a customer doesn't fully grasp the meaning of a visualisation. Please let me know if this makes sense.

@constanca-m
Copy link
Contributor Author

constanca-m commented Jan 5, 2023

Ok, but it's still a bit confusing that a CPU usage is measured in seconds. Let me see if I understand correctly, let's say in the 10s between two different mesurements, that CPU spent 1s doing its job. From the name process_cpu_seconds_total it's a bit clearer now that the time is the total time in seconds but it still doesn't clarify what's the time1 and time2 time range between two measurements. Is the time range up to us and the frequency of our Elastic probe? Do we do any extra processing on top to get from a seconds_total to a increase over time? Increase over time seems to suggest to me that it can only increase compared to the previous period. Maybe I am being pedantic but without any extra doc/links it is a bit unclear still.

That is correct. The value can only increase, it is a counter.

Overall I'm trying to reduce the number of SDH that might come our way because a customer doesn't fully grasp the meaning of a visualisation. Please let me know if this makes sense.

This makes a lot of sense! I agree completely that the dashboards are not enough without more explanation, but the dashboards are also not the place to have them. So the solution would be, like you said, to create more documentation. I have talked about that with Miguel yesterday and the compromise is: I will create a draft for one of the components and then I will share it so we can see if it makes sense. The document will be added to our documentation (obs-docs) and it will be focused on the metrics we are using and what problems we can detect with them. For this, we will be using visualizations (like the counter rate for client error responses), and not the whole dashboard to save us the trouble of regular maintenance. It is not worth it to do that for all the components right now, without agreeing on a specific structure (it saves more time). I will try to do that in the upcoming days and share it with the team.

Edit: I create an issue for this elastic/observability-docs#2474

@constanca-m constanca-m merged commit 7dc5e22 into elastic:main Jan 5, 2023
mergify bot pushed a commit that referenced this pull request Jan 5, 2023
…proxy and controller manager (#34161)

* Update deprecated metrics from controller manager, scheduler and proxy

* Update dashboards

(cherry picked from commit 7dc5e22)
@@ -106,6 +106,7 @@ https://github.com/elastic/beats/compare/v8.2.0\...main[Check the HEAD diff]
- Fix kafka dashboard field names {pull}33555[33555]
- Add tags to events based on parsed identifier. {pull}33472[33472]
- Support Oracle-specific connection strings in SQL module {issue}32089[32089] {pull}32293[32293]
- Remove deprecated metrics from controller manager, scheduler and proxy {pull}34161[34161]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you also add a new metrics, isn't it?

@@ -32,8 +32,8 @@ func TestEventMapping(t *testing.T) {
ptest.TestMetricSet(t, "kubernetes", "controllermanager",
ptest.TestCases{
{
MetricsFile: "./_meta/test/metrics.1.20",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropping 1.20 since this k8s version is not supported anymore https://endoflife.date/kubernetes

@@ -34,20 +34,24 @@ var mapping = &prometheus.MetricsMapping{
"process_open_fds": prometheus.Metric("process.fds.open.count"),
"process_max_fds": prometheus.Metric("process.fds.max.count"),
"process_start_time_seconds": prometheus.Metric("process.started.sec"),
// rest_client_request_duration_seconds buckets declared in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment not valid anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, I moved it to the README to be in the same file as the other resources

@@ -56,7 +60,6 @@ var mapping = &prometheus.MetricsMapping{
"host": prometheus.KeyLabel("host"),
"name": prometheus.KeyLabel("name"),
"zone": prometheus.KeyLabel("zone"),
"url": prometheus.KeyLabel("url"),
Copy link
Contributor

@tetianakravchenko tetianakravchenko Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fields was completely removed? from my understanding this fields was just recently introduced - https://github.com/elastic/beats/pull/32037/files#diff-dff4473d065113acb05a9f69b5854880be7761fe5a0267e99f869cb1503e9bde

is this field refer to rest-client metrics? kubernetes/component-base@fd4965b at some point url label was renamed to host, that we already use above, so we drop it in favor of using host, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong, but I checked the supported versions. The oldest release date for a supported version is 07 Dec 2021 for 1.23. The commit that replace url by host was done on the 19th of November of 2021. I made the assumption that 1.23 version didn't support url anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the difference it's just a single field, I think it's more important to have back compatibility with 1.23 so I'll prefer to keep that field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems to be right @tetianakravchenko, but how do we make sure of that? Do we compare all commits between the release date and today?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gsantoro Tanya doesn't intend to remove that field, but to use the same approach we use in here - two fields for the same label.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems to be right @tetianakravchenko, but how do we make sure of that? Do we compare all commits between the release date and today?

@constanca-m I've selected the release-1.23 branch to see if this release still has url or host

@constanca-m
Copy link
Contributor Author

https://github.com/Mergifyio backport 8.6

@mergify
Copy link
Contributor

mergify bot commented Jan 12, 2023

backport 8.6

✅ Backports have been created

constanca-m pushed a commit that referenced this pull request Jan 12, 2023
…proxy and controller manager (#34161) (#34188)

* Update deprecated metrics from controller manager, scheduler and proxy

* Update dashboards

(cherry picked from commit 7dc5e22)

Co-authored-by: Constança Manteigas <constanca-m@elastic.co>
@constanca-m constanca-m deleted the fix-kubernetes-scheduler-metrics branch January 12, 2023 14:47
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
…proxy and controller manager (#34161)

* Update deprecated metrics from controller manager, scheduler and proxy

* Update dashboards
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants