Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WINC-544: Enhancement proposal for monitoring Windows Nodes #647

Merged

Conversation

VaishnaviHire
Copy link
Contributor

@VaishnaviHire VaishnaviHire commented Feb 11, 2021

Enhancement proposal for enabling monitoring on
Windows nodes created by Windows Machine Config Operator(WMCO).

@VaishnaviHire VaishnaviHire changed the title Enhancement proposal for monitoring [WIP]Enhancement proposal for monitoring Feb 11, 2021
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 11, 2021
@VaishnaviHire VaishnaviHire force-pushed the windows_monitoring branch 5 times, most recently from 3550806 to 852b693 Compare February 12, 2021 21:26
@russellb
Copy link
Member

I would suggest updating the PR title (and commit message) to reflect that this is about monitoring windows nodes to help make it more clear at a glance what this is about

@VaishnaviHire VaishnaviHire changed the title [WIP]Enhancement proposal for monitoring [WIP]Enhancement proposal for monitoring Windows Nodes Feb 16, 2021
@VaishnaviHire VaishnaviHire force-pushed the windows_monitoring branch 5 times, most recently from 946e3dc to d3cbd51 Compare February 16, 2021 21:30
@VaishnaviHire VaishnaviHire changed the title [WIP]Enhancement proposal for monitoring Windows Nodes WINC-544: Enhancement proposal for monitoring Windows Nodes Feb 16, 2021
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 16, 2021
@VaishnaviHire
Copy link
Contributor Author

/cc @sdodson

@VaishnaviHire
Copy link
Contributor Author

/cc @openshift/openshift-team-windows-containers

@VaishnaviHire
Copy link
Contributor Author

/cc @simonpasquier

This enhancement contains details regarding creating unified monitoring interface for Windows and Linux.

@VaishnaviHire
Copy link
Contributor Author

/cc @spadgett

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the approach of maintaining a custom endpoint object sounds ok to me. It can be noted that it is similar to what the Prometheus operator is doing with the kube-system/kubelet object.

@s-urbaniak for awareness.


### Future Plans

As we move forward, our plan to display monitoring graphs is to create a [common
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we want to offload the recording rules to the cluster-monitoring-operator (CMO) repository. My thinking was more that the Linux-specific rules would live in CMO while the Windows-specific rules would live in WMCO. Otherwise I fear that ownership will be diluted.
Another option would be to move all recording rules to the console operator but again my intuition is that the rules should stay close to the things that expose the metrics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how distributed ownership of recording rules will work if we are thinking of creating a unified interface. My opinion on this is that WMCO will be responsible for making sure that we receive metrics from windows nodes and CMO handles the complexity of grouping these(Linux + Windows) metrics to make the queries platform-independent. This will help in reducing the complexity on the Console side as well. See

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with pushing down all recording rules to CMO is that CMO doesn't have a way to validate that Windows-specific rules are correct. We should probably start by listing all metrics that are used by the console and identifying the gaps from the Windows side.

Also what is the general testing strategy to ensure that the console works properly for Windows?
@spadgett, I think that there are e2e console tests validating that the dashboards display correct result?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably start by listing all metrics that are used by the console and identifying the gaps from the Windows side.

@VaishnaviHire please look into this.

Also what is the general testing strategy to ensure that the console works properly for Windows?

We were discussing this as a console change broke the filesystem graphs for Windows nodes: BZ 1930347. The approach we were thinking about was to add a Windows e2e to the console repo or add console tests for validating the dashboards to the WMCO repo depending on whatever is easier. We will defer to @spadgett on what the better approach is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonpasquier I have included a list of metrics used for displaying console graphs and the corresponding Windows metrics in this doc. Please take a look. I have also added the cases where the existing queries won`t work for Windows.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to reflect Windows-specific rules will be added in WMCO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonpasquier Updated the future plans section to reflect the discussion from our meeting. Please review.

Comment on lines 156 to 192
* Since the windows-exporter is not running as a [pod](#justification), the
endpoint is not secure. The reason for this is when running inside a pod, we
can use CA signer for providing TLS cert/key to the service for
authentication. We plan to leverage windows_exporter's support for `https`
configuration. WMCO will be responsible for adding [web config](https://github.com/prometheus/exporter-toolkit/blob/master/docs/web-configuration.md)
for TLS. This will ensure that the metrics Endpoint will be able to
authenticate the requests.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the exporter toolkit supports TLS for client authentication, it would only validate that the certificate presented by the client is signed by the trusted certificate authority. For instance, it won't be able to validate the certificate subject. This is something I plan to bring upstream because we might need it for node_exporter too.

@VaishnaviHire VaishnaviHire force-pushed the windows_monitoring branch 3 times, most recently from 86f3541 to fdda97f Compare March 10, 2021 05:11
[support](https://github.com/prometheus-operator/prometheus-operator/issues/3862)
for EndpointSlices object.

#### Securing windows_exporter endpoint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't say how the windows exporter would get a certificate and key materials if it is configured with TLS.

we have a consistent user experience for monitoring across Linux and Windows.
* In the cases where `metric labels` are equivalent, we plan to relabel the
Windows metrics to align with the Linux metrics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still a gap with the "USE Method / Node" dashboard in the monitoring menu.

image

There's already a dashboard for Windows in kubernetes-monitoring/kubernetes-mixin. IIUC it would be technically possible for WMCO to ship this dashboard as a configmap in the openshift-config-managed namespace with the console.openshift.io/dashboard=true label (the logging operator already does that) but it would require additional permissions for WMCO.

cc @spadgett to assess whether adding dashboards in the openshift-config-managed namespace is supported in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spadgett could you please take that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonpasquier Yes, you can definitely add dashboards to openshift-config-managed here. I think it makes a lot of sense, particularly if the Home -> Overview page isn't including metrics for Windows nodes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a sub-section to address the grafana dashboard in the future plans section, @simonpasquier @spadgett PTAL.

Windows metrics to align with the Linux metrics.

##### Node Metrics

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add another column explaining how each metric is going to be used by the console eventually.

In details:

  • node_memory_MemTotal_bytes / windows_cs_physical_memory_bytes => WMCO renames the Windows metric to node_memory_MemTotal_bytes because there are no label differences
  • node_memory_MemAvailable_bytes / windows_memory_available_bytes => WMCO renames the Windows metric to node_memory_MemTotal_bytes because there are no label differences
  • node_filesystem_size_bytes & node_filesystem_free_bytes / windows_logical_disk_size_bytes & windows_logical_disk_free_bytes => today the console uses these metrics to report the storage capacity (example), 2 new recording rules should be created by CMO and WMCO which would aggregate metrics per node/instance. Once available the console can use them instead of the custom queries.
    • instance:filesystem_size_bytes:sum
    • instance:filesystem_used_bytes:sum
  • node_cpu_seconds_total => console uses metrics from recording rules (e.g. instance:node_cpu:rate:sum), WMCO needs to produce similar metrics.

| node_filesystem_free_bytes | windows_logical_disk_free_bytes | Missing Label: device, mountpoint, fstype) Additional label : (volume) |
| node_cpu_seconds_total | windows_cpu_time_total | Missing Label : cpu Additional Label: core |

##### Pod Metrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, we need to identify where recording rules would need to be created/updated vs. where WMCO can modify the metrics to look like metrics from Linux nodes.

@mansikulkarni96
Copy link
Member

@simonpasquier updated PR PTAL.

@mansikulkarni96 mansikulkarni96 force-pushed the windows_monitoring branch 3 times, most recently from 543cebd to 12004c8 Compare March 29, 2021 17:44
@mansikulkarni96 mansikulkarni96 force-pushed the windows_monitoring branch 2 times, most recently from 19af7be to ce41d4a Compare April 5, 2021 20:33
@aravindhp
Copy link
Contributor

@simonpasquier @spadgett please review

Copy link
Member

@spadgett spadgett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! The console changes LGTM. I'll let @simonpasquier and others weigh in on the other areas.

/approve


As part of this enhancement, we do not plan to do the following:
* Integrating windows_exporter with cluster monitoring operator
* Ship Grafana dashboards for Windows Nodes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we do plan to do this now? (Using console Monitoring -> Dashboards.)

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: spadgett

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 21, 2021
@aravindhp
Copy link
Contributor

@simonpasquier we are hoping to close this out at the end of this sprint. Please let us know if more changes are needed.

@aravindhp
Copy link
Contributor

/lgtm

@simonpasquier looks all open items have been addressed by @mansikulkarni96. Please comment if you want any changes and we will handle that with a follow up PR.

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. and removed lgtm Indicates that a PR is ready to be merged. labels Apr 30, 2021
Enhancement proposal for enabling monitoring on
Windows nodes created by Windows Machine Config Operator(WMCO).
@aravindhp
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 3, 2021
@openshift-merge-robot openshift-merge-robot merged commit 0e2f247 into openshift:master May 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants