-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changed collection method of the second counter value in the Windows Perfmon module of Metricbeat #32305
Conversation
💚 CLA has been signed |
cla/check |
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
This pull request is now in conflicts. Could you fix it? 🙏
|
/test |
This pull request is now in conflicts. Could you fix it? 🙏
|
/test |
The tests won't run until the conflict is resolved it seems. |
/test |
@jorgeta Thank you so much for this contribution and the super detailed explanation! I started to look at your code, but the PR is bring many commits that were not made by you, making it pretty hard to review/understand what has changed. Could you look into it and fix your branch/PR to contain only your commits? You don't need to worry about keeping your PR up-to-date with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting the changes I explained on my previous commit.
This pull request is now in conflicts. Could you fix it? 🙏
|
accidentally added another PR's changelog description Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
This pull request does not have a backport label.
|
/test |
Some of the lint/check steps are failing for metricbeat: https://beats-ci.elastic.co/blue/organizations/jenkins/Beats%2Fbeats/detail/PR-32305/23/pipeline
|
Is this related to the importing of the fmt package? I tried to make some changes unrelated to the change of the collection method, so that the github actions linting would be approved - but maybe I should leave that? Although then the checks would fail in several places. |
It looks like it is just a file that needs to be formatted, if you run diff --git a/metricbeat/module/windows/perfmon/reader.go b/metricbeat/module/windows/perfmon/reader.go
index 04063eb084..645938f882 100644
--- a/metricbeat/module/windows/perfmon/reader.go
+++ b/metricbeat/module/windows/perfmon/reader.go
@@ -24,8 +24,8 @@ import (
"fmt"
"regexp"
"strings"
- "unicode"
"time"
+ "unicode"
"github.com/elastic/beats/v7/metricbeat/helper/windows/pdh"
@@ -132,7 +132,7 @@ func (re *Reader) getValues() (map[string][]pdh.CounterValue, error) {
if err != nil {
return nil, err
}
-
+
// Collect the displayable value.
val, err = re.query.GetFormattedCounterValues()
if err != nil { |
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit.
Minor change to error string Co-authored-by: Dan Kortschak <90160302+efd6@users.noreply.github.com>
@jorgeta, we had this implementation for some time and we switched last year to the Have you tested this implementation while configuring a higher number of counters? Are you getting a constant results? |
Hi, @narph - thanks for your feedback. We have tested the implementation for a different amount of counters; from having just the one, and up to 90 "windows.perfmon.*" counters, also along with additional 54 "system.*" (that is, tests have collected between 1 and 144 counters in total). I could try to test with a larger number - if this is not considered large. So, as far as I can see, the implementation in this PR is still different from the one that was before |
This pull request is now in conflicts. Could you fix it? 🙏
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm no Windows expert, but it looks good to me. Just the small changelog fix that's needed.
Co-authored-by: Tiago Queiroz <contato@tiago.eti.br>
@fearful-symmetry can you review this as well? @jorgeta sorry this is taking so long, this is a tricky piece of code with a history of issues as I'm sure you know. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good from my side, but, please, wait for @fearful-symmetry or someone with more expertise on this area than me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I don't have much windows knowledge, but just based on the bits I do understand, this LGTM.
Thanks everyone, merging this with 8.5 as the target version. We can have our QA team check this on a few different versions of Windows as a final sanity check. |
Thanks everyone! @cmacknz Would it be possible to backport the changes to version 7? Upgrading to a new version in our company is rather a huge project that is somewhat further down the line, and we would hope to have this fix available as soon as possible. |
Sure, this is effectively a bug fix so we can backport to 7.17 |
…Perfmon module of Metricbeat (#32305) * changed collection method of the second counter value required to create a displayable value Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co> Co-authored-by: Dan Kortschak <90160302+efd6@users.noreply.github.com> Co-authored-by: Tiago Queiroz <contato@tiago.eti.br> (cherry picked from commit f0bc6c8) # Conflicts: # metricbeat/module/windows/perfmon/reader.go
…er value in the Windows Perfmon module of Metricbeat (#32992) * Changed collection method of the second counter value in the Windows Perfmon module of Metricbeat (#32305) * changed collection method of the second counter value required to create a displayable value Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co> Co-authored-by: Dan Kortschak <90160302+efd6@users.noreply.github.com> Co-authored-by: Tiago Queiroz <contato@tiago.eti.br> (cherry picked from commit f0bc6c8) # Conflicts: # metricbeat/module/windows/perfmon/reader.go * Resolve merge conflicts. Co-authored-by: Jørgen Taule <36794556+jorgeta@users.noreply.github.com> Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
…Perfmon module of Metricbeat (#32305) * changed collection method of the second counter value required to create a displayable value Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co> Co-authored-by: Dan Kortschak <90160302+efd6@users.noreply.github.com> Co-authored-by: Tiago Queiroz <contato@tiago.eti.br>
What does this PR do?
There are two ways of collecting performance counters in Windows, as explained here.
The current Metricbeat implementation uses the one where PdhCollectQueryDataEx is called to retrieve the second raw value that is required to calculate the displayable value.
This PR changes this collection method to a simpler one, the one that calls PdhCollectQueryData twice, with a sleep of one second in between.
The function
randSeq
was implemented in an earlier version in order to support the current implementation, but is not used anymore. The deletion of this function is therefore just considered as cleanup.Why is it important?
Collection of the counter "Processor(_Total) % Processor Time" gave false 100 % spikes. Testing on different computers and servers showed that when the sampling frequency, or period, was larger than one second, one would obtain these false 100 % values. They have been checked for being false by comparing with Windows own Performance Monitor. When running several instances of Metricbeat alongside each other, spikes are measured for the instances
This is part of one configuration file used when testing, where the sampling frequency/period is set to 2 seconds:
In the two figures below (Figfure (1) and Figure (2)), the current implementation (named "thread"), is compared with the implementation of this PR (named "sleep"), for different sampling frequencies/periods. The names "sleep1s" and "thread1s" indicates that the collection interval, the time between doing a PdhCollectQueryData call, is hardcoded to 1s in the version.
In Figure (1), each point represents a maximum value over a period of 60 seconds, and explains why the different versions with different frequencies show different values, except for sleep1s-freq1s and thread1s-freq1s, which provide almost identical values (as expected). What this figure shows is that the thread1s-freq2s and thread1s-freq10s produce false 100 % spikes.
Zooming in on one of the 100 % spikes has been done in Figure (2), where thread1s-freq10s provides one 100 % value that is most certainly false. In addition, thread1s-freq2s produces 0 % values that does not seem correct either.
Figure (1)
Figure (2)
The figures above show tests performed using Metricbeat 7.17.2, but they have also been reproduced using Metricbeat 8.3.1. They have also been reproduced on different computers and servers, a database server for example. The code in question is the same in both versions, so this is also expected.
As one also can see from the figures above, and also generally in all the experiments done, the sleep method always produces values that seem correct and that conform with what Windows Performance Monitor is also observing. The problem only arises when the sampling frequency/period is larger than the collection interval (tests have also been made with larger collection intervals, versions of Metricbeat called "sleep2s" and "thread2s", etc.), for the current collection method. It therefore seems like there is some bug concerning the sampling frequency/period of Metricbeat in combination with the creating of the thread. This PR does not explain what this issue is precisely, but simply provides an alternative where this problem does not arise.
In order to get a closer look at what was going on, tests were run where the raw values were fetched after calling the PdhCollectQueryData function. When using either the sleep method, or the thread method with a sampling frequency/period that is not larger than the collection interval (1 second), these raw values were collected at the correct times, and the values could be used to verify the fetched displayable value. Below are some examples of logs of two consecutive samples for these cases. An entry saying "Raw values: \Processor(_Total)\% Processor Time:[{_Total 5870585175781 133017943902193928 { 0}}]])", means that the FirstValue of the raw value is 5870585175781 and the SecondValue is 133017943902193928. The displayable value is calculated using taking two consecutive raw values, and then take one minus the difference of the FirstValues divided by the difference of the SecondValues, and at the end multiply by 100 % to get the percent value. The formula is also found here. In the below examples, this adds up in all cases.
sleep1s-freq1s
sleep1s-freq10s
thread1s-freq1s
It is, however, not the case when using the current implementation (thread method), with a sampling frequency. Both raw value samples are identical, and hence one does not obtain the fetched displayable value when plugging them into the formula. Here is one example of such logs:
thread1s-freq10s
Looking into the implementation history, there does not seem to be any explanation of why the current collection method was chosen - especially with this PR's simpler alternative.
Checklist
I have made corresponding changes to the documentationI have made corresponding change to the default configuration filesI have added tests that prove my fix is effective or that my feature worksCHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Author's Checklist
How to test this PR locally
Related issues
Use cases
Screenshots
Logs