-
Notifications
You must be signed in to change notification settings - Fork 314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add master node telemetry device #1330
Conversation
When running benchmarks on clusters with multiple nodes, the master node could perform worse due to the additional responsibilities it needs to handle. To identify this, we add a telemetry device to record the master node. We don't expect this to change during a typical benchmark, which is why the default sample interval is 30s. To compensate this large sample interval, we also record that information on benchmark start/end.
tests/telemetry_test.py
Outdated
t.on_benchmark_stop() | ||
|
||
# the two clients record start and stop | ||
assert record_count == 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what it's worth I tried using @mock.patch
and call_count
here but I was only seeing 2 calls, not 4. I still don't know why. I ended up switching to the pytest style which is more verbose but less magical.
@sethmlarson mentioned to me that we can use |
IMO that |
esrally/driver/driver.py
Outdated
@@ -610,6 +610,7 @@ def prepare_telemetry(self, es, enable): | |||
telemetry.TransformStats(telemetry_params, es, self.metrics_store), | |||
telemetry.SearchableSnapshotsStats(telemetry_params, es, self.metrics_store), | |||
telemetry.DataStreamStats(telemetry_params, es, self.metrics_store), | |||
telemetry.MasterNodeStats(telemetry_params, es, self.metrics_store), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, if you want this to be an internal telemetry device that runs by default then you can just pass es_default
in as below:
rally/esrally/driver/driver.py
Lines 601 to 603 in 4355771
telemetry.ExternalEnvironmentInfo(es_default, self.metrics_store), | |
telemetry.ClusterEnvironmentInfo(es_default, self.metrics_store), | |
telemetry.JvmStatsSummary(es_default, self.metrics_store), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did that, thanks! But note that es_default
vs. es
is not about about internal/always-on vs. optional: https://elastic.slack.com/archives/C0JGTNK0S/p1631778220222700
docs/telemetry.rst
Outdated
@@ -30,6 +30,7 @@ You probably want to gain additional insights from a race. Therefore, we have ad | |||
searchable-snapshots-stats Searchable Snapshots Stats Regularly samples searchable snapshots stats | |||
shard-stats Shard Stats Regularly samples nodes stats at shard level | |||
data-stream-stats Data Streams Stats Regularly samples data streams stats | |||
master-node-stats Master Node Stats Regularly determines the current master node name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the intention was to make this an internal telemetry device, so no need to expose any info of command line arguments here. See
Line 1523 in 4355771
class ClusterEnvironmentInfo(InternalTelemetryDevice): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Collect master node during benchmark execution #1324 mentioned that this device should be always on, how can I do that?
See comments around setting this as an InternalTelemetryDevice
.
Is it OK to use those APIs to retrieve the information, or should we revert to the cat API which is much less verbose? (Thanks @b-deam for the help on this!)
As per comments RE: cat
API :-)
I recorded the master at the start and end of a benchmark, is that OK?
I'd say it's preferable to sample this semi-regularly. 30s seems like a good compromise, we don't expect master nodes to change much/if at all during a benchmark, and if it does that usually indicates some stability issues.
It avoids making two calls and returns only the data we need.
We record it on start anyway, and recording on end does not seem to be considered useful.
We can always support multiple clusters in the future.
f036006
to
95cdf7e
Compare
I addressed all review comments, please take another look |
ℹ️ A seemingly unrelated test fails in a deterministic way on CI, both for Python 3.8 and Python 3.9, but I don't reproduce the failure locally yet. Test failure output
|
cc354c3
to
9de9e59
Compare
docs/telemetry.rst
Outdated
The master-node-stats telemetry device regularly reports the name of the master node by using: | ||
|
||
1. the `Cluster State API <https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state.html>`_, specifically `/_cluster/state/master_node` to determine the id of the master node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: The docs here are outdated now since we just use the cat client.
Though, I also wonder if we should even expose this in the docs given we intend for it to be an internal telemetry device that's always on? @danielmitterdorfer, thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, we should skip that for internal telemetry devices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I forgot to remove it when I updated the CLI output at the top of the page. Nice catch!
I removed it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested it out and it works well! Thanks :-) Just one quick nit around the docs, and wondering whether we should expose it at all if we intend for it to be always on..
It's an internal telemetry device now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! This will be super handy when analysing benchmark results.
When running benchmarks on clusters with multiple nodes, the master node
could perform worse due to the additional responsibilities it needs to
handle.
To identify this, we add a telemetry device to record the master node.
We don't expect this to change during a typical benchmark, which is why
the default sample interval is 30s. To compensate this large sample
interval, we also record that information on benchmark start/end.
I tested this feature with a short sample interval in test mode, checking that the information was indeed recorded in the test store:
TODO: