
Add master node telemetry device #1330

Merged — pquentin merged 8 commits into elastic:master from master-node-telemetry on Sep 23, 2021

Conversation

@pquentin (Member) commented Sep 15, 2021:

When running benchmarks on clusters with multiple nodes, the master node
could perform worse due to the additional responsibilities it needs to
handle.

To identify this, we add a telemetry device to record the master node.
We don't expect the master node to change during a typical benchmark,
which is why the default sample interval is 30s. To compensate for this
large sample interval, we also record that information on benchmark
start/end.

I tested this feature with a short sample interval in test mode, checking that the information was indeed recorded in the test store:

esrally race --distribution-version=7.14.1 --car="4gheap" --track=geonames --test-mode --telemetry master-node-stats --telemetry-params="master-node-stats-sample-interval: 0.1"
http localhost:9200/rally-metrics-2021-09/_search\?size=1000 | jq . | grep master-node -c 

TODO:

  • #1324 (Collect master node during benchmark execution) mentioned that this device should always be on; how can I do that?
  • Is it OK to use those APIs to retrieve the information, or should we revert to the cat API which is much less verbose? (Thanks @b-deam for the help on this!)
  • I recorded the master at the start and end of a benchmark, is that OK?
  • Fix CI

@pquentin pquentin added the enhancement (Improves the status quo) and :Telemetry (Telemetry Devices that gather additional metrics) labels Sep 15, 2021
@pquentin pquentin added this to the 2.3.0 milestone Sep 15, 2021
@pquentin pquentin self-assigned this Sep 15, 2021
t.on_benchmark_stop()

# the two clients record start and stop
assert record_count == 4
@pquentin (Member Author) commented:
For what it's worth, I tried using @mock.patch and call_count here, but I was only seeing 2 calls, not 4. I still don't know why. I ended up switching to the pytest style, which is more verbose but less magical.
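
To illustrate the pytest style mentioned above, here is a minimal, self-contained sketch (hypothetical names, not the actual test from this PR): a hand-written fake metrics store counts every record explicitly, instead of relying on mock.patch and call_count.

class FakeMetricsStore:
    def __init__(self):
        self.record_count = 0

    def put_doc(self, doc):
        # count every document a telemetry device emits
        self.record_count += 1


class FakeDevice:
    def __init__(self, store):
        self.store = store

    def on_benchmark_start(self):
        self.store.put_doc({"event": "start"})

    def on_benchmark_stop(self):
        self.store.put_doc({"event": "stop"})


def test_two_clients_record_start_and_stop():
    store = FakeMetricsStore()
    devices = [FakeDevice(store), FakeDevice(store)]
    for d in devices:
        d.on_benchmark_start()
    for d in devices:
        d.on_benchmark_stop()
    # the two clients record start and stop
    assert store.record_count == 4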

@pquentin (Member Author) commented:

Is it OK to use those APIs to retrieve the information, or should we revert to the cat API which is much less verbose? (Thanks @b-deam for the help on this!)

@sethmlarson mentioned to me that we can use format="json" for the cat API, and it's stable, so maybe that's best in this case (only one call, less data to transfer). Will let you decide, switching is easy in any case.

esrally/telemetry.py (outdated review thread, resolved)
@b-deam (Member) commented Sep 16, 2021:

Is it OK to use those APIs to retrieve the information, or should we revert to the cat API which is much less verbose? (Thanks @b-deam for the help on this!)

@sethmlarson mentioned to me that we can use format="json" for the cat API, and it's stable, so maybe that's best in this case (only one call, less data to transfer). Will let you decide, switching is easy in any case.

IMO the cat approach is certainly easiest. If @sethmlarson says it's stable, then I'd trust him ;-). In either case, I'd probably want to avoid extraneous API calls (e.g. to Node Info or others).
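
For reference, a hedged sketch of the two retrieval options under discussion, using the elasticsearch-py client against a local cluster (the endpoint URL is an assumption; error handling omitted):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Option 1: cat API with format="json"; a single call whose response
# includes the master node's name directly.
master_name = es.cat.master(format="json")[0]["node"]

# Option 2: Cluster State API to get the master node id, then a second
# call to resolve that id to a node name; two round trips, more data.
master_id = es.cluster.state(metric="master_node")["master_node"]
master_name = es.nodes.info(node_id=master_id)["nodes"][master_id]["name"]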

@@ -610,6 +610,7 @@ def prepare_telemetry(self, es, enable):
telemetry.TransformStats(telemetry_params, es, self.metrics_store),
telemetry.SearchableSnapshotsStats(telemetry_params, es, self.metrics_store),
telemetry.DataStreamStats(telemetry_params, es, self.metrics_store),
telemetry.MasterNodeStats(telemetry_params, es, self.metrics_store),
@b-deam (Member) commented:

As above, if you want this to be an internal telemetry device that runs by default, then you can just pass es_default in, as below:

telemetry.ExternalEnvironmentInfo(es_default, self.metrics_store),
telemetry.ClusterEnvironmentInfo(es_default, self.metrics_store),
telemetry.JvmStatsSummary(es_default, self.metrics_store),

@pquentin (Member Author) replied:

I did that, thanks! But note that es_default vs. es is not about internal/always-on vs. optional: https://elastic.slack.com/archives/C0JGTNK0S/p1631778220222700

@@ -30,6 +30,7 @@ You probably want to gain additional insights from a race. Therefore, we have ad
searchable-snapshots-stats Searchable Snapshots Stats Regularly samples searchable snapshots stats
shard-stats Shard Stats Regularly samples nodes stats at shard level
data-stream-stats Data Streams Stats Regularly samples data streams stats
master-node-stats Master Node Stats Regularly determines the current master node name
@b-deam (Member) commented:

I think the intention was to make this an internal telemetry device, so there's no need to expose it in the command line docs here. See

class ClusterEnvironmentInfo(InternalTelemetryDevice):
for an example.

@pquentin (Member Author) replied:

Thanks, fixed

@b-deam (Member) left a comment:

#1324 (Collect master node during benchmark execution) mentioned that this device should always be on; how can I do that?

See comments around setting this as an InternalTelemetryDevice.

Is it OK to use those APIs to retrieve the information, or should we revert to the cat API which is much less verbose? (Thanks @b-deam for the help on this!)

As per comments RE: cat API :-)

I recorded the master at the start and end of a benchmark, is that OK?

I'd say it's preferable to sample this semi-regularly. 30s seems like a good compromise: we don't expect master nodes to change much, if at all, during a benchmark, and if one does, that usually indicates some stability issue.
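
As a rough sketch of what an always-on sampler along those lines could look like (class and method names here are illustrative assumptions, not the merged implementation):

import threading


class MasterNodeStats:
    # in Rally this would subclass InternalTelemetryDevice
    def __init__(self, client, metrics_store, sample_interval=30):
        self.client = client
        self.metrics_store = metrics_store
        self.sample_interval = sample_interval
        self.stop_event = threading.Event()
        self.thread = None

    def record(self):
        # a single cat API call returns the current master node's name
        master_name = self.client.cat.master(format="json")[0]["node"]
        # record_master_node() is a hypothetical metrics store method
        self.metrics_store.record_master_node(master_name)

    def on_benchmark_start(self):
        def sample():
            # wait() returns False on timeout, True once stop is signalled
            while not self.stop_event.wait(self.sample_interval):
                self.record()

        self.record()  # record once immediately, then every interval
        self.thread = threading.Thread(target=sample, daemon=True)
        self.thread.start()

    def on_benchmark_stop(self):
        self.stop_event.set()
        if self.thread:
            self.thread.join()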

Follow-up commit messages:

  • It avoids making two calls and returns only the data we need.
  • We record it on start anyway, and recording on end does not seem to be considered useful.
  • We can always support multiple clusters in the future.
@pquentin pquentin force-pushed the master-node-telemetry branch from f036006 to 95cdf7e on September 16, 2021 10:46
@pquentin (Member Author) commented:

I addressed all review comments; please take another look.

@pquentin (Member Author) commented:

ℹ️ A seemingly unrelated test fails in a deterministic way on CI, both for Python 3.8 and Python 3.9, but I can't reproduce the failure locally yet.

Test failure output
16:26:23 ______ DriverTests.test_client_reaches_join_point_which_completes_parent _______
16:26:23 
16:26:23 self = <tests.driver.driver_test.DriverTests testMethod=test_client_reaches_join_point_which_completes_parent>
16:26:23 
16:26:23     def test_client_reaches_join_point_which_completes_parent(self):
16:26:23         target = self.create_test_driver_target()
16:26:23         d = driver.Driver(target, self.cfg, es_client_factory_class=DriverTests.StaticClientFactory)
16:26:23     
16:26:23         d.prepare_benchmark(t=self.track)
16:26:23         d.start_benchmark()
16:26:23     
16:26:23         self.assertEqual(0, len(d.workers_completed_current_step))
16:26:23     
16:26:23         d.joinpoint_reached(
16:26:23             worker_id=0,
16:26:23             worker_local_timestamp=10,
16:26:23             task_allocations=[driver.ClientAllocation(client_id=0, task=driver.JoinPoint(id=0, clients_executing_completing_task=[0]))],
16:26:23         )
16:26:23     
16:26:23         self.assertEqual(-1, d.current_step)
16:26:23         self.assertEqual(1, len(d.workers_completed_current_step))
16:26:23         # notified all drivers that they should complete the current task ASAP
16:26:23         self.assertEqual(4, target.complete_current_task.call_count)
16:26:23     
16:26:23         # awaiting responses of other clients
16:26:23         d.joinpoint_reached(
16:26:23             worker_id=1,
16:26:23             worker_local_timestamp=11,
16:26:23             task_allocations=[driver.ClientAllocation(client_id=1, task=driver.JoinPoint(id=0, clients_executing_completing_task=[0]))],
16:26:23         )
16:26:23     
16:26:23         self.assertEqual(-1, d.current_step)
16:26:23         self.assertEqual(2, len(d.workers_completed_current_step))
16:26:23     
16:26:23         d.joinpoint_reached(
16:26:23             worker_id=2,
16:26:23             worker_local_timestamp=12,
16:26:23             task_allocations=[driver.ClientAllocation(client_id=2, task=driver.JoinPoint(id=0, clients_executing_completing_task=[0]))],
16:26:23         )
16:26:23         self.assertEqual(-1, d.current_step)
16:26:23         self.assertEqual(3, len(d.workers_completed_current_step))
16:26:23     
16:26:23 >       d.joinpoint_reached(
16:26:23             worker_id=3,
16:26:23             worker_local_timestamp=13,
16:26:23             task_allocations=[driver.ClientAllocation(client_id=3, task=driver.JoinPoint(id=0, clients_executing_completing_task=[0]))],
16:26:23         )
16:26:23 
16:26:23 tests/driver/driver_test.py:226: 
16:26:23 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
16:26:23 esrally/driver/driver.py:752: in joinpoint_reached
16:26:23     self.move_to_next_task(workers_curr_step)
16:26:23 esrally/driver/driver.py:768: in move_to_next_task
16:26:23     m = self.metrics_store.to_externalizable(clear=True)
16:26:23 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
16:26:23 
16:26:23 self = <esrally.metrics.InMemoryMetricsStore object at 0x7f0b7ce452e0>
16:26:23 clear = True
16:26:23 
16:26:23     def to_externalizable(self, clear=False):
16:26:23         docs = self.docs
16:26:23         if clear:
16:26:23             self.docs = []
16:26:23 >       compressed = zlib.compress(pickle.dumps(docs))
16:26:23 E       _pickle.PicklingError: Can't pickle <class 'unittest.mock.MagicMock'>: it's not the same object as unittest.mock.MagicMock

Thanks to a suggestion from Daniel
@pquentin pquentin force-pushed the master-node-telemetry branch from cc354c3 to 9de9e59 on September 20, 2021 11:54
@pquentin pquentin requested a review from b-deam September 21, 2021 09:35
The master-node-stats telemetry device regularly reports the name of the master node by using:

1. the `Cluster State API <https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state.html>`_, specifically `/_cluster/state/master_node` to determine the id of the master node
@b-deam (Member) commented Sep 22, 2021:

Nit: The docs here are outdated now since we just use the cat client.

Though, I also wonder if we should even expose this in the docs given we intend for it to be an internal telemetry device that's always on? @danielmitterdorfer, thoughts?

A member replied:

I agree, we should skip that for internal telemetry devices.

@pquentin (Member Author) replied:

Thanks, I forgot to remove it when I updated the CLI output at the top of the page. Nice catch!

I removed it now.

@b-deam (Member) left a comment:

I tested it out and it works well! Thanks :-) Just one quick nit around the docs, and I'm wondering whether we should expose it at all if we intend for it to be always on.

@pquentin pquentin requested a review from b-deam September 22, 2021 07:39
It's an internal telemetry device now.
@b-deam (Member) left a comment:

LGTM! This will be super handy when analysing benchmark results.

@pquentin pquentin merged commit b540f56 into elastic:master Sep 23, 2021
@pquentin pquentin deleted the master-node-telemetry branch September 23, 2021 05:41