
Add master node telemetry device #1330

Merged — pquentin merged 8 commits into elastic:master from master-node-telemetry on Sep 23, 2021

Conversation

@pquentin (Member) commented Sep 15, 2021:

When running benchmarks on clusters with multiple nodes, the master node
could perform worse due to the additional responsibilities it needs to
handle.

To identify this, we add a telemetry device to record the master node.
We don't expect the master node to change during a typical benchmark,
which is why the default sample interval is 30s. To compensate for this
large sample interval, we also record that information on benchmark
start/end.

I tested this feature with a short sample interval in test mode, checking that the information was indeed recorded in the test store:

esrally race --distribution-version=7.14.1 --car="4gheap" --track=geonames --test-mode --telemetry master-node-stats --telemetry-params="master-node-stats-sample-interval: 0.1"
http localhost:9200/rally-metrics-2021-09/_search\?size=1000 | jq . | grep master-node -c 

TODO:

  • #1324 (Collect master node during benchmark execution) mentioned that this device should always be on; how can I do that?
  • Is it OK to use those APIs to retrieve the information, or should we revert to the cat API which is much less verbose? (Thanks @b-deam for the help on this!)
  • I recorded the master at the start and end of a benchmark, is that OK?
  • Fix CI

@pquentin pquentin added the enhancement (Improves the status quo) and :Telemetry (Telemetry Devices that gather additional metrics) labels Sep 15, 2021
@pquentin pquentin added this to the 2.3.0 milestone Sep 15, 2021
@pquentin pquentin self-assigned this Sep 15, 2021
t.on_benchmark_stop()

# the two clients record start and stop
assert record_count == 4
@pquentin (Member Author) commented:
For what it's worth, I tried using @mock.patch and call_count here, but I was only seeing 2 calls, not 4. I still don't know why. I ended up switching to the pytest style, which is more verbose but less magical.
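
To illustrate the pytest style mentioned above, here is a minimal, self-contained sketch (hypothetical names, not the actual test from this PR): a hand-written fake metrics store counts every record explicitly, instead of relying on mock.patch and call_count.

class FakeMetricsStore:
    def __init__(self):
        self.record_count = 0

    def put_doc(self, doc):
        # count every document a telemetry device emits
        self.record_count += 1


class FakeDevice:
    def __init__(self, store):
        self.store = store

    def on_benchmark_start(self):
        self.store.put_doc({"event": "start"})

    def on_benchmark_stop(self):
        self.store.put_doc({"event": "stop"})


def test_two_clients_record_start_and_stop():
    store = FakeMetricsStore()
    devices = [FakeDevice(store), FakeDevice(store)]
    for d in devices:
        d.on_benchmark_start()
    for d in devices:
        d.on_benchmark_stop()
    # the two clients record start and stop
    assert store.record_count == 4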

@pquentin (Member Author) commented:

Is it OK to use those APIs to retrieve the information, or should we revert to the cat API which is much less verbose? (Thanks @b-deam for the help on this!)

@sethmlarson mentioned to me that we can use format="json" for the cat API, and it's stable, so maybe that's best in this case (only one call, less data to transfer). Will let you decide, switching is easy in any case.

esrally/telemetry.py (outdated review thread, resolved)
@b-deam (Member) commented Sep 16, 2021:

Is it OK to use those APIs to retrieve the information, or should we revert to the cat API which is much less verbose? (Thanks @b-deam for the help on this!)

@sethmlarson mentioned to me that we can use format="json" for the cat API, and it's stable, so maybe that's best in this case (only one call, less data to transfer). Will let you decide, switching is easy in any case.

IMO the cat approach is certainly easiest. If @sethmlarson says it's stable, then I'd trust him ;-). In either case, I'd probably want to avoid extraneous API calls (e.g. to Node Info or others).
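
For reference, a hedged sketch of the two retrieval options under discussion, using the elasticsearch-py client against a local cluster (the endpoint URL is an assumption; error handling omitted):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Option 1: cat API with format="json"; a single call whose response
# includes the master node's name directly.
master_name = es.cat.master(format="json")[0]["node"]

# Option 2: Cluster State API to get the master node id, then a second
# call to resolve that id to a node name; two round trips, more data.
master_id = es.cluster.state(metric="master_node")["master_node"]
master_name = es.nodes.info(node_id=master_id)["nodes"][master_id]["name"]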

@@ -610,6 +610,7 @@ def prepare_telemetry(self, es, enable):
telemetry.TransformStats(telemetry_params, es, self.metrics_store),
telemetry.SearchableSnapshotsStats(telemetry_params, es, self.metrics_store),
telemetry.DataStreamStats(telemetry_params, es, self.metrics_store),
telemetry.MasterNodeStats(telemetry_params, es, self.metrics_store),
@b-deam (Member) commented:

As above, if you want this to be an internal telemetry device that runs by default, then you can just pass es_default in, as below:

telemetry.ExternalEnvironmentInfo(es_default, self.metrics_store),
telemetry.ClusterEnvironmentInfo(es_default, self.metrics_store),
telemetry.JvmStatsSummary(es_default, self.metrics_store),

@pquentin (Member Author) replied:

I did that, thanks! But note that es_default vs. es is not about internal/always-on vs. optional: https://elastic.slack.com/archives/C0JGTNK0S/p1631778220222700

@@ -30,6 +30,7 @@ You probably want to gain additional insights from a race. Therefore, we have ad
searchable-snapshots-stats Searchable Snapshots Stats Regularly samples searchable snapshots stats
shard-stats Shard Stats Regularly samples nodes stats at shard level
data-stream-stats Data Streams Stats Regularly samples data streams stats
master-node-stats Master Node Stats Regularly determines the current master node name
@b-deam (Member) commented:

I think the intention was to make this an internal telemetry device, so there's no need to expose it in the command line docs here. See

class ClusterEnvironmentInfo(InternalTelemetryDevice):
for an example.

@pquentin (Member Author) replied:

Thanks, fixed

@b-deam (Member) left a comment:

#1324 (Collect master node during benchmark execution) mentioned that this device should always be on; how can I do that?

See comments around setting this as an InternalTelemetryDevice.

Is it OK to use those APIs to retrieve the information, or should we revert to the cat API which is much less verbose? (Thanks @b-deam for the help on this!)

As per comments RE: cat API :-)

I recorded the master at the start and end of a benchmark, is that OK?

I'd say it's preferable to sample this semi-regularly. 30s seems like a good compromise: we don't expect master nodes to change much, if at all, during a benchmark, and if one does, that usually indicates some stability issue.
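
As a rough sketch of what an always-on sampler along those lines could look like (class and method names here are illustrative assumptions, not the merged implementation):

import threading


class MasterNodeStats:
    # in Rally this would subclass InternalTelemetryDevice
    def __init__(self, client, metrics_store, sample_interval=30):
        self.client = client
        self.metrics_store = metrics_store
        self.sample_interval = sample_interval
        self.stop_event = threading.Event()
        self.thread = None

    def record(self):
        # a single cat API call returns the current master node's name
        master_name = self.client.cat.master(format="json")[0]["node"]
        # record_master_node() is a hypothetical metrics store method
        self.metrics_store.record_master_node(master_name)

    def on_benchmark_start(self):
        def sample():
            # wait() returns False on timeout, True once stop is signalled
            while not self.stop_event.wait(self.sample_interval):
                self.record()

        self.record()  # record once immediately, then every interval
        self.thread = threading.Thread(target=sample, daemon=True)
        self.thread.start()

    def on_benchmark_stop(self):
        self.stop_event.set()
        if self.thread:
            self.thread.join()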

Follow-up commit messages:

  • It avoids making two calls and returns only the data we need.
  • We record it on start anyway, and recording on end does not seem to be considered useful.
  • We can always support multiple clusters in the future.
@pquentin pquentin force-pushed the master-node-telemetry branch from f036006 to 95cdf7e on September 16, 2021 10:46
@pquentin (Member Author) commented:

I addressed all review comments; please take another look.

@pquentin (Member Author) commented:

ℹ️ A seemingly unrelated test fails in a deterministic way on CI, both for Python 3.8 and Python 3.9, but I can't reproduce the failure locally yet.

Test failure output
16:26:23 ______ DriverTests.test_client_reaches_join_point_which_completes_parent _______
16:26:23 
16:26:23 self = <tests.driver.driver_test.DriverTests testMethod=test_client_reaches_join_point_which_completes_parent>
16:26:23 
16:26:23     def test_client_reaches_join_point_which_completes_parent(self):
16:26:23         target = self.create_test_driver_target()
16:26:23         d = driver.Driver(target, self.cfg, es_client_factory_class=DriverTests.StaticClientFactory)
16:26:23     
16:26:23         d.prepare_benchmark(t=self.track)
16:26:23         d.start_benchmark()
16:26:23     
16:26:23         self.assertEqual(0, len(d.workers_completed_current_step))
16:26:23     
16:26:23         d.joinpoint_reached(
16:26:23             worker_id=0,
16:26:23             worker_local_timestamp=10,
16:26:23             task_allocations=[driver.ClientAllocation(client_id=0, task=driver.JoinPoint(id=0, clients_executing_completing_task=[0]))],
16:26:23         )
16:26:23     
16:26:23         self.assertEqual(-1, d.current_step)
16:26:23         self.assertEqual(1, len(d.workers_completed_current_step))
16:26:23         # notified all drivers that they should complete the current task ASAP
16:26:23         self.assertEqual(4, target.complete_current_task.call_count)
16:26:23     
16:26:23         # awaiting responses of other clients
16:26:23         d.joinpoint_reached(
16:26:23             worker_id=1,
16:26:23             worker_local_timestamp=11,
16:26:23             task_allocations=[driver.ClientAllocation(client_id=1, task=driver.JoinPoint(id=0, clients_executing_completing_task=[0]))],
16:26:23         )
16:26:23     
16:26:23         self.assertEqual(-1, d.current_step)
16:26:23         self.assertEqual(2, len(d.workers_completed_current_step))
16:26:23     
16:26:23         d.joinpoint_reached(
16:26:23             worker_id=2,
16:26:23             worker_local_timestamp=12,
16:26:23             task_allocations=[driver.ClientAllocation(client_id=2, task=driver.JoinPoint(id=0, clients_executing_completing_task=[0]))],
16:26:23         )
16:26:23         self.assertEqual(-1, d.current_step)
16:26:23         self.assertEqual(3, len(d.workers_completed_current_step))
16:26:23     
16:26:23 >       d.joinpoint_reached(
16:26:23             worker_id=3,
16:26:23             worker_local_timestamp=13,
16:26:23             task_allocations=[driver.ClientAllocation(client_id=3, task=driver.JoinPoint(id=0, clients_executing_completing_task=[0]))],
16:26:23         )
16:26:23 
16:26:23 tests/driver/driver_test.py:226: 
16:26:23 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
16:26:23 esrally/driver/driver.py:752: in joinpoint_reached
16:26:23     self.move_to_next_task(workers_curr_step)
16:26:23 esrally/driver/driver.py:768: in move_to_next_task
16:26:23     m = self.metrics_store.to_externalizable(clear=True)
16:26:23 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
16:26:23 
16:26:23 self = <esrally.metrics.InMemoryMetricsStore object at 0x7f0b7ce452e0>
16:26:23 clear = True
16:26:23 
16:26:23     def to_externalizable(self, clear=False):
16:26:23         docs = self.docs
16:26:23         if clear:
16:26:23             self.docs = []
16:26:23 >       compressed = zlib.compress(pickle.dumps(docs))
16:26:23 E       _pickle.PicklingError: Can't pickle <class 'unittest.mock.MagicMock'>: it's not the same object as unittest.mock.MagicMock

Thanks to a suggestion from Daniel
@pquentin pquentin force-pushed the master-node-telemetry branch from cc354c3 to 9de9e59 on September 20, 2021 11:54
@pquentin pquentin requested a review from b-deam September 21, 2021 09:35
The master-node-stats telemetry device regularly reports the name of the master node by using:

1. the `Cluster State API <https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state.html>`_, specifically `/_cluster/state/master_node` to determine the id of the master node
@b-deam (Member) commented Sep 22, 2021:

Nit: The docs here are outdated now since we just use the cat client.

Though, I also wonder if we should even expose this in the docs given we intend for it to be an internal telemetry device that's always on? @danielmitterdorfer, thoughts?

A member replied:

I agree, we should skip that for internal telemetry devices.

@pquentin (Member Author) replied:

Thanks, I forgot to remove it when I updated the CLI output at the top of the page. Nice catch!

I removed it now.

@b-deam (Member) left a comment:

I tested it out and it works well! Thanks :-) Just one quick nit around the docs, and I'm wondering whether we should expose it at all if we intend for it to be always on.

@pquentin pquentin requested a review from b-deam September 22, 2021 07:39
It's an internal telemetry device now.
@b-deam (Member) left a comment:

LGTM! This will be super handy when analysing benchmark results.

@pquentin pquentin merged commit b540f56 into elastic:master Sep 23, 2021
@pquentin pquentin deleted the master-node-telemetry branch September 23, 2021 05:41