Create :'elasticsearch.perform_health_check' config #2369

kaylareopelle · 2023-12-15T17:41:28Z

To access the cluster name to assign to the database_name attribute on the Elasticsearch datastore segment, we currently make a request to the '_cluster/health' endpoint. This can cause CPU and network spikes when large clusters start.

To avoid this performance issue, we're introducing the :'elasticsearch.perform_health_check' config value. This value can be set to false to stop the agent from making requests to requests to '_cluster/health'.

When this config is false, the database_name attribute on all Elasticsearch datastore segments will be 'Unknown'.

Relates to: #2360

kaylareopelle · 2023-12-15T17:43:57Z

lib/new_relic/agent/configuration/default_source.rb

@@ -313,6 +313,7 @@ def self.enforce_fallback(allowed_values: nil, fallback: nil)
        'webpacker:compile'
      ].join(',').freeze

+      # rubocop:disable Metrics/CollectionLiteralLength


The NewRelic::Agent::Configuration::DEFAULTS hash now exceeds 250 lines, which violates the Metrics/CollectionLiteralLength rule. Moving DEFAULTS into a separate data storage file, like YAML or JSON would impact many other parts of our system, such as the generation of our documentation using HEREDOCs.

^-- This is a summary of a discussion with @fallwith and @hannahramadan about temporarily disabling this rule. Should it be included in a source code comment?

It seems fine to me, but i'm good with it either way!

Reminder for @hannahramadan to update #2367 to match

kaylareopelle · 2023-12-15T17:55:29Z

lib/new_relic/agent/instrumentation/elasticsearch/instrumentation.rb

+        @nr_cluster_name ||= if NewRelic::Agent.config[:'elasticsearch.perform_health_check']
+          perform_request('GET', '_cluster/health').body['cluster_name']
+        else
+          NewRelic::UNKNOWN


Technically the database_name attribute on a datastore segment can be nil. Would that make more sense here than Unknown? I think it's closer to the API's expectations.

Hmm I think one of the benefits to not using nil is that we won't have to check the config every single time. If we did set it to nil, then line 55 wouldn't return early anymore, so we'd have to do everything else every time and itd take a bit longer. Not sure if it would be a concerning increase though, or not.

That's right! Thanks for setting me straight :)

We could add another guard clause between lines 54 and 55 to check for the config and return early if it's false, but that might still be less performant than just setting "Unknown" since we'd be adding a hash lookup every time instead of a condition just once.

I think 'Unknown' vs nil is worth discussing for what gets sent up in the db span. Originally I felt that 'Unknown' would convey a failure to obtain the name or a deliberate short-circuit preventing the discovery, whereas nil would convey that the name is unset (given that Rails apps don't require the name in database.yml).

If nil is thought to be more appropriate, we can work around the performance implications. Perhaps defined? is faster than a hash lookup.

- @nr_cluster_name ||= if NewRelic::Agent.config[:'elasticsearch.perform_health_check'] + return @nr_cluster_name if defined?(@nr_cluster_name) + + @nr_cluster_name = if NewRelic::Agent.config[:'elasticsearch.perform_health_check']

start_datastore_segment's default arguments set database_name to nil, which leads me to believe that's a more standard value than Unknown.

The code change I alluded to slightly differs from what you've suggested above.
Using the method as it exists on dev, the change would be:

def nr_cluster_name + return unless NewRelic::Agent.config[:'elasticsearch.perform_health_check'] return @nr_cluster_name if @nr_cluster_name return if nr_hosts.empty? NewRelic::Agent.disable_all_tracing do @nr_cluster_name ||= perform_request('GET', '_cluster/health').body['cluster_name'] end rescue StandardError => e NewRelic::Agent.logger.error('Failed to get cluster name for elasticsearch', e) nil end

So the NewRelic::Agent.config check would occur every time #nr_cluster_name is called, instead of when the @nr_cluster_name variable is defined/assigned. I don't know if there's a way to make that config check more performant.

Ah, I was definitely thinking from the perspective of a change to config option only taking effect on a complete restart of the agent, so we'd only check the config hash one time and then either cache the string cluster name value or nil. And then use defined? for memoization help with nil. If you actually want to check the config freshly each time, then a hash lookup in a relatively small hash with symbols as keys shouldn't be too bad and the code as-is works great.

Ohh, I understand now. Thanks for clarifying! Your latest comment made me realize we don't need to check the config that often. It's reasonable to expect the config won't change during Elasticsearch client initialization. I was getting overly fixated on the test scenario where I needed to create a new client. Thank you!

tannalynn · 2023-12-15T18:27:01Z

lib/new_relic/agent/configuration/default_source.rb

+          :default => true,
+          :public => true,
+          :type => Boolean,
+          :allowed_from_server => true,


There is nothing on the server side that would update this config, so this should probably be false.

tannalynn · 2023-12-15T18:55:59Z

lib/new_relic/agent/instrumentation/elasticsearch/instrumentation.rb

+        @nr_cluster_name ||= if NewRelic::Agent.config[:'elasticsearch.perform_health_check']
+          perform_request('GET', '_cluster/health').body['cluster_name']
+        else
+          NewRelic::UNKNOWN


Hmm I think one of the benefits to not using nil is that we won't have to check the config every single time. If we did set it to nil, then line 55 wouldn't return early anymore, so we'd have to do everything else every time and itd take a bit longer. Not sure if it would be a concerning increase though, or not.

tannalynn · 2023-12-15T18:59:45Z

lib/new_relic/agent/configuration/default_source.rb

@@ -313,6 +313,7 @@ def self.enforce_fallback(allowed_values: nil, fallback: nil)
        'webpacker:compile'
      ].join(',').freeze

+      # rubocop:disable Metrics/CollectionLiteralLength


It seems fine to me, but i'm good with it either way!

lib/new_relic/agent/configuration/default_source.rb

kaylareopelle · 2023-12-15T20:07:12Z

lib/new_relic/agent/instrumentation/elasticsearch/instrumentation.rb

+        @nr_cluster_name ||= if NewRelic::Agent.config[:'elasticsearch.perform_health_check']
+          perform_request('GET', '_cluster/health').body['cluster_name']
+        else
+          NewRelic::UNKNOWN


That's right! Thanks for setting me straight :)

We could add another guard clause between lines 54 and 55 to check for the config and return early if it's false, but that might still be less performant than just setting "Unknown" since we'd be adding a hash lookup every time instead of a condition just once.

To access the cluster name to assign to the database_name attribute on the Elasticsearch datastore segment, we currently make a request to the '_cluster/health' endpoint. This can cause CPU and network spikes when large clusters start. To avoid this performance issue, the `:'elasticsearch.perform_health_check` config value can be set to false. This will set the database_name attribute on the Elasticsearch datastore segments to nil.

github-actions · 2023-12-16T01:00:15Z

SimpleCov Report

	Coverage	Threshold
Line	94.23%	94%
Branch	82.75%	82%

gremerritt · 2023-12-18T16:04:42Z

@kaylareopelle Out of curiosity, why are we using the _cluster/health endpoint to pull the cluster name? Am I reading it correctly that it's only for pulling the cluster name, and does not actually check the health (ie. it is not using wait_for_status)? Would it be possible to use a safer endpoint, like _cluster/stats with master:false (or /_info/_all on 8)?

edit: A colleague also just pointed out to me that simply hitting / provides the cluster name in the response.

kaylareopelle · 2023-12-19T23:26:37Z

Hi @gremerritt, those are great suggestions!

We considered trying out _cluster/stats, but hadn't noticed the master:false option. That seems like it'd be a great fit for the scenario in #2360. I see that you're from the same company as the reporter of #2360.

If we set that up in a branch, would you be willing to test it in a staging environment to see how that endpoint impacts your cluster?

We'd probably prefer to use _cluster/stats over _info/_all to have fewer conditions related to versions of Elasticsearch code, but if _info/_all is more performant, we'd probably default to that for users of version 8.

gremerritt · 2023-12-20T15:55:42Z

@kaylareopelle thanks for the update. Yes @erikkessler1 and I have been working together on this issue. Thanks for posting that branch, I think we'll have some bandwidth to give it a test.

From some more discussion on our team, it sounds like the "canonical" way to fetch the cluster name is by hitting the base / endpoint. For example, from the ruby client this is just client.info. So I'm quite sure that you could use the following for an even simpler approach:

@nr_cluster_name ||= perform_request('GET', '/').body['cluster_name']

We're happy to test out either. Thanks for digging into this!

kaylareopelle · 2023-12-20T23:53:48Z

Hi @gremerritt! Thanks for your response! I appreciate the work you and your team are doing to help us find the right endpoint!

Out of curiosity, what helped your team determine the base endpoint, /, was the canonical way to fetch the cluster name?

I think it's a great next test! I've created #2377 on branch test_root_for_cluster_name using the snippet you suggested to access the cluster name. Let me know how the performance testing goes! I'll post a similar update on #2360.

kaylareopelle · 2023-12-22T20:24:57Z

Closing in favor of #2377

kaylareopelle requested review from fallwith, hannahramadan and tannalynn as code owners December 15, 2023 17:41

kaylareopelle commented Dec 15, 2023

View reviewed changes

tannalynn reviewed Dec 15, 2023

View reviewed changes

kaylareopelle commented Dec 15, 2023

View reviewed changes

fallwith previously approved these changes Dec 15, 2023

View reviewed changes

kaylareopelle changed the title ~~Create elasticsearch.perform_health_check config~~ Create :'elasticsearch.perform_health_check' config Dec 15, 2023

kaylareopelle dismissed fallwith’s stale review via 9574c2a December 16, 2023 00:39

kaylareopelle force-pushed the elasticsearch-perform-health-check-config branch 3 times, most recently from 239ea65 to a330da8 Compare December 16, 2023 00:42

kaylareopelle force-pushed the elasticsearch-perform-health-check-config branch from a330da8 to b03ea44 Compare December 16, 2023 00:51

hannahramadan mentioned this pull request Dec 18, 2023

ViewComponent instrumentation #2367

Merged

This was referenced Dec 19, 2023

Use _cluster/stats to access cluster name #2374

Closed

Elasticsearch: Instrumentation adds potentially dangerous GET _cluster/health request to every initial request #2360

Closed

kaylareopelle marked this pull request as draft December 20, 2023 21:18

kaylareopelle closed this Dec 22, 2023

kaylareopelle deleted the elasticsearch-perform-health-check-config branch February 23, 2024 18:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create :'elasticsearch.perform_health_check' config #2369

Create :'elasticsearch.perform_health_check' config #2369

kaylareopelle commented Dec 15, 2023 •

edited

Loading

kaylareopelle Dec 15, 2023

kaylareopelle Dec 15, 2023

tannalynn Dec 15, 2023

fallwith Dec 15, 2023 •

edited

Loading

kaylareopelle Dec 15, 2023

tannalynn Dec 15, 2023

kaylareopelle Dec 15, 2023 •

edited

Loading

fallwith Dec 15, 2023

kaylareopelle Dec 15, 2023

fallwith Dec 16, 2023

kaylareopelle Dec 16, 2023

tannalynn Dec 15, 2023

tannalynn Dec 15, 2023

tannalynn Dec 15, 2023

kaylareopelle Dec 15, 2023 •

edited

Loading

github-actions bot commented Dec 16, 2023

gremerritt commented Dec 18, 2023 •

edited

Loading

kaylareopelle commented Dec 19, 2023

gremerritt commented Dec 20, 2023 •

edited

Loading

kaylareopelle commented Dec 20, 2023

kaylareopelle commented Dec 22, 2023

Create :'elasticsearch.perform_health_check' config #2369

Create :'elasticsearch.perform_health_check' config #2369

Conversation

kaylareopelle commented Dec 15, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fallwith Dec 15, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kaylareopelle Dec 15, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kaylareopelle Dec 15, 2023 • edited Loading

Choose a reason for hiding this comment

github-actions bot commented Dec 16, 2023

SimpleCov Report

gremerritt commented Dec 18, 2023 • edited Loading

kaylareopelle commented Dec 19, 2023

gremerritt commented Dec 20, 2023 • edited Loading

kaylareopelle commented Dec 20, 2023

kaylareopelle commented Dec 22, 2023

kaylareopelle commented Dec 15, 2023 •

edited

Loading

fallwith Dec 15, 2023 •

edited

Loading

kaylareopelle Dec 15, 2023 •

edited

Loading

kaylareopelle Dec 15, 2023 •

edited

Loading

gremerritt commented Dec 18, 2023 •

edited

Loading

gremerritt commented Dec 20, 2023 •

edited

Loading