Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create :'elasticsearch.perform_health_check' config #2369

Closed
wants to merge 1 commit into from

Conversation

kaylareopelle
Copy link
Contributor

@kaylareopelle kaylareopelle commented Dec 15, 2023

To access the cluster name to assign to the database_name attribute on the Elasticsearch datastore segment, we currently make a request to the '_cluster/health' endpoint. This can cause CPU and network spikes when large clusters start.

To avoid this performance issue, we're introducing the :'elasticsearch.perform_health_check' config value. This value can be set to false to stop the agent from making requests to requests to '_cluster/health'.

When this config is false, the database_name attribute on all Elasticsearch datastore segments will be 'Unknown'.

Relates to: #2360

@@ -313,6 +313,7 @@ def self.enforce_fallback(allowed_values: nil, fallback: nil)
'webpacker:compile'
].join(',').freeze

# rubocop:disable Metrics/CollectionLiteralLength
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NewRelic::Agent::Configuration::DEFAULTS hash now exceeds 250 lines, which violates the Metrics/CollectionLiteralLength rule. Moving DEFAULTS into a separate data storage file, like YAML or JSON would impact many other parts of our system, such as the generation of our documentation using HEREDOCs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^-- This is a summary of a discussion with @fallwith and @hannahramadan about temporarily disabling this rule. Should it be included in a source code comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems fine to me, but i'm good with it either way!

Copy link
Contributor

@fallwith fallwith Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder for @hannahramadan to update #2367 to match

@nr_cluster_name ||= if NewRelic::Agent.config[:'elasticsearch.perform_health_check']
perform_request('GET', '_cluster/health').body['cluster_name']
else
NewRelic::UNKNOWN
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically the database_name attribute on a datastore segment can be nil. Would that make more sense here than Unknown? I think it's closer to the API's expectations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think one of the benefits to not using nil is that we won't have to check the config every single time. If we did set it to nil, then line 55 wouldn't return early anymore, so we'd have to do everything else every time and itd take a bit longer. Not sure if it would be a concerning increase though, or not.

Copy link
Contributor Author

@kaylareopelle kaylareopelle Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right! Thanks for setting me straight :)

We could add another guard clause between lines 54 and 55 to check for the config and return early if it's false, but that might still be less performant than just setting "Unknown" since we'd be adding a hash lookup every time instead of a condition just once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 'Unknown' vs nil is worth discussing for what gets sent up in the db span. Originally I felt that 'Unknown' would convey a failure to obtain the name or a deliberate short-circuit preventing the discovery, whereas nil would convey that the name is unset (given that Rails apps don't require the name in database.yml).

If nil is thought to be more appropriate, we can work around the performance implications. Perhaps defined? is faster than a hash lookup.

- @nr_cluster_name ||= if NewRelic::Agent.config[:'elasticsearch.perform_health_check']
+ return @nr_cluster_name if defined?(@nr_cluster_name)
+
+ @nr_cluster_name = if NewRelic::Agent.config[:'elasticsearch.perform_health_check']

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start_datastore_segment's default arguments set database_name to nil, which leads me to believe that's a more standard value than Unknown.

The code change I alluded to slightly differs from what you've suggested above.
Using the method as it exists on dev, the change would be:

    def nr_cluster_name
+    return unless NewRelic::Agent.config[:'elasticsearch.perform_health_check']
      return @nr_cluster_name if @nr_cluster_name
      return if nr_hosts.empty?

      NewRelic::Agent.disable_all_tracing do
        @nr_cluster_name ||= perform_request('GET', '_cluster/health').body['cluster_name']
      end
    rescue StandardError => e
      NewRelic::Agent.logger.error('Failed to get cluster name for elasticsearch', e)
      nil
    end

So the NewRelic::Agent.config check would occur every time #nr_cluster_name is called, instead of when the @nr_cluster_name variable is defined/assigned. I don't know if there's a way to make that config check more performant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was definitely thinking from the perspective of a change to config option only taking effect on a complete restart of the agent, so we'd only check the config hash one time and then either cache the string cluster name value or nil. And then use defined? for memoization help with nil. If you actually want to check the config freshly each time, then a hash lookup in a relatively small hash with symbols as keys shouldn't be too bad and the code as-is works great.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, I understand now. Thanks for clarifying! Your latest comment made me realize we don't need to check the config that often. It's reasonable to expect the config won't change during Elasticsearch client initialization. I was getting overly fixated on the test scenario where I needed to create a new client. Thank you!

:default => true,
:public => true,
:type => Boolean,
:allowed_from_server => true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is nothing on the server side that would update this config, so this should probably be false.

@nr_cluster_name ||= if NewRelic::Agent.config[:'elasticsearch.perform_health_check']
perform_request('GET', '_cluster/health').body['cluster_name']
else
NewRelic::UNKNOWN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think one of the benefits to not using nil is that we won't have to check the config every single time. If we did set it to nil, then line 55 wouldn't return early anymore, so we'd have to do everything else every time and itd take a bit longer. Not sure if it would be a concerning increase though, or not.

@@ -313,6 +313,7 @@ def self.enforce_fallback(allowed_values: nil, fallback: nil)
'webpacker:compile'
].join(',').freeze

# rubocop:disable Metrics/CollectionLiteralLength
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems fine to me, but i'm good with it either way!

lib/new_relic/agent/configuration/default_source.rb Outdated Show resolved Hide resolved
@nr_cluster_name ||= if NewRelic::Agent.config[:'elasticsearch.perform_health_check']
perform_request('GET', '_cluster/health').body['cluster_name']
else
NewRelic::UNKNOWN
Copy link
Contributor Author

@kaylareopelle kaylareopelle Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right! Thanks for setting me straight :)

We could add another guard clause between lines 54 and 55 to check for the config and return early if it's false, but that might still be less performant than just setting "Unknown" since we'd be adding a hash lookup every time instead of a condition just once.

fallwith
fallwith previously approved these changes Dec 15, 2023
@kaylareopelle kaylareopelle changed the title Create elasticsearch.perform_health_check config Create :'elasticsearch.perform_health_check' config Dec 15, 2023
@kaylareopelle kaylareopelle force-pushed the elasticsearch-perform-health-check-config branch 3 times, most recently from 239ea65 to a330da8 Compare December 16, 2023 00:42
To access the cluster name to assign to the
database_name attribute on the Elasticsearch datastore segment,
we currently make a request to the '_cluster/health' endpoint. This can
cause CPU and network spikes when large clusters start.
To avoid this performance issue, the
`:'elasticsearch.perform_health_check` config value can be set to false.
This will set the database_name attribute on the Elasticsearch datastore
segments to nil.
@kaylareopelle kaylareopelle force-pushed the elasticsearch-perform-health-check-config branch from a330da8 to b03ea44 Compare December 16, 2023 00:51
Copy link

SimpleCov Report

Coverage Threshold
Line 94.23% 94%
Branch 82.75% 82%

@gremerritt
Copy link

gremerritt commented Dec 18, 2023

@kaylareopelle Out of curiosity, why are we using the _cluster/health endpoint to pull the cluster name? Am I reading it correctly that it's only for pulling the cluster name, and does not actually check the health (ie. it is not using wait_for_status)? Would it be possible to use a safer endpoint, like _cluster/stats with master:false (or /_info/_all on 8)?

edit: A colleague also just pointed out to me that simply hitting / provides the cluster name in the response.

@kaylareopelle
Copy link
Contributor Author

Hi @gremerritt, those are great suggestions!

We considered trying out _cluster/stats, but hadn't noticed the master:false option. That seems like it'd be a great fit for the scenario in #2360. I see that you're from the same company as the reporter of #2360.

If we set that up in a branch, would you be willing to test it in a staging environment to see how that endpoint impacts your cluster?

We'd probably prefer to use _cluster/stats over _info/_all to have fewer conditions related to versions of Elasticsearch code, but if _info/_all is more performant, we'd probably default to that for users of version 8.

@gremerritt
Copy link

gremerritt commented Dec 20, 2023

@kaylareopelle thanks for the update. Yes @erikkessler1 and I have been working together on this issue. Thanks for posting that branch, I think we'll have some bandwidth to give it a test.

From some more discussion on our team, it sounds like the "canonical" way to fetch the cluster name is by hitting the base / endpoint. For example, from the ruby client this is just client.info. So I'm quite sure that you could use the following for an even simpler approach:

@nr_cluster_name ||= perform_request('GET', '/').body['cluster_name']

We're happy to test out either. Thanks for digging into this!

@kaylareopelle kaylareopelle marked this pull request as draft December 20, 2023 21:18
@kaylareopelle
Copy link
Contributor Author

Hi @gremerritt! Thanks for your response! I appreciate the work you and your team are doing to help us find the right endpoint!

Out of curiosity, what helped your team determine the base endpoint, /, was the canonical way to fetch the cluster name?

I think it's a great next test! I've created #2377 on branch test_root_for_cluster_name using the snippet you suggested to access the cluster name. Let me know how the performance testing goes! I'll post a similar update on #2360.

@kaylareopelle
Copy link
Contributor Author

Closing in favor of #2377

@kaylareopelle kaylareopelle deleted the elasticsearch-perform-health-check-config branch February 23, 2024 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants