
sourcemap: failing to ping ES within 1s in metadata fetcher renders fetcher permanently broken #10338

Closed
axw opened this issue Feb 23, 2023 · 7 comments


axw commented Feb 23, 2023

If the new Elasticsearch sourcemap fetcher fails to initialise, then it sets an error that is never cleared, and all future fetch attempts will fail with that error. Temporary issues, such as Elasticsearch being unavailable during startup, will render the fetcher permanently broken.

Moreover, initialisation involves making a successful request to Elasticsearch within 1 second, which may not be a sufficient amount of time when crossing a WAN:

ctx, cancel := context.WithTimeout(parent, 1*time.Second)
err := s.ping(ctx)
cancel()
if err != nil {
	// it is fine to not lock here since err will not access
	// initErr until the init channel is closed.
	s.initErr = fmt.Errorf("failed to ping es cluster: %w: %v", errFetcherUnvailable, err)
	s.logger.Error(s.initErr)
} else {

Do we need to check the initErr in the SourcemapFetcher? I'm thinking we should do this instead:

  • In MetadataESFetcher.startBackgroundSync: remove the ping, and change the code to close(s.init) after the first successful s.sync. It should not be closed immediately after the first attempt. Maybe remove the syncTimeout.
  • In SourcemapFetcher.Fetch, add an overall timeout (5s?) which applies to both waiting for the metadata fetcher to be initialised, and subsequently calling the backend fetcher's Fetch method.

That way the metadata fetcher will continue trying to sync in the background, and SourcemapFetcher.Fetch can avoid blocking without affecting the synchronisation.
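
A rough sketch of what the background sync could look like under this proposal (names like s.init, s.sync, s.logger and the sync interval are assumptions based on the snippets in this thread, not the actual implementation):

// Hypothetical sketch: keep syncing in the background and only close the
// init channel after the first *successful* sync. No initial ping, no 1s
// timeout; failures are logged and retried on the next tick.
func (s *MetadataESFetcher) startBackgroundSync(ctx context.Context, interval time.Duration) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()

		initialized := false
		for {
			if err := s.sync(ctx); err != nil {
				s.logger.Errorf("failed to sync sourcemap metadata: %v", err)
			} else if !initialized {
				initialized = true
				close(s.init) // first successful sync marks the fetcher as ready
			}

			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
			}
		}
	}()
}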


kruskall commented Feb 24, 2023

If the new Elasticsearch sourcemap fetcher fails to initialise, then it sets an error that is never cleared, and all future fetch attempts will fail with that error. Temporary issues, such as Elasticsearch being unavailable during startup, will render the fetcher permanently broken.

I think I missed resetting the channel here:

if err := s.sync(ctx); err != nil {

An error during init does not stop the goroutine, and the metadata fetcher keeps querying ES in the background periodically. However, the errored state is never updated; we should reset initErr to nil once a sync succeeds.
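
For example, something along these lines in the background sync loop (a minimal sketch; the mutex name is an assumption, the point is only that a successful sync clears the earlier error):

if err := s.sync(ctx); err != nil {
	s.mu.Lock()
	s.initErr = err
	s.mu.Unlock()
	s.logger.Errorf("failed to sync sourcemap metadata: %v", err)
} else {
	s.mu.Lock()
	s.initErr = nil // reset the errored state after a successful sync
	s.mu.Unlock()
}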

Moreover, initialisation involves making a successful request to Elasticsearch within 1 second, which may not be a sufficient amount of time when crossing a WAN:

Good point!
I did something similar in the chained fetcher to add a 500ms context to fetch from the next fetcher:

ctx, cancel = context.WithTimeout(context.Background(), 500*time.Millisecond)

To expand on the 1s context: there is a system test that should explain the issue pretty well: https://github.com/elastic/apm-server/blob/main/systemtest/sourcemap_test.go#L172-L176
When ES is unreachable the request will hang and the context will expire, preventing the kibana fallback from working.
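
For reference, the chained-fetcher logic is roughly this shape (illustrative only; the ChainedFetcher type, the Fetch signature and the fetchers field are assumptions, not the actual apm-server API):

// Illustrative sketch: bound each backend with its own short timeout so a
// hanging Elasticsearch request cannot prevent falling back to the kibana
// fetcher further down the chain.
func (c ChainedFetcher) Fetch(ctx context.Context, name, version, path string) (*sourcemap.Consumer, error) {
	var lastErr error
	for _, fetcher := range c.fetchers {
		// Mirrors the 500ms bound quoted above so one slow backend cannot
		// consume the whole request budget.
		fetchCtx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
		consumer, err := fetcher.Fetch(fetchCtx, name, version, path)
		cancel()
		if err == nil {
			return consumer, nil
		}
		lastErr = err // fall through to the next fetcher in the chain
	}
	return nil, lastErr
}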

The current implementation falls back to the kibana fetcher in the following cases:

  • ES is not configured/available
  • the sourcemap index is missing
  • permission/auth errors when querying the sourcemaps

The reasoning behind keeping the fallback available, rather than relying on the ES fetcher alone, was to avoid breaking deployments and to provide an upgrade path for people moving from the fleet fetcher.

Excluding the old ES fetcher, which is incompatible with the new format and has been removed, here's what could happen:

  • deployment is using the fleet fetcher and kibana is not configured: upgrading would break, so we should try the es fetcher (unlikely to work since it requires es being configured and a key with new privileges for the sourcemap index).
  • deployment is using the fleet fetcher and kibana is configured: upgrading should work because the kibana configuration is the same and we have the kibana fetcher as fallback.
  • deployment is using the kibana fetcher: upgrading should work because the kibana configuration is the same and we have the kibana fetcher as fallback.

NOTE: the above does not apply to ESS, where the API key is injected automatically and we take care of updating the configuration.


axw commented Feb 24, 2023

OK, I think updating the error to nil would work.

Would this work?

  • Remove the 1s timeout in the metadata fetcher, just keep trying to sync in a loop. Until the first successful sync, the fetcher would be in an unavailable state (i.e. Fetch would return errFetcherUnvailable)
  • Do not block at all if the metadata fetcher hasn't yet successfully sync'd

WDYT? It does mean that source mapping wouldn't work during process startup, but IMO that's an acceptable tradeoff for the simplicity.
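
Something like this, perhaps (sketch only; errFetcherUnvailable and the field names follow the earlier snippets, and the helper name is hypothetical):

// Non-blocking readiness check: until the first successful sync has closed
// s.init, Fetch simply reports the fetcher as unavailable instead of
// waiting on a timeout.
func (s *MetadataESFetcher) ready() error {
	select {
	case <-s.init:
		return nil
	default:
		return errFetcherUnvailable
	}
}

SourcemapFetcher.Fetch could then treat that error like any other fetcher failure and fall through to the kibana fallback, while the metadata fetcher keeps syncing in the background.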

@simitt simitt added this to the 8.7 milestone Feb 24, 2023

simitt commented Feb 24, 2023

@kruskall could you pick this up, please?

@kruskall

Sorry for the late response, the final PR took a while. 🙇

OK, I think updating the error to nil would work.

Should be fixed by #10360

Remove the 1s timeout in the metadata fetcher, just keep trying to sync in a loop. Until the first successful sync, the fetcher would be in an unavailable state (i.e. Fetch would return errFetcherUnvailable)
Do not block at all if the metadata fetcher hasn't yet successfully sync'd

Should be fixed by #10367

Hardcoded timeouts have been removed and the sourcemap_fetcher no longer blocks.

WDYT? It does mean that source mapping wouldn't work during process startup, but IMO that's an acceptable tradeoff for the simplicity.

I agree the tradeoff is acceptable 👍


simitt commented Mar 8, 2023

@kruskall please close this when fixed.


kruskall commented Mar 8, 2023

All the PRs have been merged, I believe this is fixed.

cc @axw any concerns with closing this?


axw commented Mar 8, 2023

Nope, thanks for the fixes!

@axw axw closed this as completed Mar 8, 2023
@axw axw removed the test-plan label Mar 10, 2023