-
Notifications
You must be signed in to change notification settings - Fork 527
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sourcemap: failing to ping ES within 1s in metadata fetcher renders fetcher permanently broken #10338
Comments
I think I missed resetting the channel here:
An error during init is not stopping the goroutine and the metadata fetcher is still fetching ES in the background periodically. The errored state is not updated, we should reset
Good point! apm-server/internal/sourcemap/chained.go Line 50 in 27f0419
To expand on the 1s context: there is a system test that should explain the issue pretty well: https://github.com/elastic/apm-server/blob/main/systemtest/sourcemap_test.go#L172-L176 The current implementation is falling back to the kibana fetcher in the following cases:
The reasoning behind the fallback being available and we not relying on the ES fetcher was to avoid breaking deployments and provide an upgrade path for people moving from the fleet fetcher. Excluding the old ES fetcher which is incompatible with the new format and has been removed, here's what could happen:
NOTE: the above does not apply to ESS because the apikey is injected automatically so we take care of updating the configuration. |
OK, I think updating the error to nil would work. Would this work?
WDYT? It does mean that source mapping wouldn't work during process startup, but IMO that's an acceptable tradeoff for the simplicity. |
@kruskall could you pick this up please. |
Sorry for the late response, the final PR took a while. 🙇
Should be fixed by #10360
Should be fixed by #10367 hardcoded timeouts have been removed and the sourcemap_fetcher is not blocking anymore
I agree the tradeoff is acceptable 👍 |
@kruskall please close this when fixed. |
All the PRs have been merged, I believe this is fixed. cc @axw any concern in closing this ? |
Nope, thanks for the fixes! |
If the new Elasticsearch sourcemap fetcher fails to initialise, then it sets an error that is never cleared, and all future attempts fetch will fail with that error. Temporary issues, such as Elasticsearch being unavailable during startup, will render the fetcher permanently broken.
Moreover, initialisation involves making a successful request to Elasticsearch within 1 second, which may not be a sufficient amount of time when crossing a WAN:
apm-server/internal/sourcemap/metadata_fetcher.go
Lines 104 to 113 in c30c3ec
Do we need to check the initErr in the SourcemapFetcher? I'm thinking we should do this instead:
MetadataESFetcher.startBackgroundSync
: remove theping
, and change the code toclose(s.init)
after the first successfuls.sync
. It should not be closed immediately after the first attempt. Maybe remove thesyncTimeout
.SourcemapFetcher.Fetch
, add an overall timeout (5s?) which applies to both waiting for the metadata fetcher to be initialised, and subsequently calling the backend fetcher's Fetch method.That way the metadata fetcher will continue trying to sync in the background, and
SourcemapFetcher.Fetch
can avoid blocking without affecting the synchronisation.The text was updated successfully, but these errors were encountered: