
sourcemap: failing to ping ES within 1s in metadata fetcher renders fetcher permanently broken #10338

Closed
axw opened this issue Feb 23, 2023 · 7 comments


axw commented Feb 23, 2023

If the new Elasticsearch sourcemap fetcher fails to initialise, then it sets an error that is never cleared, and all future fetch attempts will fail with that error. Temporary issues, such as Elasticsearch being unavailable during startup, will render the fetcher permanently broken.

Moreover, initialisation involves making a successful request to Elasticsearch within 1 second, which may not be a sufficient amount of time when crossing a WAN:

ctx, cancel := context.WithTimeout(parent, 1*time.Second)
err := s.ping(ctx)
cancel()
if err != nil {
	// it is fine to not lock here since err will not access
	// initErr until the init channel is closed.
	s.initErr = fmt.Errorf("failed to ping es cluster: %w: %v", errFetcherUnvailable, err)
	s.logger.Error(s.initErr)
} else {

Do we need to check the initErr in the SourcemapFetcher? I'm thinking we should do this instead:

  • In MetadataESFetcher.startBackgroundSync: remove the ping, and change the code to close(s.init) after the first successful s.sync. It should not be closed immediately after the first attempt. Maybe remove the syncTimeout.
  • In SourcemapFetcher.Fetch, add an overall timeout (5s?) which applies to both waiting for the metadata fetcher to be initialised, and subsequently calling the backend fetcher's Fetch method.

That way the metadata fetcher will continue trying to sync in the background, and SourcemapFetcher.Fetch can avoid blocking without affecting the synchronisation.
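
A rough sketch of what the background sync could look like under this proposal (names like s.init, s.sync, s.logger and the sync interval are assumptions based on the snippets in this thread, not the actual implementation):

// Hypothetical sketch: keep syncing in the background and only close the
// init channel after the first *successful* sync. No initial ping, no 1s
// timeout; failures are logged and retried on the next tick.
func (s *MetadataESFetcher) startBackgroundSync(ctx context.Context, interval time.Duration) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()

		initialized := false
		for {
			if err := s.sync(ctx); err != nil {
				s.logger.Errorf("failed to sync sourcemap metadata: %v", err)
			} else if !initialized {
				initialized = true
				close(s.init) // first successful sync marks the fetcher as ready
			}

			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
			}
		}
	}()
}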


kruskall commented Feb 24, 2023

If the new Elasticsearch sourcemap fetcher fails to initialise, then it sets an error that is never cleared, and all future fetch attempts will fail with that error. Temporary issues, such as Elasticsearch being unavailable during startup, will render the fetcher permanently broken.

I think I missed resetting the channel here:

if err := s.sync(ctx); err != nil {

An error during init does not stop the goroutine, and the metadata fetcher keeps querying ES in the background periodically. However, the errored state is never updated; we should reset initErr to nil once a sync succeeds.
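
For example, something along these lines in the background sync loop (a minimal sketch; the mutex name is an assumption, the point is only that a successful sync clears the earlier error):

if err := s.sync(ctx); err != nil {
	s.mu.Lock()
	s.initErr = err
	s.mu.Unlock()
	s.logger.Errorf("failed to sync sourcemap metadata: %v", err)
} else {
	s.mu.Lock()
	s.initErr = nil // reset the errored state after a successful sync
	s.mu.Unlock()
}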

Moreover, initialisation involves making a successful request to Elasticsearch within 1 second, which may not be a sufficient amount of time when crossing a WAN:

Good point!
I did something similar in the chained fetcher to add a 500ms context to fetch from the next fetcher:

ctx, cancel = context.WithTimeout(context.Background(), 500*time.Millisecond)

To expand on the 1s context: there is a system test that should explain the issue pretty well: https://github.com/elastic/apm-server/blob/main/systemtest/sourcemap_test.go#L172-L176
When ES is unreachable the request will hang and the context will expire, preventing the kibana fallback from working.
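
For reference, the chained-fetcher logic is roughly this shape (illustrative only; the ChainedFetcher type, the Fetch signature and the fetchers field are assumptions, not the actual apm-server API):

// Illustrative sketch: bound each backend with its own short timeout so a
// hanging Elasticsearch request cannot prevent falling back to the kibana
// fetcher further down the chain.
func (c ChainedFetcher) Fetch(ctx context.Context, name, version, path string) (*sourcemap.Consumer, error) {
	var lastErr error
	for _, fetcher := range c.fetchers {
		// Mirrors the 500ms bound quoted above so one slow backend cannot
		// consume the whole request budget.
		fetchCtx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
		consumer, err := fetcher.Fetch(fetchCtx, name, version, path)
		cancel()
		if err == nil {
			return consumer, nil
		}
		lastErr = err // fall through to the next fetcher in the chain
	}
	return nil, lastErr
}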

The current implementation falls back to the kibana fetcher in the following cases:

  • ES is not configured/available
  • the sourcemap index is missing
  • permission/auth errors when querying the sourcemaps

The reasoning behind keeping the fallback available, rather than relying on the ES fetcher alone, was to avoid breaking deployments and to provide an upgrade path for people moving from the fleet fetcher.

Excluding the old ES fetcher, which is incompatible with the new format and has been removed, here's what could happen:

  • deployment is using the fleet fetcher and kibana is not configured: upgrading would break, so we should try the es fetcher (unlikely to work since it requires es being configured and a key with new privileges for the sourcemap index).
  • deployment is using the fleet fetcher and kibana is configured: upgrading should work because the kibana configuration is the same and we have the kibana fetcher as fallback.
  • deployment is using the kibana fetcher: upgrading should work because the kibana configuration is the same and we have the kibana fetcher as fallback.

NOTE: the above does not apply to ESS, where the API key is injected automatically and we take care of updating the configuration.


axw commented Feb 24, 2023

OK, I think updating the error to nil would work.

Would this work?

  • Remove the 1s timeout in the metadata fetcher, just keep trying to sync in a loop. Until the first successful sync, the fetcher would be in an unavailable state (i.e. Fetch would return errFetcherUnvailable)
  • Do not block at all if the metadata fetcher hasn't yet successfully sync'd

WDYT? It does mean that source mapping wouldn't work during process startup, but IMO that's an acceptable tradeoff for the simplicity.
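
Something like this, perhaps (sketch only; errFetcherUnvailable and the field names follow the earlier snippets, and the helper name is hypothetical):

// Non-blocking readiness check: until the first successful sync has closed
// s.init, Fetch simply reports the fetcher as unavailable instead of
// waiting on a timeout.
func (s *MetadataESFetcher) ready() error {
	select {
	case <-s.init:
		return nil
	default:
		return errFetcherUnvailable
	}
}

SourcemapFetcher.Fetch could then treat that error like any other fetcher failure and fall through to the kibana fallback, while the metadata fetcher keeps syncing in the background.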

@simitt simitt added this to the 8.7 milestone Feb 24, 2023

simitt commented Feb 24, 2023

@kruskall could you pick this up, please?

@kruskall

Sorry for the late response, the final PR took a while. 🙇

OK, I think updating the error to nil would work.

Should be fixed by #10360

Remove the 1s timeout in the metadata fetcher, just keep trying to sync in a loop. Until the first successful sync, the fetcher would be in an unavailable state (i.e. Fetch would return errFetcherUnvailable)
Do not block at all if the metadata fetcher hasn't yet successfully sync'd

Should be fixed by #10367

Hardcoded timeouts have been removed and the sourcemap_fetcher no longer blocks.

WDYT? It does mean that source mapping wouldn't work during process startup, but IMO that's an acceptable tradeoff for the simplicity.

I agree the tradeoff is acceptable 👍


simitt commented Mar 8, 2023

@kruskall please close this when fixed.


kruskall commented Mar 8, 2023

All the PRs have been merged, I believe this is fixed.

cc @axw any concerns with closing this?


axw commented Mar 8, 2023

Nope, thanks for the fixes!

@axw axw closed this as completed Mar 8, 2023
@axw axw removed the test-plan label Mar 10, 2023