Remove automatic unenrollment after 7 Fleet authentication failures #5428

cmacknz · 2024-09-04T19:25:58Z

Relates to: Agent comes back online after unenroll due to too many authentication failures #5433

Today Elastic Agent unenrolls itself automatically after receiving 7 consecutive 401 responses from Fleet when checking in. This was done to stop agents that have been force-unenrolled (which revokes their API key) from checking in indefinitely until they are reinstalled.

elastic-agent/internal/pkg/agent/application/gateway/fleet/fleet_gateway.go

Lines 26 to 28 in 590c506

    
    // Max number of times an invalid API Key is checked
    const maxUnauthCounter int = 6

elastic-agent/internal/pkg/agent/application/gateway/fleet/fleet_gateway.go

Lines 360 to 363 in 590c506

    
    // shouldUnenroll checks if the max number of trying an invalid key is reached
    func (f *FleetGateway) shouldUnenroll() bool {
    	return f.unauthCounter > maxUnauthCounter
    }

elastic-agent/internal/pkg/agent/application/gateway/fleet/fleet_gateway.go

Lines 329 to 341 in 590c506

    
    resp, took, err := cmd.Execute(ctx, req)
    if isUnauth(err) {
    	f.unauthCounter++
    	if f.shouldUnenroll() {
    		f.log.Warnf("retrieved an invalid api key error '%d' times. Starting to unenroll the elastic agent.", f.unauthCounter)
    		return &fleetapi.CheckinResponse{
    			Actions: []fleetapi.Action{&fleetapi.ActionUnenroll{ActionID: "", ActionType: "UNENROLL", IsDetected: true}},
    		}, took, nil
    	}
    	return nil, took, err
    }

This prevents force-unenrolled agents from continuing to contact Fleet Server, but it is also an edge case that can be hit in disaster recovery situations. To eliminate the chance that users recovering their cluster need to manually intervene on machines, we should stop unenrolling and instead greatly increase the checkin interval.

The initial proposal is that instead of unenrolling, we should switch to checking in once per hour. A successful checkin must return the agent to its original checkin interval.
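A minimal sketch of what this proposal could look like, assuming a small helper that owns the unauthorized counter and decides how long to wait before the next checkin. The `checkinBackoff` type, its methods, and `defaultCheckinInterval` are hypothetical names and values chosen for illustration; they are not taken from the elastic-agent codebase. Only `maxUnauthCounter` and the one-hour long wait come from the text above. How non-401 errors should affect the counter is not specified in the proposal, so the sketch does not model them.

```go
package main

import "time"

// maxUnauthCounter mirrors the constant quoted from fleet_gateway.go above.
const maxUnauthCounter = 6

const (
	// defaultCheckinInterval is an assumed value used only for illustration.
	defaultCheckinInterval = 30 * time.Second
	// longWaitInterval is the "once per hour" interval from the proposal.
	longWaitInterval = time.Hour
)

// checkinBackoff is a hypothetical helper, not a type from elastic-agent.
// It tracks consecutive 401 responses and picks the next checkin interval.
type checkinBackoff struct {
	unauthCounter int
	normal        time.Duration
	longWait      time.Duration
}

func newCheckinBackoff(normal, longWait time.Duration) *checkinBackoff {
	return &checkinBackoff{normal: normal, longWait: longWait}
}

// OnUnauthorized records one 401 response and returns the interval to wait
// before the next checkin. Once the counter exceeds maxUnauthCounter, the
// agent stays enrolled but switches to the long wait.
func (b *checkinBackoff) OnUnauthorized() time.Duration {
	b.unauthCounter++
	if b.unauthCounter > maxUnauthCounter {
		return b.longWait
	}
	return b.normal
}

// OnSuccess records a successful checkin, resets the counter, and returns
// the original checkin interval.
func (b *checkinBackoff) OnSuccess() time.Duration {
	b.unauthCounter = 0
	return b.normal
}
```

With these assumptions, the first six 401 responses keep the normal interval, the seventh switches to the hourly interval, and a single successful checkin restores the original interval. This matches the existing `unauthCounter > maxUnauthCounter` threshold while never emitting an unenroll action.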

elasticmachine · 2024-09-04T19:26:00Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

cmacknz added the Team:Elastic-Agent-Control-Plane label Sep 4, 2024

cmacknz mentioned this issue Sep 5, 2024

Agent comes back online after unenroll due to too many authentication failures #5433

Closed

AndersonQ mentioned this issue Sep 6, 2024

Fix state store SetAction panic #5438

Merged

4 tasks

cmacknz mentioned this issue Sep 20, 2024

Return 503 Service Unavailable when unable to authenticate with Elasticsearch instead of 401 elastic/fleet-server#3929

Closed

jlind23 assigned kaanyalti Jan 20, 2025

kaanyalti mentioned this issue Jan 28, 2025

enhancement(5423): added logic to replaces scheduler with long-wait scheduler in case of exceeded unauth response limit #6619

Merged

5 tasks

kaanyalti closed this as completed in #6619 Feb 13, 2025

mergify bot mentioned this issue Feb 13, 2025

[8.x](backport #6619) enhancement(5423): added logic to replaces scheduler with long-wait scheduler in case of exceeded unauth response limit #6859

Merged

5 tasks
