[8.x](backport #6619) enhancement(5423): added logic to replaces scheduler with long-wait scheduler in case of exceeded unauth response limit #6859

Open
wants to merge 1 commit into
base: 8.x

mergify[bot]
Contributor

@mergify mergify bot commented Feb 13, 2025

  • Enhancement

What does this PR do?

Removes the forced unenroll from the fleet gateway and adds logic in the gateway to switch the scheduler used for checkins. If the unauthorized response limit is exceeded, the scheduler is replaced with one that has a long wait duration. When the gateway receives a successful response, it switches back to the regular scheduler with the shorter wait duration.
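
The switching behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `durationSetter` interface, the `gateway` struct, the `observe` method, and the wait durations are all hypothetical names and values. Only the idea of a `SetDuration` function comes from the commit list below, and its real signature may differ. The sketch also assumes that errors other than unauthorized leave the counter untouched.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical wait durations; the real values live in the agent's configuration.
const (
	regularWait = 30 * time.Second
	longWait    = 1 * time.Hour
)

// durationSetter is a hypothetical stand-in for the gateway's scheduler.
type durationSetter interface {
	SetDuration(d time.Duration)
}

// fakeScheduler records the last duration it was given.
type fakeScheduler struct{ d time.Duration }

func (s *fakeScheduler) SetDuration(d time.Duration) { s.d = d }

type checkinResult int

const (
	checkinOK checkinResult = iota
	checkinUnauthorized
	checkinOtherError
)

// gateway tracks consecutive unauthorized responses (illustrative only).
type gateway struct {
	sched       durationSetter
	unauthLimit int
	unauthCount int
}

func newGateway(s durationSetter, limit int) *gateway {
	s.SetDuration(regularWait)
	return &gateway{sched: s, unauthLimit: limit}
}

// observe updates the scheduler based on the outcome of one checkin.
func (g *gateway) observe(r checkinResult) {
	switch r {
	case checkinUnauthorized:
		g.unauthCount++
		if g.unauthCount > g.unauthLimit {
			// Limit exceeded: slow down instead of unenrolling.
			g.sched.SetDuration(longWait)
		}
	case checkinOK:
		// Success: reset the counter and return to the regular cadence.
		g.unauthCount = 0
		g.sched.SetDuration(regularWait)
	}
}

func main() {
	s := &fakeScheduler{}
	g := newGateway(s, 2)
	for _, r := range []checkinResult{checkinUnauthorized, checkinUnauthorized, checkinUnauthorized, checkinOK} {
		g.observe(r)
		fmt.Println(s.d)
	}
}
```

The point of keeping a single scheduler and changing its duration, rather than unenrolling, is that the agent keeps retrying and can recover on its own once a checkin succeeds again.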

Why is it important?

Currently the agent unenrolls after 7 unauthorized error responses. This can cause problems in disaster recovery scenarios, where users may have to intervene manually.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

None

How to test this PR locally

  • Create ESS deployment
  • Build the agent locally
  • Enroll the agent
  • In dev tools find the access token and delete it
GET /.security-7/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "name": "AGENT ID"
          }
        }
      ]
    }
  }
}
DELETE /_security/api_key
{
  "ids": ["KEY ID"]
}
  • Follow the agent logs: sudo elastic-agent logs -f
  • After a while you will see the "retrieved an invalid api key error '10' times. will use long scheduler" error message in the logs

Because of the backoff algorithm used, this test can take a long time. To see immediate results, comment out the following code block:

			if !bo.Wait() {
				if ctx.Err() != nil {
					// if the context is cancelled, break out of the loop
					break
				}

				// This should not really happen, but just in-case this error is used to show that
				// something strange occurred and we want to log it and report it.
				err := errors.New(
					"checkin retry loop was stopped",
					errors.TypeNetwork,
					errors.M(errors.MetaKeyURI, f.client.URI()),
				)

				f.log.Error(err)
				f.errCh <- err
				return nil, err
			}
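
To get a rough feel for why waiting on the backoff makes the test slow, the sketch below sums the delays of a capped exponential backoff. The doubling strategy, the initial delay, and the cap are assumptions chosen for illustration; they are not taken from the agent's backoff implementation, and jitter is ignored. `totalBackoffWait` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// totalBackoffWait sums the delays of a capped exponential backoff
// (doubling on every retry) over n retries.
func totalBackoffWait(initial, maxDelay time.Duration, n int) time.Duration {
	var total time.Duration
	delay := initial
	for i := 0; i < n; i++ {
		total += delay
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return total
}

func main() {
	// Hypothetical values: start at 1 minute, cap at 10 minutes, 10 retries.
	fmt.Println(totalBackoffWait(time.Minute, 10*time.Minute, 10)) // prints 1h15m0s
}
```

With these assumed values, reaching ten failed checkins already takes over an hour, which is why skipping the wait is convenient for local testing.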

Related issues


This is an automatic backport of pull request #6619 done by [Mergify](https://mergify.com).

…cheduler in case of exceeded unauth response limit (#6619)

* enhancement(5423): added logic to replaces scheduler with long-wait scheduler in case of exceeded unauth response limit

* enhancement(5423): removed default case from type switch, added unit tests

* enhancement(5423): added blackbox functional tests for gateway Run

* enhancement(5423): added changelog

* enhancement(5423): remove tryReplaceScheduler, update tests

* enhancement(5423): added SetDuration function, added mock scheduler to tests, simplified scheduler usage

(cherry picked from commit 99696b8)
@mergify mergify bot requested a review from a team as a code owner February 13, 2025 15:50
@mergify mergify bot added the backport label Feb 13, 2025
@mergify mergify bot requested review from blakerouse and pkoutsovasilis and removed request for a team February 13, 2025 15:50
@github-actions github-actions bot added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Feb 13, 2025
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Labels
backport Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
2 participants