Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Fleet Enrollment Handling for Containerized Agent #6568

Merged

Conversation

pkoutsovasilis
Copy link
Contributor

@pkoutsovasilis pkoutsovasilis commented Jan 22, 2025

What does this PR do?

This PR introduces a fix for containerized Fleet-managed Elastic Agents to handle scenarios where:

  • The Fleet url changes.
  • The Fleet enrollment token changes.
  • The agent is unenrolled from Kibana Fleet.

It achieves the above by enhancing the agent's logic inside the container cmd to:

  • Verify enrollment conditions against stored state and only re-enroll when necessary.
  • Validate API token validity with Fleet, ensuring correct authentication even when the configuration hasn't changed.

The PR also adds a Kubernetes integration tests to verify the proper behaviour of enrollment and re-enrollment under various scenarios, including:

  • Re-deployment of agent with an updated fleet enrollment token.
  • Re-deployment of agent that is unenrolled from Kibana Fleet with the same enrollment token.
  • Re-deployment of agent across an older version and this one that ensures that no re-enrollment happens.

Key changes include:

  • Introduction of shouldFleetEnroll to centralize logic for fleet enrollment decisions.
  • Creation of PBKDF2-based enrollment token hashing to save in the agent state and be able to track token changes securely.
  • Utilise the agent/{id}/acks path of Fleet server with empty events to check if the agent API token is still valid. More than happy to introduce a separate path just for this cause on Fleet server, although it will be the same the ACKs one with empty events (as of now at least)

You can easily see here in the CI run of the first commit in this PR that the Elastic Agent, before this PR, doesn't handle enrollment correctly and always resorts to using the Fleet token and URL stored in its state. This PR addresses that issue and ensures enrollment uses the correct configuration and token.

PS: the actual changes of this PR are this commit b6596d0 which is +305 -25 thus I consider this PR aligned with the team policies 🙂

Why is it important?

This fix ensures robust and predictable behavior for containerized Fleet-managed Elastic Agents and enhance user experience of managing elastic-agent in Kubernetes.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

The changes introduced in this PR are non-disruptive. They improve fleet enrollment handling and maintain backward compatibility.

How to test this PR locally

mage integration:auth
PLATFORMS=linux/arm64 EXTERNAL=true SNAPSHOT=true PACKAGES=docker mage -v package 
INSTANCE_PROVISIONER=kind STACK_PROVISIONER=stateful K8S_VERSION=v1.31.1 SNAPSHOT=true mage integration:kubernetes

Related issues

@pkoutsovasilis pkoutsovasilis added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team backport-8.x Automated backport to the 8.x branch with mergify labels Jan 22, 2025
@pkoutsovasilis pkoutsovasilis self-assigned this Jan 22, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the k8s/enrollment_token_update branch from f6d06ec to ab63682 Compare January 22, 2025 01:50
@pkoutsovasilis pkoutsovasilis changed the title feat: add k8s integration test to check fleet enrollment Fix Fleet Enrollment Handling for Containerized Agent Jan 22, 2025
// this elastic-agent has not valid api token thus it should enroll
return true, nil
case err != nil:
return false, fmt.Errorf("failed to validate api token: %w", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A simple network error could cause this, should there be some retries in here?

Copy link
Contributor Author

@pkoutsovasilis pkoutsovasilis Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a really simple retry logic, let me know if that's aligned with what you had on mind 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blakerouse any objections merging this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy with the added retry loop. Would have been a little cleaner to place it in its own function, but that is okay. Feel free to merge.

@pkoutsovasilis pkoutsovasilis force-pushed the k8s/enrollment_token_update branch 2 times, most recently from bd3acad to db8078b Compare January 23, 2025 06:39
@pkoutsovasilis pkoutsovasilis force-pushed the k8s/enrollment_token_update branch from db8078b to f7fd4cb Compare January 23, 2025 07:34
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review January 23, 2025 08:47
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner January 23, 2025 08:47
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Just a couple of nitpicks on the usage of mocks in tests but that's not a blocker for merging

// this elastic-agent has not valid api token thus it should enroll
return true, nil
case err != nil:
return false, fmt.Errorf("failed to validate api token: %w", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy with the added retry loop. Would have been a little cleaner to place it in its own function, but that is okay. Feel free to merge.

@pkoutsovasilis pkoutsovasilis enabled auto-merge (squash) January 28, 2025 16:05
@pkoutsovasilis pkoutsovasilis merged commit 17814cc into elastic:main Jan 28, 2025
14 checks passed
mergify bot pushed a commit that referenced this pull request Jan 28, 2025
* feat: add k8s integration test to check fleet enrollment

* fix: container correct fleet enrollment when token changes or the agent is unenrolled

* fix: update TestDiagnosticLocalConfig to include enrollment_token_hash

* feat: add a simple retry logic while validate the stored agent api token

* feat: add unit-test for shouldFleetEnroll

* fix: improve unit-test explicitness and check for expected number of calls

* fix: kind in changelog fragment

* fix: split up ack-ing fleet in a separate function

(cherry picked from commit 17814cc)
pkoutsovasilis added a commit that referenced this pull request Jan 28, 2025
* feat: add k8s integration test to check fleet enrollment

* fix: container correct fleet enrollment when token changes or the agent is unenrolled

* fix: update TestDiagnosticLocalConfig to include enrollment_token_hash

* feat: add a simple retry logic while validate the stored agent api token

* feat: add unit-test for shouldFleetEnroll

* fix: improve unit-test explicitness and check for expected number of calls

* fix: kind in changelog fragment

* fix: split up ack-ing fleet in a separate function

(cherry picked from commit 17814cc)

Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.x Automated backport to the 8.x branch with mergify bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Elastic Agent doesn't update the enrollment token in Kubernetes Deployment statefile
6 participants