Fix Fleet Enrollment Handling for Containerized Agent #6568

pkoutsovasilis · 2025-01-22T01:49:59Z

What does this PR do?

This PR introduces a fix for containerized Fleet-managed Elastic Agents to handle scenarios where:

The Fleet url changes.
The Fleet enrollment token changes.
The agent is unenrolled from Kibana Fleet.

It achieves the above by enhancing the agent's logic inside the container cmd to:

Verify enrollment conditions against stored state and only re-enroll when necessary.
Validate API token validity with Fleet, ensuring correct authentication even when the configuration hasn't changed.

The PR also adds a Kubernetes integration tests to verify the proper behaviour of enrollment and re-enrollment under various scenarios, including:

Re-deployment of agent with an updated fleet enrollment token.
Re-deployment of agent that is unenrolled from Kibana Fleet with the same enrollment token.
Re-deployment of agent across an older version and this one that ensures that no re-enrollment happens.

Key changes include:

Introduction of shouldFleetEnroll to centralize logic for fleet enrollment decisions.
Creation of PBKDF2-based enrollment token hashing to save in the agent state and be able to track token changes securely.
Utilise the agent/{id}/acks path of Fleet server with empty events to check if the agent API token is still valid. More than happy to introduce a separate path just for this cause on Fleet server, although it will be the same the ACKs one with empty events (as of now at least)

You can easily see here in the CI run of the first commit in this PR that the Elastic Agent, before this PR, doesn't handle enrollment correctly and always resorts to using the Fleet token and URL stored in its state. This PR addresses that issue and ensures enrollment uses the correct configuration and token.

PS: the actual changes of this PR are this commit b6596d0 which is +305 -25 thus I consider this PR aligned with the team policies 🙂

Why is it important?

This fix ensures robust and predictable behavior for containerized Fleet-managed Elastic Agents and enhance user experience of managing elastic-agent in Kubernetes.

Checklist

I have read and understood the pull request guidelines of this project.
My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool
I have added an integration test or an E2E test

Disruptive User Impact

The changes introduced in this PR are non-disruptive. They improve fleet enrollment handling and maintain backward compatibility.

How to test this PR locally

mage integration:auth
PLATFORMS=linux/arm64 EXTERNAL=true SNAPSHOT=true PACKAGES=docker mage -v package 
INSTANCE_PROVISIONER=kind STACK_PROVISIONER=stateful K8S_VERSION=v1.31.1 SNAPSHOT=true mage integration:kubernetes

Related issues

Closes Elastic Agent doesn't update the enrollment token in Kubernetes Deployment statefile #3586

…nt is unenrolled

internal/pkg/agent/cmd/container.go

blakerouse · 2025-01-22T19:17:00Z

internal/pkg/agent/cmd/container.go

+		// this elastic-agent has not valid api token thus it should enroll
+		return true, nil
+	case err != nil:
+		return false, fmt.Errorf("failed to validate api token: %w", err)


A simple network error could cause this, should there be some retries in here?

I added a really simple retry logic, let me know if that's aligned with what you had on mind 🙂

@blakerouse any objections merging this PR?

I am happy with the added retry loop. Would have been a little cleaner to place it in its own function, but that is okay. Feel free to merge.

internal/pkg/agent/cmd/container.go

elasticmachine · 2025-01-23T08:47:53Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

…update

pchila

LGTM
Just a couple of nitpicks on the usage of mocks in tests but that's not a blocker for merging

internal/pkg/agent/cmd/container_test.go

…update

…calls

...ontainerised-agent-when-there-is-an-enrollement-token-change-or-the-agent-is-unenrolled.yaml

blakerouse · 2025-01-28T15:19:14Z

internal/pkg/agent/cmd/container.go

+		// this elastic-agent has not valid api token thus it should enroll
+		return true, nil
+	case err != nil:
+		return false, fmt.Errorf("failed to validate api token: %w", err)


I am happy with the added retry loop. Would have been a little cleaner to place it in its own function, but that is okay. Feel free to merge.

elastic-sonarqube · 2025-01-28T17:02:54Z

Quality Gate passed

Issues
2 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
73.9% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

* feat: add k8s integration test to check fleet enrollment * fix: container correct fleet enrollment when token changes or the agent is unenrolled * fix: update TestDiagnosticLocalConfig to include enrollment_token_hash * feat: add a simple retry logic while validate the stored agent api token * feat: add unit-test for shouldFleetEnroll * fix: improve unit-test explicitness and check for expected number of calls * fix: kind in changelog fragment * fix: split up ack-ing fleet in a separate function (cherry picked from commit 17814cc)

* feat: add k8s integration test to check fleet enrollment * fix: container correct fleet enrollment when token changes or the agent is unenrolled * fix: update TestDiagnosticLocalConfig to include enrollment_token_hash * feat: add a simple retry logic while validate the stored agent api token * feat: add unit-test for shouldFleetEnroll * fix: improve unit-test explicitness and check for expected number of calls * fix: kind in changelog fragment * fix: split up ack-ing fleet in a separate function (cherry picked from commit 17814cc) Co-authored-by: Panos Koutsovasilis <panos.koutsovasilis@elastic.co>

internal/pkg/agent/cmd/container.go

pkoutsovasilis added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team backport-8.x Automated backport to the 8.x branch with mergify labels Jan 22, 2025

pkoutsovasilis self-assigned this Jan 22, 2025

feat: add k8s integration test to check fleet enrollment

ab63682

pkoutsovasilis force-pushed the k8s/enrollment_token_update branch from f6d06ec to ab63682 Compare January 22, 2025 01:50

fix: container correct fleet enrollment when token changes or the age…

b6596d0

…nt is unenrolled

pkoutsovasilis changed the title ~~feat: add k8s integration test to check fleet enrollment~~ Fix Fleet Enrollment Handling for Containerized Agent Jan 22, 2025

fix: update TestDiagnosticLocalConfig to include enrollment_token_hash

e18b65a

blakerouse reviewed Jan 22, 2025

View reviewed changes

pkoutsovasilis force-pushed the k8s/enrollment_token_update branch 2 times, most recently from bd3acad to db8078b Compare January 23, 2025 06:39

pkoutsovasilis added 2 commits January 23, 2025 09:34

feat: add a simple retry logic while validate the stored agent api token

9506e8d

feat: add unit-test for shouldFleetEnroll

f7fd4cb

pkoutsovasilis force-pushed the k8s/enrollment_token_update branch from db8078b to f7fd4cb Compare January 23, 2025 07:34

pkoutsovasilis marked this pull request as ready for review January 23, 2025 08:47

pkoutsovasilis requested a review from a team as a code owner January 23, 2025 08:47

pkoutsovasilis requested review from andrzej-stencel and pchila January 23, 2025 08:47

Merge remote-tracking branch 'origin/main' into k8s/enrollment_token_…

7caaa26

…update

pchila approved these changes Jan 27, 2025

View reviewed changes

internal/pkg/agent/cmd/container_test.go Outdated Show resolved Hide resolved

internal/pkg/agent/cmd/container_test.go Outdated Show resolved Hide resolved

pkoutsovasilis added 2 commits January 28, 2025 07:31

Merge remote-tracking branch 'origin/main' into k8s/enrollment_token_…

09504d2

…update

fix: improve unit-test explicitness and check for expected number of …

6941dfc

…calls

andrzej-stencel reviewed Jan 28, 2025

View reviewed changes

...ontainerised-agent-when-there-is-an-enrollement-token-change-or-the-agent-is-unenrolled.yaml Outdated Show resolved Hide resolved

fix: kind in changelog fragment

5caae86

blakerouse approved these changes Jan 28, 2025

View reviewed changes

fix: split up ack-ing fleet in a separate function

a1beebe

pkoutsovasilis enabled auto-merge (squash) January 28, 2025 16:05

pkoutsovasilis merged commit 17814cc into elastic:main Jan 28, 2025
14 checks passed

mergify bot mentioned this pull request Jan 28, 2025

[8.x](backport #6568) Fix Fleet Enrollment Handling for Containerized Agent #6618

Merged

8 tasks

cmacknz reviewed Jan 29, 2025

View reviewed changes

internal/pkg/agent/cmd/container.go Show resolved Hide resolved

pkoutsovasilis mentioned this pull request Jan 30, 2025

[REQUEST]: Update documentation for fleet-managed agents that run inside a container (docker, k8s, etc.) elastic/ingest-docs#1662

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Fleet Enrollment Handling for Containerized Agent #6568

Fix Fleet Enrollment Handling for Containerized Agent #6568

pkoutsovasilis commented Jan 22, 2025 •

edited

Loading

blakerouse Jan 22, 2025

pkoutsovasilis Jan 23, 2025 •

edited

Loading

pkoutsovasilis Jan 28, 2025

blakerouse Jan 28, 2025

elasticmachine commented Jan 23, 2025

pchila left a comment

blakerouse Jan 28, 2025

elastic-sonarqube bot commented Jan 28, 2025

Fix Fleet Enrollment Handling for Containerized Agent #6568

Fix Fleet Enrollment Handling for Containerized Agent #6568

Conversation

pkoutsovasilis commented Jan 22, 2025 • edited Loading

What does this PR do?

Why is it important?

Checklist

Disruptive User Impact

How to test this PR locally

Related issues

blakerouse Jan 22, 2025

Choose a reason for hiding this comment

pkoutsovasilis Jan 23, 2025 • edited Loading

Choose a reason for hiding this comment

pkoutsovasilis Jan 28, 2025

Choose a reason for hiding this comment

blakerouse Jan 28, 2025

Choose a reason for hiding this comment

elasticmachine commented Jan 23, 2025

pchila left a comment

Choose a reason for hiding this comment

blakerouse Jan 28, 2025

Choose a reason for hiding this comment

elastic-sonarqube bot commented Jan 28, 2025

Quality Gate passed

pkoutsovasilis commented Jan 22, 2025 •

edited

Loading

pkoutsovasilis Jan 23, 2025 •

edited

Loading