Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spire-agent: re-attest without restarting #4991

Merged
merged 6 commits into from
May 21, 2024
Merged

Conversation

sorindumitru
Copy link
Contributor

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated?

Affected functionality
spire-agent behaviour in case of eviction or expired agent SVID

Description of change
When an agent is evicted it can re-attest to reconnect to spire-server but it currently needs to restart to do that. To avoid unavailability periods, which can lead to noticeable latency in workloads, reattest in process.

Copy link
Member

@amartinezfayo amartinezfayo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @sorindumitru for this contribution!

return fmt.Errorf("unexpected value type: %T", r.state.Value())
}

if state.Reattestable && !r.c.DisableReattestToRenew {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably separate this in two different conditions and return separate errors so it's easier to understand and debug what happened.

if state.Reattestable && !r.c.DisableReattestToRenew {
err = r.reattest(ctx)
} else {
return fmt.Errorf("attestation method is not re-attestable or re-attestation disabled")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No formatting is needed here.

@azdagron azdagron added this to the 1.10.0 milestone May 2, 2024
m.deleteSVID()
return err
}
goto restart
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of a goto, can you add a for before util.RunTasks?

for {
	err := util.RunTasks(ctx,
        ...	
}

@@ -204,9 +205,14 @@ func (m *manager) Run(ctx context.Context) error {
m.c.Log.Info("Cache manager stopped")
return nil
case nodeutil.ShouldAgentReattest(err):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: there is no unit test to verify this (the original ShouldAgentReattest),
so it is hard for me to ask you to add a unit test here...
in any case it is not a blocker but if you can add a unit test it will be great, if not we can add that later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll see if I can do something about this. The test doesn't use an attestor yet, is they some mock attestor available that can be used?

@@ -137,6 +138,29 @@ func (r *rotator) SetRotationFinishedHook(f func()) {
r.rotationFinishedHook = f
}

func (r *rotator) Reattest(ctx context.Context) (err error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add unit tests for this new function?

if !r.c.DisableReattestToRenew {
err = r.reattest(ctx)
} else {
return errors.New("re-attestation is disabled")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure about this error, since a user must DIsableReattesttion to get into this case...
Maybe it worth to have a single if?

if state.Reattestable && !r.c.DisableReattestToRenew {

and if that is the case this code is pretty much the same that is in rotateSVIDIfNeeded and we can expose that instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what i had before and @amartinezfayo asked me to change. Either way is fine with me.

rotateSVIDIfNeeded can't be used directly because that only rotates if the SVID is expired.

@@ -1,21 +0,0 @@
#!/bin/bash

log-debug "deleting agent..."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This IT is exercise:

  • evict agent
  • start again
  • evict again

So our current test case is to evict and verify that agent was re-atested, but agent never restarted, and keep alive with without timeout,
may we update this IT to verify that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some checks that the agents was able to re-attest. Note that I've had to make a bunch of changes in this test because it was unreliable on my machine. E.g. a server was brought up and then the test wouldn't wait for it to be available before trying to use it.

check-attested-agents

# spire-agent-a will re-attest but spire-agent-b won't because join_token implements trust on first use model.
AGENT_A_SPIFFE_ID_PATH="/spire/agent/x509pop/$(fingerprint conf/agent/agent.crt.pem)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this PATH at the start and use it to create AGENT_A_SPIFFE_ID?

@@ -78,16 +78,6 @@ docker-compose exec -T spire-server \
docker-compose exec -T spire-server \
/opt/spire/bin/spire-server agent evict -spiffeID "$agentID1" | grep "Agent evicted successfully" || fail-now "failed to evict agent 1"

# Verify agent list after evict
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: if I follow this... this is not longer required because attested agent was using a reattestable attestor,
and now we evict but agent recovers itself,
however, may we add a test case with a no reattestable node? so we can verify that agent was really reattested?
or catch some logs to verify we had a successful reattestation?

When an agent is evicted it can re-attest to reconnect to spire-server but it currently needs to restart to do that. To avoid unavailability periods, which can lead to latency in applications, reattest in process

Signed-off-by: Sorin Dumitru <sdumitru@bloomberg.net>
Signed-off-by: Sorin Dumitru <sdumitru@bloomberg.net>
Signed-off-by: Sorin Dumitru <sdumitru@bloomberg.net>
Signed-off-by: Sorin Dumitru <sdumitru@bloomberg.net>
they didn't seem to need to be removed

Signed-off-by: Sorin Dumitru <sdumitru@bloomberg.net>
@MarcosDY MarcosDY merged commit e33fb84 into spiffe:main May 21, 2024
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants