Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retry_bootstrap: if true, the agent retries bootstrap with backoff #4597

Merged

Conversation

SilvaMatteus
Copy link
Contributor

@SilvaMatteus SilvaMatteus commented Oct 21, 2023

This commit introduces the retry_bootstrap configurable. If true, the agent retries bootstrap with backoff.

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated?

Affected functionality
Connection to the server for attestation and manager initialization.

Description of change
This commit introduces the retry_bootstrap configurable. If true, the agent retries bootstrap with backoff.

Which issue this PR fixes

This PR adresses #4359.

Copy link
Member

@amartinezfayo amartinezfayo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @SilvaMatteus for this contribution.

I don't think that these changes will make any difference during the bootstrapping of agents.
The proposed changes in this PR set a configured timeout to the context when dialing to the server for many different operations. There are sync operations that can fail (syncing of entries and agent SVID rotation), which happen at scheduled intervals, so the retry logic is already in place to handle the failures. For that reason, we set the grpc.FailOnNonTempDialError(true) option so we fail the connection during the initial dial on non-temporary dial errors and we just wait for the next sync. Due to that, the changes in this PR don't make any difference during the initial dialing, which is what we want to change in order to fix #4341. If we wouldn't have the FailOnNonTempDialError option set, we would be piling up retries in each sync. Also, note that these changes would also affect when connecting to the server when dialing the server for calling all the RPCs for Agent, Bundle, Entry and SVID APIs.

The agent can crash at startup when failing to connect to the server if 1) needs to attest and can't attest, or 2) fails to synchronize entries during the initialization of the agent cache manager.
In order to fix #4341, I think that we should scope the changes only to retry on those operations when the agent is starting.

In the case of 2), the retry logic should probably retry the synchronize operation during initialization.

What we are trying to improve here is the resiliency when connecting to the server on agent startup, so we should check for the error code and not retry if the error is due to a permission denied situation (which means that it could connect to the server).

Also, it's probably better to have a boolean setting that enables this retry with backoff behavior during bootstrapping, instead of a configurable timeout that could be difficult to tune. Maybe we could come up with a timeout that could work in most cases? Maybe something between one and five minutes?

What do you think?

pkg/agent/client/dial.go Outdated Show resolved Hide resolved
pkg/agent/client/dial.go Outdated Show resolved Hide resolved
@SilvaMatteus
Copy link
Contributor Author

Thank you @SilvaMatteus for this contribution.

I don't think that these changes will make any difference during the bootstrapping of agents. The proposed changes in this PR set a configured timeout to the context when dialing to the server for many different operations. There are sync operations that can fail (syncing of entries and agent SVID rotation), which happen at scheduled intervals, so the retry logic is already in place to handle the failures. For that reason, we set the grpc.FailOnNonTempDialError(true) option so we fail the connection during the initial dial on non-temporary dial errors and we just wait for the next sync. Due to that, the changes in this PR don't make any difference during the initial dialing, which is what we want to change in order to fix #4341. If we wouldn't have the FailOnNonTempDialError option set, we would be piling up retries in each sync. Also, note that these changes would also affect when connecting to the server when dialing the server for calling all the RPCs for Agent, Bundle, Entry and SVID APIs.

The agent can crash at startup when failing to connect to the server if 1) needs to attest and can't attest, or 2) fails to synchronize entries during the initialization of the agent cache manager. In order to fix #4341, I think that we should scope the changes only to retry on those operations when the agent is starting.

In the case of 2), the retry logic should probably retry the synchronize operation during initialization.

What we are trying to improve here is the resiliency when connecting to the server on agent startup, so we should check for the error code and not retry if the error is due to a permission denied situation (which means that it could connect to the server).

Also, it's probably better to have a boolean setting that enables this retry with backoff behavior during bootstrapping, instead of a configurable timeout that could be difficult to tune. Maybe we could come up with a timeout that could work in most cases? Maybe something between one and five minutes?

What do you think?

Hi @amartinezfayo,

I was in wrong direction. Thanks for the detailed clarification. I will update the PR soon.

@SilvaMatteus SilvaMatteus force-pushed the backoff-for-agent-to-server-conn branch from 8bebdf2 to 973a345 Compare January 6, 2024 23:19
@SilvaMatteus SilvaMatteus marked this pull request as draft January 6, 2024 23:19
Copy link
Member

@amartinezfayo amartinezfayo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @SilvaMatteus, thank you for updating the PR with the new approach. I left some comments.

go.mod Outdated Show resolved Hide resolved
pkg/agent/agent.go Outdated Show resolved Hide resolved
@SilvaMatteus SilvaMatteus force-pushed the backoff-for-agent-to-server-conn branch 2 times, most recently from 78c07f0 to ca2927b Compare January 27, 2024 23:36
pkg/agent/agent.go Outdated Show resolved Hide resolved
pkg/agent/agent.go Outdated Show resolved Hide resolved
Copy link
Member

@amartinezfayo amartinezfayo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @SilvaMatteus, thank you for updating the PR. I left some suggestions.

pkg/agent/agent.go Outdated Show resolved Hide resolved
pkg/agent/agent.go Outdated Show resolved Hide resolved
pkg/agent/agent.go Show resolved Hide resolved
pkg/agent/agent.go Outdated Show resolved Hide resolved
pkg/agent/agent.go Show resolved Hide resolved
@SilvaMatteus SilvaMatteus force-pushed the backoff-for-agent-to-server-conn branch 2 times, most recently from 645229a to 52c80da Compare February 3, 2024 18:16
@SilvaMatteus SilvaMatteus changed the title Add node_attestation_retry_timeout configurable for the agent and backoff for the attestation conn Add retry_bootstrap: if true, the agent retries bootstrap with backoff Feb 3, 2024
@SilvaMatteus SilvaMatteus marked this pull request as ready for review February 3, 2024 18:19
@SilvaMatteus SilvaMatteus force-pushed the backoff-for-agent-to-server-conn branch from 52c80da to 5a6282b Compare February 3, 2024 18:58
Copy link
Member

@amartinezfayo amartinezfayo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @SilvaMatteus for addressing the comments. I think that we are pretty close here. I've left a few more comments.

pkg/agent/agent.go Outdated Show resolved Hide resolved
pkg/agent/agent.go Outdated Show resolved Hide resolved
pkg/agent/agent.go Outdated Show resolved Hide resolved
Signed-off-by: matteus <silvamatteus@lsd.ufcg.edu.br>
@SilvaMatteus SilvaMatteus force-pushed the backoff-for-agent-to-server-conn branch from 5a6282b to e6f4ecc Compare February 20, 2024 09:54
Copy link
Member

@amartinezfayo amartinezfayo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @SilvaMatteus!

@amartinezfayo amartinezfayo added this to the 1.9.1 milestone Feb 20, 2024
@amartinezfayo amartinezfayo merged commit e88215b into spiffe:main Feb 20, 2024
32 checks passed
sriyer pushed a commit to spire-vault/spire that referenced this pull request Feb 23, 2024
Signed-off-by: matteus <silvamatteus@lsd.ufcg.edu.br>
rushi47 pushed a commit to rushi47/spire that referenced this pull request Apr 11, 2024
Signed-off-by: matteus <silvamatteus@lsd.ufcg.edu.br>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants