-
Notifications
You must be signed in to change notification settings - Fork 486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add retry_bootstrap
: if true, the agent retries bootstrap with backoff
#4597
Add retry_bootstrap
: if true, the agent retries bootstrap with backoff
#4597
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @SilvaMatteus for this contribution.
I don't think that these changes will make any difference during the bootstrapping of agents.
The proposed changes in this PR set a configured timeout to the context when dialing to the server for many different operations. There are sync operations that can fail (syncing of entries and agent SVID rotation), which happen at scheduled intervals, so the retry logic is already in place to handle the failures. For that reason, we set the grpc.FailOnNonTempDialError(true)
option so we fail the connection during the initial dial on non-temporary dial errors and we just wait for the next sync. Due to that, the changes in this PR don't make any difference during the initial dialing, which is what we want to change in order to fix #4341. If we wouldn't have the FailOnNonTempDialError
option set, we would be piling up retries in each sync. Also, note that these changes would also affect when connecting to the server when dialing the server for calling all the RPCs for Agent, Bundle, Entry and SVID APIs.
The agent can crash at startup when failing to connect to the server if 1) needs to attest and can't attest, or 2) fails to synchronize entries during the initialization of the agent cache manager.
In order to fix #4341, I think that we should scope the changes only to retry on those operations when the agent is starting.
In the case of 2), the retry logic should probably retry the synchronize operation during initialization.
What we are trying to improve here is the resiliency when connecting to the server on agent startup, so we should check for the error code and not retry if the error is due to a permission denied situation (which means that it could connect to the server).
Also, it's probably better to have a boolean setting that enables this retry with backoff behavior during bootstrapping, instead of a configurable timeout that could be difficult to tune. Maybe we could come up with a timeout that could work in most cases? Maybe something between one and five minutes?
What do you think?
Hi @amartinezfayo, I was in wrong direction. Thanks for the detailed clarification. I will update the PR soon. |
8bebdf2
to
973a345
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @SilvaMatteus, thank you for updating the PR with the new approach. I left some comments.
78c07f0
to
ca2927b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @SilvaMatteus, thank you for updating the PR. I left some suggestions.
645229a
to
52c80da
Compare
retry_bootstrap
: if true, the agent retries bootstrap with backoff
52c80da
to
5a6282b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @SilvaMatteus for addressing the comments. I think that we are pretty close here. I've left a few more comments.
Signed-off-by: matteus <silvamatteus@lsd.ufcg.edu.br>
5a6282b
to
e6f4ecc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @SilvaMatteus!
Signed-off-by: matteus <silvamatteus@lsd.ufcg.edu.br>
Signed-off-by: matteus <silvamatteus@lsd.ufcg.edu.br>
This commit introduces the
retry_bootstrap
configurable. If true, the agent retries bootstrap with backoff.Pull Request check list
Affected functionality
Connection to the server for attestation and manager initialization.
Description of change
This commit introduces the
retry_bootstrap
configurable. If true, the agent retries bootstrap with backoff.Which issue this PR fixes
This PR adresses #4359.