-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VAULT-28192 fix Agent and Proxy consuming large amounts of CPU for auto-auth self-healing #27518
Conversation
…to-auth self-healing
CI Results: |
Build Results: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
invalidTokenCh <- err | ||
} | ||
default: | ||
case err := <-ts.runner.ServerErrCh: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is a possible scenario -
- IncomingToken receives a new token at the same time template sends an error back to ServerErrCh
- We reauthenticate first by honoring the ServerErrCh select first. Now IncomingCh has two values in the channel
- We try using the first token in IncomingCh but there is an error. Another error is sent to ServerErrCh
- Again the SeverErrCh is honored first and we reauthenticate.
We are now stuck in a loop where we always honor the token one behind valid token.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that's a good point! I do think it's likely in this scenario that both tokens will be valid, but it's still not a great state to be in. I'll rework this to drain the incoming channel in the same place we drain the invalid token channel. I think that should prevent any looping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely understand why we had it the way we had it before though, but I do think this might be the best fix, and the only situation it would struggle is if we have the two channels filled exactly simultaneously
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this!! Thanks for adding Violet!
Description
Fixes an issue introduced in 1.17 where CPU usage in Agent and Proxy are extremely high due to the code taking the same path down a select statement repeatedly (in an infinite loop).
Will be backported to 1.17.
Fixes #27505
TODO only if you're a HashiCorp employee
getting backported to N-2, use the new style
backport/ent/x.x.x+ent
labelsinstead of the old style
backport/x.x.x
labels.the normal
backport/x.x.x
label (there should be only 1).of a public function, even if that change is in a CE file, double check that
applying the patch for this PR to the ENT repo and running tests doesn't
break any tests. Sometimes ENT only tests rely on public functions in CE
files.
in the PR description, commit message, or branch name.
description. Also, make sure the changelog is in this PR, not in your ENT PR.