Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(vault): let shdict secret vault cache presist enough time during resurrect_ttl #13471

Merged
merged 8 commits into from
Aug 23, 2024

Conversation

windmgc
Copy link
Member

@windmgc windmgc commented Aug 8, 2024

Summary

This PR fixes an issue that rotate_secret may flush a secret value with NEGATIVE_CACHED_VALUE when vault backend is down and a secret value stored in the shared dict has passed its ttl and hasn't finished consuming its resurrect_ttl.

TLDR; this issue happens easily when a reference is being used via the vault PDK function in custom codes(serverless functions, custom plugins, etc.), and some of the worker processes may not be triggered via the service/routes that use these custom codes, and these worker processes do not hold a valid LRU cache for the secret value

The issue was first reported in FTI-6137.

Checklist

  • The Pull Request has tests
  • A changelog file has been created under changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary. README.md
  • There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

FTI-6137

@github-actions github-actions bot added core/pdk cherry-pick kong-ee schedule this PR for cherry-picking to kong/kong-ee labels Aug 8, 2024
@windmgc windmgc marked this pull request as ready for review August 14, 2024 08:29
kong/pdk/vault.lua Outdated Show resolved Hide resolved
@tzssangglass
Copy link
Member

LGTM

if ttl > resurrect_ttl then
return true

elseif ttl > SECRETS_CACHE_MIN_TTL then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I'm understanding this correctly, but why should ttl > SECRETS_CACHE_MIN_TTL here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now the vault timer is refreshing all the secrets in shared dict with a period that equals to ROTATION_INTERVAL, and SECRETS_CACHE_MIN_TTL is 2 * ROTATION_INTERVAL. This ensures the execution of the secret refreshing will happen at least once within the last remaining time of SECRETS_CACHE_MIN_TTL, so that it can be refreshed either with a null placholder(NEGATIVELY_CACHED_VALUE) or a new updated value.

@AndyZhang0707 AndyZhang0707 added this to the 3.8.0 milestone Aug 19, 2024
kong/pdk/vault.lua Outdated Show resolved Hide resolved
@windmgc windmgc requested a review from catbro666 August 21, 2024 06:15
@bungle
Copy link
Member

bungle commented Aug 21, 2024

I will take a look on it.

Signed-off-by: Aapo Talvensaari <aapo.talvensaari@gmail.com>
kong/pdk/vault.lua Outdated Show resolved Hide resolved
kong/pdk/vault.lua Outdated Show resolved Hide resolved
@windmgc
Copy link
Member Author

windmgc commented Aug 21, 2024

@bungle I applied the reviews above, do you want to take another look?

It seems something is wrong with Github Action's worker, all tests are pending, we may need to wait until all tests passed

if resurrect and value == nil then
local resurrected_value = SECRETS_CACHE:get(cache_key)
if resurrected_value then
return resurrected_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about adding a log here?

@windmgc windmgc merged commit 9269195 into master Aug 23, 2024
26 checks passed
@windmgc windmgc deleted the vault-shdict-resurrect-ttl branch August 23, 2024 03:13
github-actions bot pushed a commit that referenced this pull request Aug 23, 2024
…resurrect_ttl (#13471)

This PR fixes an issue that rotate_secret may flush a secret value with NEGATIVE_CACHED_VALUE when vault backend is down and a secret value stored in the shared dict has passed its ttl and hasn't finished consuming its resurrect_ttl.

TLDR; this issue happens easily when a reference is being used via the vault PDK function in custom codes(serverless functions, custom plugins, etc.), and some of the worker processes may not be triggered via the service/routes that use these custom codes, and these worker processes do not hold a valid LRU cache for the secret value

The issue was first reported in FTI-6137.

---------

Signed-off-by: Aapo Talvensaari <aapo.talvensaari@gmail.com>
Co-authored-by: Aapo Talvensaari <aapo.talvensaari@gmail.com>
(cherry picked from commit 9269195)
@team-gateway-bot
Copy link
Collaborator

@team-gateway-bot
Copy link
Collaborator

Successfully created cherry-pick PR for master:

windmgc added a commit that referenced this pull request Aug 23, 2024
…resurrect_ttl (#13471)

This PR fixes an issue that rotate_secret may flush a secret value with NEGATIVE_CACHED_VALUE when vault backend is down and a secret value stored in the shared dict has passed its ttl and hasn't finished consuming its resurrect_ttl.

TLDR; this issue happens easily when a reference is being used via the vault PDK function in custom codes(serverless functions, custom plugins, etc.), and some of the worker processes may not be triggered via the service/routes that use these custom codes, and these worker processes do not hold a valid LRU cache for the secret value

The issue was first reported in FTI-6137.

---------

Signed-off-by: Aapo Talvensaari <aapo.talvensaari@gmail.com>
Co-authored-by: Aapo Talvensaari <aapo.talvensaari@gmail.com>
(cherry picked from commit 9269195)
@team-gateway-bot
Copy link
Collaborator

Cherry-pick failed for master, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git remote add upstream https://github.com/kong/kong-ee
git fetch upstream master
git worktree add -d .worktree/cherry-pick-13471-to-master-to-upstream upstream/master
cd .worktree/cherry-pick-13471-to-master-to-upstream
git checkout -b cherry-pick-13471-to-master-to-upstream
ancref=$(git merge-base 9bc3deb30041dbc45b1fe44fe8069f21ab69caaa 30215e6ff8c7711162107c53a7252f9d78691ca6)
git cherry-pick -x $ancref..30215e6ff8c7711162107c53a7252f9d78691ca6

@github-actions github-actions bot added the incomplete-cherry-pick A cherry-pick was incomplete and needs manual intervention label Sep 14, 2024
github-actions bot pushed a commit that referenced this pull request Sep 14, 2024
…resurrect_ttl (#13471)

This PR fixes an issue that rotate_secret may flush a secret value with NEGATIVE_CACHED_VALUE when vault backend is down and a secret value stored in the shared dict has passed its ttl and hasn't finished consuming its resurrect_ttl.

TLDR; this issue happens easily when a reference is being used via the vault PDK function in custom codes(serverless functions, custom plugins, etc.), and some of the worker processes may not be triggered via the service/routes that use these custom codes, and these worker processes do not hold a valid LRU cache for the secret value

The issue was first reported in FTI-6137.

---------

Signed-off-by: Aapo Talvensaari <aapo.talvensaari@gmail.com>
Co-authored-by: Aapo Talvensaari <aapo.talvensaari@gmail.com>
(cherry picked from commit 9269195)
@team-gateway-bot
Copy link
Collaborator

github-actions bot pushed a commit that referenced this pull request Sep 14, 2024
…resurrect_ttl (#13471)

This PR fixes an issue that rotate_secret may flush a secret value with NEGATIVE_CACHED_VALUE when vault backend is down and a secret value stored in the shared dict has passed its ttl and hasn't finished consuming its resurrect_ttl.

TLDR; this issue happens easily when a reference is being used via the vault PDK function in custom codes(serverless functions, custom plugins, etc.), and some of the worker processes may not be triggered via the service/routes that use these custom codes, and these worker processes do not hold a valid LRU cache for the secret value

The issue was first reported in FTI-6137.

---------

Signed-off-by: Aapo Talvensaari <aapo.talvensaari@gmail.com>
Co-authored-by: Aapo Talvensaari <aapo.talvensaari@gmail.com>
(cherry picked from commit 9269195)
@team-gateway-bot
Copy link
Collaborator

ms2008 pushed a commit that referenced this pull request Sep 14, 2024
…resurrect_ttl (#13471)

This PR fixes an issue that rotate_secret may flush a secret value with NEGATIVE_CACHED_VALUE when vault backend is down and a secret value stored in the shared dict has passed its ttl and hasn't finished consuming its resurrect_ttl.

TLDR; this issue happens easily when a reference is being used via the vault PDK function in custom codes(serverless functions, custom plugins, etc.), and some of the worker processes may not be triggered via the service/routes that use these custom codes, and these worker processes do not hold a valid LRU cache for the secret value

The issue was first reported in FTI-6137.

---------

Signed-off-by: Aapo Talvensaari <aapo.talvensaari@gmail.com>
Co-authored-by: Aapo Talvensaari <aapo.talvensaari@gmail.com>
(cherry picked from commit 9269195)
@kikito kikito removed the incomplete-cherry-pick A cherry-pick was incomplete and needs manual intervention label Sep 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants