Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(bin): Add graceful shutdown for helm command #9520

Merged
merged 2 commits into from
Nov 25, 2024

Conversation

idsulik
Copy link
Contributor

@idsulik idsulik commented Sep 6, 2024

Description
Added graceful shutdown for helm command

the CommandContext kills the process when we cancel the context
image

it leads to an inconsistent state in helm charts, the helm left chart in "pending-install" state.
image

How to reproduce the issue:

  1. start a deployment using the helm
  2. in the middle of the installation stop the process using Ctrl/CMD + C
  3. check the release status using helm status {releaseName}, the status will be "pending-install"
  4. try to deploy again and you won't be able to do it, because helm will return helm another operation (install/upgrade/rollback) is in progress

This fix gives the helm 15 seconds to rollback(if you pass --atomic) and/or save the actual(failed) state of the release

@idsulik
Copy link
Contributor Author

idsulik commented Nov 18, 2024

Hi @plumpy, please check the PR, because it's very important, especially with these changes #9451, because if a user executes parallel package installation and interrupts it, all the packages will remain in pending state, because skaffold kills the helm command execution on context cancellation.
Thank you!

@plumpy
Copy link
Collaborator

plumpy commented Nov 23, 2024

Okay, I played around with this some. For testing, I made my own little binary that waits for a SIGINT, then waits around another n seconds before exiting.

From playing around with that, it seems to me like you can just use the built-in support in the cmd package, rather than rolling your own goroutine:

cmd := exec.CommandContext(ctx, "helm", args...)
cmd.Cancel = func() error {
  fmt.Println("Terminating helm, giving it 2 minutes to clean up...")
  return cmd.Process.Signal(os.Interrupt)
}
cmd.WaitDelay = 120 * time.Second

I tried a variety of combinations, where the process is still running after WaitDelay (it gets sent a SIGKILL) or where the process exits before the WaitDelay (Skaffold exits as soon as it does, not waiting around the full WaitDelay time), so it seems like it does what you want. Can you give that a shot and see if it works for you?

Signed-off-by: Suleiman Dibirov <idsulik@gmail.com>
Signed-off-by: Suleiman Dibirov <idsulik@gmail.com>
@idsulik idsulik force-pushed the helm-graceful-shutdown branch from a09c6eb to 3a81af0 Compare November 23, 2024 16:45
@idsulik
Copy link
Contributor Author

idsulik commented Nov 23, 2024

@plumpy you're right, it's a more concise way to do what I wanted, good job, I didn't know about the WaitDelay.
I've just tried it several times with concurrent installation(#9451) and it looks like it works.

thank you!

p.s. Is there an approximate time when the next release will be published?

@plumpy
Copy link
Collaborator

plumpy commented Nov 25, 2024

p.s. Is there an approximate time when the next release will be published?

Hopefully relatively soon, but we need to figure out why the internal presubmits are failing.

@idsulik
Copy link
Contributor Author

idsulik commented Nov 25, 2024

@plumpy if you mean this error

2024/11/22 16:25:23 error signing scorecard results: getting key from Fulcio: retrieving cert: error obtaining token: expired_token
2024/11/22 16:25:23 retrying in 10s...
2024/11/22 16:25:33 error signing scorecard json results: error signing payload: getting key from Fulcio: retrieving cert: error obtaining token: expired_token

(c) https://github.com/GoogleContainerTools/skaffold/actions/runs/11976028911/job/33390752067

then it looks like the secret SCORECARD_READ_TOKEN should be updated with a new token because the old one has expired

@plumpy
Copy link
Collaborator

plumpy commented Nov 25, 2024

if you mean this error

No, not that error, we need to fix that too, but I'm less concerned about that. I'm talking about the kokoro errors, which is what actually builds, signs, and publishes the release. You won't have access to those logs unfortunately.

@plumpy plumpy merged commit d53620f into GoogleContainerTools:main Nov 25, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants