Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use windows console groups for process management #879

Merged
merged 2 commits into from
Jan 3, 2019

Conversation

lox
Copy link
Contributor

@lox lox commented Dec 20, 2018

It turns out Windows has no concept of parent/child processes, beyond storing the PID of the process that created it. This means that trying to walk a process tree to cancel a job can leave sub-processes orphaned if a subprocess in the hierarchy dies, as you can't walk that tree past the dead process. For instance in:

bash.exe (pid 1) -> bash.exe (pid 2) -> sleep.exe (pid 3)

If pid 2 dies, pid 3 remains, but killing pid 1 won't be able to traverse down to pid 3.

The best we can do is create processes inside a "console group" and then send break / ctrl-c events to that group.

Context:
#795
golang/dep#862
golang/go#17608
golang/go#6720
https://docs.microsoft.com/en-us/windows/console/generateconsolectrlevent
nodejs/node#3617

@lox
Copy link
Contributor Author

lox commented Dec 20, 2018

@amitsaha interested if you have any thoughts on this one!

@lox lox requested a review from sj26 December 20, 2018 02:30
@amitsaha
Copy link
Contributor

Hi @lox - thanks for looping me in. This definitely sounds like an improvement. I can do some testing on this as well, however that will have to wait till early January. I am keen to see how it plays with docker containers on Windows server on our buildkite agents.

@lox
Copy link
Contributor Author

lox commented Dec 20, 2018

Those are very good questions @amitsaha! I hadn't thought about docker interactions.

@amitsaha
Copy link
Contributor

Right now, the behavior that we see is that a job cancellation leaves the docker containers running and then we use a buildkite agent hook to kill any running containers before it starts a build (which spawns new containers).

I wonder if docker containers spawned from a process belong to the same console group - 'cause if that's the case, then we could improve the current situation by a lot!

@lox
Copy link
Contributor Author

lox commented Dec 20, 2018

@amitsaha I'd guess that is because under windows the bootstrap exits immediately on signal, so it doesn't have time to clean up. Whilst this PR doesn't do it, it would open the door of our bootstrap to catch the ctrl_c or break and run the pre-exit hooks. Exciting possibility if there aren't any major compatibility issues here.

@lox
Copy link
Contributor Author

lox commented Dec 20, 2018

Really good answer on how signal handling works in windows: https://stackoverflow.com/questions/35772001/how-to-handle-the-signal-in-python-on-windows-machine

@lox lox force-pushed the use-windows-console-groups-for-processes branch from 2325800 to d5b9f47 Compare January 3, 2019 00:41
@lox lox force-pushed the use-windows-console-groups-for-processes branch from a2021a2 to a6eb9db Compare January 3, 2019 04:47
@lox
Copy link
Contributor Author

lox commented Jan 3, 2019

So in Docker this results in an error for the Interrupt, but it will fall back to using TASKKILL.EXE. I think this is still a good improvement for folks that don't run the agent in windows docker.

docker/for-win#3173

@lox lox removed the wip label Jan 3, 2019
@lox lox merged commit f3203e4 into master Jan 3, 2019
@amitsaha
Copy link
Contributor

amitsaha commented Jan 6, 2019

So in Docker this results in an error for the Interrupt, but it will fall back to using TASKKILL.EXE. I think this is still a good improvement for folks that don't run the agent in windows docker.

docker/for-win#3173

that is interesting. My current use-case is slightly different:

  1. BK agent begins executing new job via PS script
  2. PS Script spawns docker containers
  3. ..

Job is manually cancelled. The spawned docker containers stay running. To clean up, we use a agent hook before the start of the next job to cleanup any running containers.

So, now after this change, I was hoping that since the new containers that are spawned by my PS script may be part of the same console group and hence the CTRL_BREAK_EVENT would also kill the containers. That however doesn't seem to be the case. I have a demo here. Any thoughts?

@lox
Copy link
Contributor Author

lox commented Jan 6, 2019

@amitsaha unfortunately docker containers don't inherit a console group. That would be too sensible a thing for windows to do. This does pave the way for windows things to run pre-exit scripts on cancellation though, which would mean you could use pre-exit hooks to cleanup like we do in the docker-compose plugin.

@amitsaha
Copy link
Contributor

amitsaha commented Jan 7, 2019

That would be too sensible a thing for windows to do.

Heh.

This does pave the way for windows things to run pre-exit scripts on cancellation though, which would mean you could use pre-exit hooks

Can you please explain this a bit more? Is the process group rooted at the bootstrap process? And if that's the case, then we can set up a signal handler in the bootstrap process to run the pre-exit handlers. If that's the case, is the change already in or would need to be done?

@lox
Copy link
Contributor Author

lox commented Jan 7, 2019

Can you please explain this a bit more?

Sure, I'm still a little bit blurry on the mechanics of windows terminal process groups, but basically we are telling windows to create a new process group for the agent bootstrap (which runs everything, and subprocesses will inherit that identifier). Cancellation happens by interrupting the bootstrap and then 10 seconds later killing it. Previously we didn't have any possible way of interrupting on windows in a way that could be caught and allow the bootstrap to cleanup (e.g upload artifacts and run pre-exit hooks). This process group change and the introduction of sending CTRL_BREAK_EVENT means that we are pretty close to having the bootstrap be able to catch that even and do it's cleanup on windows too. It might even just work right now, I'll have a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants