Teleport signal handling and live reload. #1679

Merged: 1 commit merged into master from sasha/reload, Feb 14, 2018
Conversation

klizhentas (Contributor):

This commit introduces signal handling.
The parent teleport process is now capable of forking
a child process and passing listener file descriptors
to the child.

The parent process can then shut down gracefully
by tracking the number of active connections and
closing its listeners once that number drops to 0.

Here are the signals handled:

  • USR2 signal will cause the parent to fork
    a child process and pass listener file descriptors to it.
    The child process will close unused file descriptors
    and bind to the ones in use.

At this point two processes - the parent
and the forked child - will be serving requests.
After reviewing the traffic and the log files,
the administrator can shut down either the parent process
or, if the child is not functioning as expected,
the child process.

  • TERM, INT signals will trigger graceful process shutdown.
    Auth, node and proxy processes will wait until the number
    of active connections drops to 0 and will exit after that.

  • KILL, QUIT signals will cause immediate non-graceful
    shutdown.

  • HUP signal combines USR2 and TERM in a convenient
    way: the parent process forks a child process and then
    self-initiates graceful shutdown. This is more convenient
    than the USR2/TERM sequence, but less robust: if
    the connection to the parent process drops and
    the new process exits with an error, administrators
    can lock themselves out of the environment.

Additionally, the boltdb backend has to be phased out,
as it does not support reads and writes by two concurrent
processes. This required refactoring the dir
backend to use file locking, allowing inter-process
coordination on read/write operations.

@klizhentas (Contributor Author):

retest this please

@klizhentas klizhentas force-pushed the sasha/reload branch 2 times, most recently from 8bdde0e to b890087 Compare February 13, 2018 20:56
}
return trace.ConvertSystemError(err)
}
if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
Collaborator:

Use writeLock() method from above?

return trace.ConvertSystemError(err)
}
defer f.Close()
if err := writeLock(f); err != nil {
Collaborator:

Wasn't this file just locked a couple of lines above?

const teleportFilesEnvVar = "TELEPORT_OS_FILES"

func execPath() (string, error) {
name, err := exec.LookPath(os.Args[0])
Collaborator:

There's std os.Executable() method that seems to do what you need.

@@ -194,7 +173,7 @@ func Run(options Options) (executedCommand string, conf *service.Config) {
if err != nil {
utils.FatalError(err)
}
log.Info("teleport: clean exit")
log.Debugf("Clean exit.")
Collaborator:

Debug.

if ccf.Gops {
log.Debugf("starting gops agent")
err := agent.Listen(&agent.Options{Addr: ccf.GopsAddr})
log.Debugf("Starting gops agent.")
Collaborator:

Debug.

}
warnOnErr(sshProxy.Close())
} else {
log.Infof("Shutting down gracefully.")
Collaborator:

Info.

log.Infof("Proxy service exited.")
defer listeners.Close()
if payload == nil {
log.Infof("Shutting down immediately.")
Collaborator:

Info.

}
log.Infof("Exited.")
Collaborator:

Info.

@@ -402,7 +402,7 @@ func (s *server) diffConns(newConns, existingConns map[string]services.TunnelCon
}

func (s *server) Wait() {
s.srv.Wait()
s.srv.Wait(context.TODO())
Collaborator:

Would it be better to have a caller pass context to the outer method (similar to Shutdown below)? Or is it too much refactoring?

Contributor Author:

yeah, no one is using it now, so I decided to pass

if err != nil {
err = trace.ConvertSystemError(err)
if !trace.IsNotFound(err) {
return err
Collaborator:

trace.Wrap(err).

Contributor Author:

ConvertSystemError wraps

@klizhentas (Contributor Author):

For folks who are interested in why we are using flock vs other types of locks, read this article:

https://gavv.github.io/blog/file-locks/

and this one is also fun:

http://0pointer.de/blog/projects/locking.html

for i := 0; i < attempts; i++ {
go func(cnt int) {
err := s.bk.UpsertVal(bucket, "key", []byte("new-value"), backend.Forever)
err := s.bk.UpsertVal(bucket, "key", []byte(value1), time.Hour)
Contributor Author:

switch to create back

}
s.Infof("Shutdown: waiting for %v connections to finish.", activeConnections)
lastReport := time.Time{}
ticker := time.NewTicker(s.shutdownPollPeriod)
Contributor Author:

defer ticker.Stop()

@klizhentas klizhentas merged commit 650a222 into master Feb 14, 2018
@russjones russjones mentioned this pull request Feb 15, 2018
@klizhentas klizhentas mentioned this pull request Feb 19, 2018
@klizhentas klizhentas deleted the sasha/reload branch March 21, 2018 02:20
3 participants