Teleport signal handling and live reload. #1679

Merged: 1 commit merged into master from sasha/reload, Feb 14, 2018
Conversation

klizhentas (Contributor):

This commit introduces signal handling.
The parent teleport process is now capable of forking
a child process and passing listener file descriptors
to the child.

The parent process can then shut down gracefully
by tracking the number of active connections and
closing its listeners once that number drops to 0.

Here are the signals handled:

  • USR2 signal will cause the parent to fork
    a child process and pass listener file descriptors to it.
    The child process will close unused file descriptors
    and bind to the ones in use.

At this point two processes - the parent
and the forked child - will be serving requests.
After reviewing the traffic and the log files,
the administrator can shut down either the parent process
or, if the child is not functioning as expected,
the child process.

  • TERM, INT signals will trigger graceful process shutdown.
    Auth, node and proxy processes will wait until the number
    of active connections drops to 0 and will exit after that.

  • KILL, QUIT signals will cause immediate non-graceful
    shutdown.

  • HUP signal combines USR2 and TERM in a convenient
    way: the parent process forks a child process and then
    self-initiates graceful shutdown. This is more convenient
    than the USR2/TERM sequence, but less robust: if
    the connection to the parent process drops and
    the new process exits with an error, administrators
    can lock themselves out of the environment.

Additionally, the boltdb backend has to be phased out,
as it does not support reads and writes by two concurrent
processes. This required refactoring the dir
backend to use file locking, allowing inter-process
coordination on read/write operations.

@klizhentas (Contributor Author):

retest this please

@klizhentas klizhentas force-pushed the sasha/reload branch 2 times, most recently from 8bdde0e to b890087 Compare February 13, 2018 20:56
}
return trace.ConvertSystemError(err)
}
if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
Collaborator:

Use writeLock() method from above?

return trace.ConvertSystemError(err)
}
defer f.Close()
if err := writeLock(f); err != nil {
Collaborator:

Wasn't this file just locked a couple of lines above?

const teleportFilesEnvVar = "TELEPORT_OS_FILES"

func execPath() (string, error) {
name, err := exec.LookPath(os.Args[0])
Collaborator:

There's std os.Executable() method that seems to do what you need.

@@ -194,7 +173,7 @@ func Run(options Options) (executedCommand string, conf *service.Config) {
if err != nil {
utils.FatalError(err)
}
log.Info("teleport: clean exit")
log.Debugf("Clean exit.")
Collaborator:

Debug.

if ccf.Gops {
log.Debugf("starting gops agent")
err := agent.Listen(&agent.Options{Addr: ccf.GopsAddr})
log.Debugf("Starting gops agent.")
Collaborator:

Debug.

}
warnOnErr(sshProxy.Close())
} else {
log.Infof("Shutting down gracefully.")
Collaborator:

Info.

log.Infof("Proxy service exited.")
defer listeners.Close()
if payload == nil {
log.Infof("Shutting down immediately.")
Collaborator:

Info.

}
log.Infof("Exited.")
Collaborator:

Info.

@@ -402,7 +402,7 @@ func (s *server) diffConns(newConns, existingConns map[string]services.TunnelCon
}

func (s *server) Wait() {
s.srv.Wait()
s.srv.Wait(context.TODO())
Collaborator:

Would it be better to have a caller pass context to the outer method (similar to Shutdown below)? Or is it too much refactoring?

Contributor Author:

yeah, no one is using it now, so I decided to pass

if err != nil {
err = trace.ConvertSystemError(err)
if !trace.IsNotFound(err) {
return err
Collaborator:

trace.Wrap(err).

Contributor Author:

ConvertSystemError wraps

@klizhentas (Contributor Author):

For folks who are interested in why we are using flock vs other types of locks, read this article:

https://gavv.github.io/blog/file-locks/

and this one is also fun:

http://0pointer.de/blog/projects/locking.html

for i := 0; i < attempts; i++ {
go func(cnt int) {
err := s.bk.UpsertVal(bucket, "key", []byte("new-value"), backend.Forever)
err := s.bk.UpsertVal(bucket, "key", []byte(value1), time.Hour)
Contributor Author:

switch to create back

}
s.Infof("Shutdown: waiting for %v connections to finish.", activeConnections)
lastReport := time.Time{}
ticker := time.NewTicker(s.shutdownPollPeriod)
Contributor Author:

defer ticker.Stop()

@klizhentas klizhentas merged commit 650a222 into master Feb 14, 2018
@russjones russjones mentioned this pull request Feb 15, 2018
@klizhentas klizhentas mentioned this pull request Feb 19, 2018
@klizhentas klizhentas deleted the sasha/reload branch March 21, 2018 02:20
3 participants