
Add config, logging for healthcheck #2998

Merged: 1 commit into moby:master from the config-and-log-healthcheck branch on Aug 22, 2022

Conversation

@dchw (Contributor) commented on Aug 2, 2022:

This adds configuration values and logging for the gRPC healthcheck ping. By increasing these values, you can get builds with many or large files to work in a bandwidth-starved environment.

Sample configuration section:

[health]
  frequency = "10s"
  timeout = "1m"
  allowedFailures = 3

This config is completely optional, and leaving it blank (or omitting it) maintains the old behavior.
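
For illustration, a [health] section like the sample above could be decoded into a small struct roughly as follows. This is only a sketch: buildkitd's actual config package, loader, and field names may differ, and the TOML library used here (BurntSushi/toml) is just one option.

```go
// Sketch only: an illustrative mapping of the sample [health] section,
// not buildkit's actual config code.
package config

import (
	"github.com/BurntSushi/toml"
)

type HealthConfig struct {
	// Durations stay strings here and would be parsed with time.ParseDuration.
	Frequency       string `toml:"frequency"`
	Timeout         string `toml:"timeout"`
	AllowedFailures int    `toml:"allowedFailures"`
}

type Config struct {
	// Health == nil means the section was omitted and the old defaults apply.
	Health *HealthConfig `toml:"health"`
}

func loadConfig(data string) (*Config, error) {
	var c Config
	if _, err := toml.Decode(data, &c); err != nil {
		return nil, err
	}
	return &c, nil
}
```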

It also adds logging: a debug message for each completed healthcheck and a warning for each failure. It also logs when it cancels the context, so if you see odd cancellations in your use case, they have a chance of being traced back to here. Here is a small log snippet showing some failures and recoveries:

time="2022-08-02T22:28:02Z" level=debug msg="diff applied" d=91.038907ms digest="sha256:ab6db1bc80d0a6df92d04c3fad44b9443642fbc85878023bc8c011763fe44524" media=application/vnd.docker.image.rootfs.diff.tar.gzip size=2814645
time="2022-08-02T22:28:03Z" level=debug msg="healthcheck completed" actualDuration="734.564µs" timeout=17s
time="2022-08-02T22:28:04Z" level=debug msg="healthcheck completed" actualDuration="814.28µs" timeout=17s
time="2022-08-02T22:28:05Z" level=debug msg="healthcheck completed" actualDuration="720.232µs" timeout=17s
time="2022-08-02T22:28:06Z" level=debug msg="healthcheck completed" actualDuration=2.065139ms timeout=17s
time="2022-08-02T22:28:07Z" level=debug msg="healthcheck completed" actualDuration=2.041702ms timeout=17s
time="2022-08-02T22:28:07Z" level=debug msg="new ref for local: a5224clhf1nesi63z3zsno1zs" span="[context .] local context ."
time="2022-08-02T22:28:25Z" level=warning msg="healthcheck failed" actualDuration=17.000346522s allowedFailures=3 consecutiveFailures=1 timeout=17s
time="2022-08-02T22:28:42Z" level=warning msg="healthcheck failed" actualDuration=17.000424726s allowedFailures=3 consecutiveFailures=2 timeout=17s
time="2022-08-02T22:28:58Z" level=debug msg="healthcheck completed" actualDuration=16.496519198s timeout=17s
time="2022-08-02T22:29:15Z" level=warning msg="healthcheck failed" actualDuration=17.000784359s allowedFailures=3 consecutiveFailures=1 timeout=17s
time="2022-08-02T22:29:32Z" level=warning msg="healthcheck failed" actualDuration=17.000419603s allowedFailures=3 consecutiveFailures=2 timeout=17s
time="2022-08-02T22:29:48Z" level=debug msg="healthcheck completed" actualDuration=15.841662313s timeout=17s
time="2022-08-02T22:30:05Z" level=warning msg="healthcheck failed" actualDuration=17.000593778s allowedFailures=3 consecutiveFailures=1 timeout=17s
time="2022-08-02T22:30:22Z" level=warning msg="healthcheck failed" actualDuration=17.001052249s allowedFailures=3 consecutiveFailures=2 timeout=17s
time="2022-08-02T22:30:38Z" level=debug msg="healthcheck completed" actualDuration=16.517479546s timeout=17s
time="2022-08-02T22:30:55Z" level=warning msg="healthcheck failed" actualDuration=17.000286526s allowedFailures=3 consecutiveFailures=1 timeout=17s
time="2022-08-02T22:31:12Z" level=warning msg="healthcheck failed" actualDuration=17.000703879s allowedFailures=3 consecutiveFailures=2 timeout=17s
time="2022-08-02T22:31:29Z" level=warning msg="healthcheck failed" actualDuration=17.000178969s allowedFailures=3 consecutiveFailures=3 timeout=17s
time="2022-08-02T22:31:29Z" level=error msg="healthcheck failed too many times"

@dchw dchw marked this pull request as draft August 2, 2022 23:47
@dchw (Contributor, Author) commented on Aug 2, 2022:

Went to draft to address some test failures. Feedback still welcome.

@tonistiigi (Member) left a comment:

I'd prefer not to leave it configurable. Users don't want this to be configurable, they just want it to work. And if we can't find the correct configuration then they can't either.

I'd suggest:

- Bump frequency to 5 sec
- Timeout 30-last*1.5 (depending on the time of the last successful health check)
- Allow one failure
- Every failure is logged, fatal failure logged differently
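
For illustration, a loop along the lines of that suggestion might look like the sketch below. This is not the code that was merged; the function name, the cancelConn parameter, and the exact constants are placeholders.

```go
// Sketch only: the suggested strategy, not the merged implementation.
package session

import (
	"context"
	"time"

	"github.com/moby/buildkit/util/bklog"
	"google.golang.org/grpc"
	"google.golang.org/grpc/health/grpc_health_v1"
)

func monitorHealthSketch(ctx context.Context, cc *grpc.ClientConn, cancelConn func()) {
	ticker := time.NewTicker(5 * time.Second) // "bump frequency to 5 sec"
	defer ticker.Stop()
	healthClient := grpc_health_v1.NewHealthClient(cc)

	failedBefore := false
	lastCheck := time.Duration(0)

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Base timeout of 30s, stretched to 1.5x the previous check when
			// the previous check was already slow.
			timeout := 30 * time.Second
			if d := time.Duration(float64(lastCheck) * 1.5); d > timeout {
				timeout = d
			}

			checkCtx, cancel := context.WithTimeout(ctx, timeout)
			start := time.Now()
			_, err := healthClient.Check(checkCtx, &grpc_health_v1.HealthCheckRequest{})
			cancel()
			lastCheck = time.Since(start)

			if err != nil {
				if failedBefore {
					// Second consecutive failure ("allow one failure"): fatal.
					bklog.G(ctx).Error("healthcheck failed fatally")
					cancelConn()
					return
				}
				failedBefore = true
				bklog.G(ctx).Warn("healthcheck failed")
				continue
			}
			failedBefore = false
			bklog.G(ctx).Debugf("healthcheck completed in %v", lastCheck)
		}
	}
}
```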

@tonistiigi tonistiigi added this to the v0.10.4 milestone Aug 8, 2022
@tonistiigi (Member) commented:

@dchw Any update?

@dchw dchw force-pushed the config-and-log-healthcheck branch from 99ff8de to adaccb4 Compare August 9, 2022 20:12
@dchw (Contributor, Author) commented on Aug 9, 2022:

> I'd prefer not to leave it configurable. Users don't want this to be configurable, they just want it to work.

I 100% agree. "Just works" is a great default.

> And if we can't find the correct configuration then they can't either.

Not sure this is true. You can certainly target a particular lowest common denominator, but without configuration there are cases at the edge that cannot be supported. Does buildkit have a definition of what that edge is?

> Timeout 30-last*1.5 (depending on the time of the last successful health check)

I did this, but shortening the timeout in response to a slow previous check feels wrong, somehow. Should we add instead of subtract?

@dchw dchw force-pushed the config-and-log-healthcheck branch from adaccb4 to 532411e Compare August 9, 2022 20:26
@dchw dchw marked this pull request as ready for review August 10, 2022 19:27
@dchw dchw requested a review from tonistiigi August 10, 2022 19:27
@tonistiigi (Member) commented on Aug 11, 2022:

> But without configuration, there are cases at the edge that cannot be supported. Does buildkit have a definition of what that edge is?

The fundamental problem seems to be that we can't easily detect whether the connection is still active. If we had a way to ask when the last packet was successfully transferred, the answer would be pretty simple: if nothing has moved for ~15 sec, it is done. That doesn't mean 15 sec is the right config value, though; based on your description the connection is active at the moment. In fact, it is so active that it saturates the link and the healthcheck can't get through.
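
For illustration of that idea (not something this PR implements), gRPC's stats.Handler interface can record when the last payload actually moved on a connection; the type and method names below are invented for the sketch.

```go
// Sketch only: tracking "when did data last move" via a gRPC stats handler.
package session

import (
	"context"
	"sync"
	"time"

	"google.golang.org/grpc/stats"
)

// activityTracker remembers the last time an RPC payload was sent or received.
type activityTracker struct {
	mu   sync.Mutex
	last time.Time
}

func (t *activityTracker) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context {
	return ctx
}

func (t *activityTracker) HandleRPC(_ context.Context, s stats.RPCStats) {
	switch s.(type) {
	case *stats.InPayload, *stats.OutPayload:
		t.mu.Lock()
		t.last = time.Now()
		t.mu.Unlock()
	}
}

func (t *activityTracker) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context {
	return ctx
}

func (t *activityTracker) HandleConn(context.Context, stats.ConnStats) {}

// idleFor reports whether nothing has moved for longer than d.
func (t *activityTracker) idleFor(d time.Duration) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return !t.last.IsZero() && time.Since(t.last) > d
}
```

Registered at dial time with grpc.WithStatsHandler, something like this would let a healthcheck distinguish a saturated-but-moving connection from a dead one.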

session/grpc.go Outdated
for {
	select {
	case <-ctx.Done():
		return
	case <-ticker.C:
		ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
		healthcheckStart := time.Now().UTC()
Member:

Please add a comment before this block explaining why we need to do these extra checks.

session/grpc.go Outdated
ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
healthcheckStart := time.Now().UTC()

calculatedTime := maxHealthcheckDuration - time.Duration(float64(lastHealthcheckDuration)*1.5)
Member:

Yes, I didn't mean subtract in my comment; I meant somewhere in between the two values, not a minus b. So effectively max(maxHealthcheckDuration, time.Duration(float64(lastHealthcheckDuration)*1.5)).

Also rename calculatedTime -> timeout.

Collaborator:

Did you mean min instead of max? Otherwise maxHealthcheckDuration wouldn't actually be an upper bound.

Member:

max. The healthcheck will time out in 30 sec, except when the last healthcheck took (for example) 22 sec; then it would time out at 22*1.5 = 33 sec, because we already know the system is behaving slowly and we give it more time. (I'm not strict on what the actual constant values would be.)

session/grpc.go Outdated
defer ticker.Stop()
healthClient := grpc_health_v1.NewHealthClient(cc)

hasFailedBefore := false
Collaborator:

Nit: simplify name to failedBefore?

I think this should reset at some point - maybe after 5 successful health checks?
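
A minimal sketch of that reset, inside the check loop quoted above (using the suggested failedBefore name and a corrected consecutiveSuccessful spelling; the threshold of 5 is arbitrary):

```go
// Fragment of the check loop: forget an old failure once the connection has
// been healthy for a stretch, so the next hiccup is not immediately fatal.
if err != nil {
	failedBefore = true
	consecutiveSuccessful = 0
	bklog.G(ctx).WithFields(logFields).Warn("healthcheck failed")
} else {
	consecutiveSuccessful++
	if consecutiveSuccessful >= 5 {
		failedBefore = false
	}
}
```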

@dchw dchw force-pushed the config-and-log-healthcheck branch from 43acf01 to 5ab99c1 Compare August 12, 2022 18:45
session/grpc.go Outdated
// This healthcheck can erroneously fail in some instances, such as receiving lots of data in a low-bandwidth scenario or running too many concurrent builds.
// So this healthcheck is purposely long, and can tolerate some failures.

healthcheckStart := time.Now().UTC()
Collaborator:

The .UTC() is not necessary here, since you are only checking the time elapsed since healthcheckStart.

session/grpc.go Outdated
defer ticker.Stop()
healthClient := grpc_health_v1.NewHealthClient(cc)

failedBefore := false
consecutiveSucessful := 0
maxHealthcheckDuration := 30 * time.Second
Collaborator:

This name feels misleading to me because the healthcheck timeout can exceed 30 seconds. Maybe defaultHealthcheckTimeout?

failedBefore := false
consecutiveSucessful := 0
maxHealthcheckDuration := 30 * time.Second
lastHealthcheckDuration := time.Duration(0)
Collaborator:

Where does lastHealthcheckDuration get updated?

session/grpc.go Outdated
healthcheckStart := time.Now()

timeout := time.Duration(math.Max(float64(defaultHealthcheckDuration), float64(lastHealthcheckDuration)*1.5))
lastHealthcheckDuration = timeout
Collaborator:

This doesn't look correct. I think it should be set to the value you use for actualDuration (time.Since(healthcheckStart), after the health check)
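
In other words, a sketch of the suggested fix, using the names from the snippet above:

```go
healthcheckStart := time.Now()

timeout := time.Duration(math.Max(float64(defaultHealthcheckDuration), float64(lastHealthcheckDuration)*1.5))

ctx, cancel := context.WithTimeout(ctx, timeout)
_, err := healthClient.Check(ctx, &grpc_health_v1.HealthCheckRequest{})
cancel()

// Record how long the check actually took (the actualDuration value), not the
// timeout that was allowed for it, so the next timeout scales with reality.
lastHealthcheckDuration = time.Since(healthcheckStart)
```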

@tonistiigi (Member) left a comment:

@dchw PTAL at the CI failure in the validation check.

session/grpc.go Outdated
consecutiveSucessful = 0
bklog.G(ctx).WithFields(logFields).Warn("healthcheck failed")
} else {
consecutiveSucessful++
Member:

"successful"

@aaronlehmann (Collaborator) commented:
LGTM

@tonistiigi (Member) left a comment:

LGTM but please squash commits

@tonistiigi (Member) commented:

@dchw ping

Signed-off-by: Corey Larson <corey@earthly.dev>
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
@tonistiigi tonistiigi force-pushed the config-and-log-healthcheck branch from 34639e8 to b637861 Compare August 20, 2022 01:35
@tonistiigi tonistiigi requested a review from AkihiroSuda August 20, 2022 01:36
@tonistiigi (Member) commented:

Rebased and squashed

@tonistiigi tonistiigi merged commit 8051690 into moby:master Aug 22, 2022