
Add config, logging for healthcheck #2998

Merged: 1 commit into moby:master from the config-and-log-healthcheck branch on Aug 22, 2022

Conversation

@dchw (Contributor) commented on Aug 2, 2022:

This adds configuration values and logging for the gRPC healthcheck ping. By increasing these values, you can get builds with many or large files to work in a bandwidth-starved environment.

Sample configuration section:

[health]
  frequency = "10s"
  timeout = "1m"
  allowedFailures = 3

This config is completely optional, and leaving it blank (or omitting it) maintains the old behavior.
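
For illustration, a [health] section like the sample above could be decoded into a small struct roughly as follows. This is only a sketch: buildkitd's actual config package, loader, and field names may differ, and the TOML library used here (BurntSushi/toml) is just one option.

```go
// Sketch only: an illustrative mapping of the sample [health] section,
// not buildkit's actual config code.
package config

import (
	"github.com/BurntSushi/toml"
)

type HealthConfig struct {
	// Durations stay strings here and would be parsed with time.ParseDuration.
	Frequency       string `toml:"frequency"`
	Timeout         string `toml:"timeout"`
	AllowedFailures int    `toml:"allowedFailures"`
}

type Config struct {
	// Health == nil means the section was omitted and the old defaults apply.
	Health *HealthConfig `toml:"health"`
}

func loadConfig(data string) (*Config, error) {
	var c Config
	if _, err := toml.Decode(data, &c); err != nil {
		return nil, err
	}
	return &c, nil
}
```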

It also adds logging: a debug message for each completed healthcheck and a warning for each failure. It also logs when it cancels the context, so if you see odd cancellations in your use case, they have a chance of being traced back to here. Here is a small log snippet showing some failures and recoveries:

time="2022-08-02T22:28:02Z" level=debug msg="diff applied" d=91.038907ms digest="sha256:ab6db1bc80d0a6df92d04c3fad44b9443642fbc85878023bc8c011763fe44524" media=application/vnd.docker.image.rootfs.diff.tar.gzip size=2814645
time="2022-08-02T22:28:03Z" level=debug msg="healthcheck completed" actualDuration="734.564µs" timeout=17s
time="2022-08-02T22:28:04Z" level=debug msg="healthcheck completed" actualDuration="814.28µs" timeout=17s
time="2022-08-02T22:28:05Z" level=debug msg="healthcheck completed" actualDuration="720.232µs" timeout=17s
time="2022-08-02T22:28:06Z" level=debug msg="healthcheck completed" actualDuration=2.065139ms timeout=17s
time="2022-08-02T22:28:07Z" level=debug msg="healthcheck completed" actualDuration=2.041702ms timeout=17s
time="2022-08-02T22:28:07Z" level=debug msg="new ref for local: a5224clhf1nesi63z3zsno1zs" span="[context .] local context ."
time="2022-08-02T22:28:25Z" level=warning msg="healthcheck failed" actualDuration=17.000346522s allowedFailures=3 consecutiveFailures=1 timeout=17s
time="2022-08-02T22:28:42Z" level=warning msg="healthcheck failed" actualDuration=17.000424726s allowedFailures=3 consecutiveFailures=2 timeout=17s
time="2022-08-02T22:28:58Z" level=debug msg="healthcheck completed" actualDuration=16.496519198s timeout=17s
time="2022-08-02T22:29:15Z" level=warning msg="healthcheck failed" actualDuration=17.000784359s allowedFailures=3 consecutiveFailures=1 timeout=17s
time="2022-08-02T22:29:32Z" level=warning msg="healthcheck failed" actualDuration=17.000419603s allowedFailures=3 consecutiveFailures=2 timeout=17s
time="2022-08-02T22:29:48Z" level=debug msg="healthcheck completed" actualDuration=15.841662313s timeout=17s
time="2022-08-02T22:30:05Z" level=warning msg="healthcheck failed" actualDuration=17.000593778s allowedFailures=3 consecutiveFailures=1 timeout=17s
time="2022-08-02T22:30:22Z" level=warning msg="healthcheck failed" actualDuration=17.001052249s allowedFailures=3 consecutiveFailures=2 timeout=17s
time="2022-08-02T22:30:38Z" level=debug msg="healthcheck completed" actualDuration=16.517479546s timeout=17s
time="2022-08-02T22:30:55Z" level=warning msg="healthcheck failed" actualDuration=17.000286526s allowedFailures=3 consecutiveFailures=1 timeout=17s
time="2022-08-02T22:31:12Z" level=warning msg="healthcheck failed" actualDuration=17.000703879s allowedFailures=3 consecutiveFailures=2 timeout=17s
time="2022-08-02T22:31:29Z" level=warning msg="healthcheck failed" actualDuration=17.000178969s allowedFailures=3 consecutiveFailures=3 timeout=17s
time="2022-08-02T22:31:29Z" level=error msg="healthcheck failed too many times"

@dchw dchw marked this pull request as draft August 2, 2022 23:47
@dchw (Contributor, Author) commented on Aug 2, 2022:

Went to draft to address some test failures. Feedback still welcome.

@tonistiigi (Member) left a comment:

I'd prefer not to leave it configurable. Users don't want this to be configurable, they just want it to work. And if we can't find the correct configuration then they can't either.

I'd suggest:

- Bump frequency to 5 sec
- Timeout 30-last*1.5 (depending on the time of the last successful health check)
- Allow one failure
- Every failure is logged, fatal failure logged differently
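
For illustration, a loop along the lines of that suggestion might look like the sketch below. This is not the code that was merged; the function name, the cancelConn parameter, and the exact constants are placeholders.

```go
// Sketch only: the suggested strategy, not the merged implementation.
package session

import (
	"context"
	"time"

	"github.com/moby/buildkit/util/bklog"
	"google.golang.org/grpc"
	"google.golang.org/grpc/health/grpc_health_v1"
)

func monitorHealthSketch(ctx context.Context, cc *grpc.ClientConn, cancelConn func()) {
	ticker := time.NewTicker(5 * time.Second) // "bump frequency to 5 sec"
	defer ticker.Stop()
	healthClient := grpc_health_v1.NewHealthClient(cc)

	failedBefore := false
	lastCheck := time.Duration(0)

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Base timeout of 30s, stretched to 1.5x the previous check when
			// the previous check was already slow.
			timeout := 30 * time.Second
			if d := time.Duration(float64(lastCheck) * 1.5); d > timeout {
				timeout = d
			}

			checkCtx, cancel := context.WithTimeout(ctx, timeout)
			start := time.Now()
			_, err := healthClient.Check(checkCtx, &grpc_health_v1.HealthCheckRequest{})
			cancel()
			lastCheck = time.Since(start)

			if err != nil {
				if failedBefore {
					// Second consecutive failure ("allow one failure"): fatal.
					bklog.G(ctx).Error("healthcheck failed fatally")
					cancelConn()
					return
				}
				failedBefore = true
				bklog.G(ctx).Warn("healthcheck failed")
				continue
			}
			failedBefore = false
			bklog.G(ctx).Debugf("healthcheck completed in %v", lastCheck)
		}
	}
}
```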

@tonistiigi tonistiigi added this to the v0.10.4 milestone Aug 8, 2022
@tonistiigi (Member) commented:

@dchw Any update?

@dchw dchw force-pushed the config-and-log-healthcheck branch from 99ff8de to adaccb4 Compare August 9, 2022 20:12
@dchw (Contributor, Author) commented on Aug 9, 2022:

> I'd prefer not to leave it configurable. Users don't want this to be configurable, they just want it to work.

I 100% agree. "Just works" is a great default.

> And if we can't find the correct configuration then they can't either.

Not sure this is true. You can certainly target a particular lowest common denominator, but without configuration there are cases at the edge that cannot be supported. Does buildkit have a definition of what that edge is?

> Timeout 30-last*1.5 (depending on the time of the last successful health check)

I did this, but shortening the timeout in response to a slow previous check feels wrong, somehow. Should we add instead of subtract?

@dchw dchw force-pushed the config-and-log-healthcheck branch from adaccb4 to 532411e Compare August 9, 2022 20:26
@dchw dchw marked this pull request as ready for review August 10, 2022 19:27
@dchw dchw requested a review from tonistiigi August 10, 2022 19:27
@tonistiigi (Member) commented on Aug 11, 2022:

> But without configuration, there are cases at the edge that cannot be supported. Does buildkit have a definition of what that edge is?

The fundamental problem seems to be that we can't easily detect whether the connection is still active. If we had a way to ask when the last packet was successfully transferred, the answer would be pretty simple: if nothing has moved for ~15 sec, it is done. That doesn't mean 15 sec is the right config value, though; based on your description the connection is active at the moment. In fact, it is so active that it saturates the link and the healthcheck can't get through.
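
For illustration of that idea (not something this PR implements), gRPC's stats.Handler interface can record when the last payload actually moved on a connection; the type and method names below are invented for the sketch.

```go
// Sketch only: tracking "when did data last move" via a gRPC stats handler.
package session

import (
	"context"
	"sync"
	"time"

	"google.golang.org/grpc/stats"
)

// activityTracker remembers the last time an RPC payload was sent or received.
type activityTracker struct {
	mu   sync.Mutex
	last time.Time
}

func (t *activityTracker) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context {
	return ctx
}

func (t *activityTracker) HandleRPC(_ context.Context, s stats.RPCStats) {
	switch s.(type) {
	case *stats.InPayload, *stats.OutPayload:
		t.mu.Lock()
		t.last = time.Now()
		t.mu.Unlock()
	}
}

func (t *activityTracker) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context {
	return ctx
}

func (t *activityTracker) HandleConn(context.Context, stats.ConnStats) {}

// idleFor reports whether nothing has moved for longer than d.
func (t *activityTracker) idleFor(d time.Duration) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return !t.last.IsZero() && time.Since(t.last) > d
}
```

Registered at dial time with grpc.WithStatsHandler, something like this would let a healthcheck distinguish a saturated-but-moving connection from a dead one.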

session/grpc.go Outdated
for {
	select {
	case <-ctx.Done():
		return
	case <-ticker.C:
		ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
		healthcheckStart := time.Now().UTC()
Member:

Please add a comment before this block explaining why we need to do these extra checks.

session/grpc.go Outdated
ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
healthcheckStart := time.Now().UTC()

calculatedTime := maxHealthcheckDuration - time.Duration(float64(lastHealthcheckDuration)*1.5)
Member:

Yes, I didn't mean subtract in my comment; I meant somewhere in between the two values, not a minus b. So effectively max(maxHealthcheckDuration, time.Duration(float64(lastHealthcheckDuration)*1.5)).

Also rename calculatedTime -> timeout.

Collaborator:

Did you mean min instead of max? Otherwise maxHealthcheckDuration wouldn't actually be an upper bound.

Member:

max. The healthcheck will time out in 30 sec, except when the last healthcheck took (for example) 22 sec; then it would time out at 22*1.5 = 33 sec, because we already know the system is behaving slowly and we give it more time. (I'm not strict on what the actual constant values would be.)

session/grpc.go Outdated
defer ticker.Stop()
healthClient := grpc_health_v1.NewHealthClient(cc)

hasFailedBefore := false
Collaborator:

Nit: simplify name to failedBefore?

I think this should reset at some point - maybe after 5 successful health checks?
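
A minimal sketch of that reset, inside the check loop quoted above (using the suggested failedBefore name and a corrected consecutiveSuccessful spelling; the threshold of 5 is arbitrary):

```go
// Fragment of the check loop: forget an old failure once the connection has
// been healthy for a stretch, so the next hiccup is not immediately fatal.
if err != nil {
	failedBefore = true
	consecutiveSuccessful = 0
	bklog.G(ctx).WithFields(logFields).Warn("healthcheck failed")
} else {
	consecutiveSuccessful++
	if consecutiveSuccessful >= 5 {
		failedBefore = false
	}
}
```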

@dchw dchw force-pushed the config-and-log-healthcheck branch from 43acf01 to 5ab99c1 Compare August 12, 2022 18:45
session/grpc.go Outdated
// This healthcheck can erroneously fail in some instances, such as receiving lots of data in a low-bandwidth scenario or running too many concurrent builds.
// So this healthcheck is purposely long, and can tolerate some failures.

healthcheckStart := time.Now().UTC()
Collaborator:

The .UTC() is not necessary here, since you are only checking the time elapsed since healthcheckStart.

session/grpc.go Outdated
defer ticker.Stop()
healthClient := grpc_health_v1.NewHealthClient(cc)

failedBefore := false
consecutiveSucessful := 0
maxHealthcheckDuration := 30 * time.Second
Collaborator:

This name feels misleading to me because the healthcheck timeout can exceed 30 seconds. Maybe defaultHealthcheckTimeout?

failedBefore := false
consecutiveSucessful := 0
maxHealthcheckDuration := 30 * time.Second
lastHealthcheckDuration := time.Duration(0)
Collaborator:

Where does lastHealthcheckDuration get updated?

session/grpc.go Outdated
healthcheckStart := time.Now()

timeout := time.Duration(math.Max(float64(defaultHealthcheckDuration), float64(lastHealthcheckDuration)*1.5))
lastHealthcheckDuration = timeout
Collaborator:

This doesn't look correct. I think it should be set to the value you use for actualDuration (time.Since(healthcheckStart), after the health check)
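
In other words, a sketch of the suggested fix, using the names from the snippet above:

```go
healthcheckStart := time.Now()

timeout := time.Duration(math.Max(float64(defaultHealthcheckDuration), float64(lastHealthcheckDuration)*1.5))

ctx, cancel := context.WithTimeout(ctx, timeout)
_, err := healthClient.Check(ctx, &grpc_health_v1.HealthCheckRequest{})
cancel()

// Record how long the check actually took (the actualDuration value), not the
// timeout that was allowed for it, so the next timeout scales with reality.
lastHealthcheckDuration = time.Since(healthcheckStart)
```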

@tonistiigi (Member) left a comment:

@dchw PTAL at the CI failure in the validation check.

session/grpc.go Outdated
consecutiveSucessful = 0
bklog.G(ctx).WithFields(logFields).Warn("healthcheck failed")
} else {
consecutiveSucessful++
Member:

"successful"

@aaronlehmann (Collaborator) commented:
LGTM

@tonistiigi (Member) left a comment:

LGTM but please squash commits

@tonistiigi (Member) commented:

@dchw ping

Signed-off-by: Corey Larson <corey@earthly.dev>
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
@tonistiigi tonistiigi force-pushed the config-and-log-healthcheck branch from 34639e8 to b637861 Compare August 20, 2022 01:35
@tonistiigi tonistiigi requested a review from AkihiroSuda August 20, 2022 01:36
@tonistiigi (Member) commented:

Rebased and squashed

@tonistiigi tonistiigi merged commit 8051690 into moby:master Aug 22, 2022