Add config, logging for healthcheck #2998
Conversation
Went to draft to address some test failures. Feedback still welcome.
I'd prefer not to leave it configurable. Users don't want this to be configurable, they just want it to work. And if we can't find the correct configuration then they can't either.
I'd suggest:
- Bump frequency to 5 sec
- Timeout 30-last*1.5 (depending on the time of the last successful health check)
- Allow one failure
- Every failure is logged, fatal failure logged differently

(A rough sketch of this is below.)
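Not part of the thread itself, but here is a rough Go sketch of what those four suggestions add up to. The timeout rule follows the max-based reading that gets clarified further down; `runHealthcheck`, `cancelConn`, and the exact constants are placeholders, not the code this PR ends up with (assumes `context`, `time`, `google.golang.org/grpc`, `google.golang.org/grpc/health/grpc_health_v1`, and buildkit's `util/bklog`):

```go
// Sketch only: cc is the session's gRPC connection, cancelConn tears it down.
func runHealthcheck(ctx context.Context, cc *grpc.ClientConn, cancelConn func()) {
	ticker := time.NewTicker(5 * time.Second) // bumped frequency
	defer ticker.Stop()

	healthClient := grpc_health_v1.NewHealthClient(cc)
	failedBefore := false
	lastHealthcheckDuration := time.Duration(0)

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Timeout is at least 30s, stretched to 1.5x the last observed
			// duration when the previous check was already slow.
			timeout := 30 * time.Second
			if d := time.Duration(float64(lastHealthcheckDuration) * 1.5); d > timeout {
				timeout = d
			}

			start := time.Now()
			checkCtx, cancel := context.WithTimeout(ctx, timeout)
			_, err := healthClient.Check(checkCtx, &grpc_health_v1.HealthCheckRequest{})
			cancel()
			lastHealthcheckDuration = time.Since(start)

			if err != nil {
				if failedBefore {
					// A second failure in a row is fatal and logged differently.
					bklog.G(ctx).Error("healthcheck failed fatally")
					cancelConn()
					return
				}
				failedBefore = true // one failure is tolerated, but logged
				bklog.G(ctx).Warn("healthcheck failed")
			} else {
				failedBefore = false
			}
		}
	}
}
```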
@dchw Any update?
Force-pushed from 99ff8de to adaccb4
I 100% agree. "Just works" is a great default.
Not sure this is true. You can certainly target a particular lowest-common-denominator. But without configuration, there are cases at the edge that cannot be supported. Does buildkit have a definition of what that edge is?
I did this; but shortening the timeout in response to a longer last healthcheck feels wrong, somehow. Should we add instead of subtract?
Force-pushed from adaccb4 to 532411e
The fundamental problem seems to be that we can't easily detect whether the connection is still active or not. If we had a way of asking when the last packet was successfully transferred, then the answer would be pretty simple: if nothing has moved for ~15 sec, it is done. That doesn't mean 15 sec is the right config value, though, because based on your description the connection is active at the moment; in fact, it is so active that it saturates the connection and the healthcheck can't happen.
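There is no such hook in this PR, but purely for illustration: gRPC's stats handlers do see every payload, so a connection-level "when did bytes last move" signal could look roughly like the sketch below. `activityTracker` and `idleFor` are made-up names, not something the thread proposes adding; assumes `context`, `sync/atomic`, `time`, and `google.golang.org/grpc/stats`.

```go
// Illustrative only: attach with grpc.WithStatsHandler(&activityTracker{})
// and ask idleFor() how long the connection has been silent.
type activityTracker struct {
	lastNanos atomic.Int64 // unix nanos of the last payload in either direction
}

func (t *activityTracker) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context {
	return ctx
}

func (t *activityTracker) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context {
	return ctx
}

func (t *activityTracker) HandleConn(context.Context, stats.ConnStats) {}

func (t *activityTracker) HandleRPC(_ context.Context, s stats.RPCStats) {
	switch s.(type) {
	case *stats.InPayload, *stats.OutPayload:
		t.lastNanos.Store(time.Now().UnixNano())
	}
}

func (t *activityTracker) idleFor() time.Duration {
	return time.Since(time.Unix(0, t.lastNanos.Load()))
}
```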
session/grpc.go
Outdated
for {
	select {
	case <-ctx.Done():
		return
	case <-ticker.C:
		ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
		healthcheckStart := time.Now().UTC()
Please add a comment before this block explaining why we need to do these extra checks.
session/grpc.go
Outdated
ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
healthcheckStart := time.Now().UTC()

calculatedTime := maxHealthcheckDuration - time.Duration(float64(lastHealthcheckDuration)*1.5)
Yes, I didn't mean subtract in my comment. I meant in between `a` and `b`. So effectively `max(maxHealthcheckDuration, time.Duration(float64(lastHealthcheckDuration)*1.5))`.

`calculatedTime` -> `timeout`
Did you mean `min` instead of `max`? Otherwise `maxHealthcheckDuration` wouldn't actually be an upper bound.
`max`. The healthcheck will time out in 30 sec, except when the last healthcheck took (for example) 22 sec; then it would time out at 22*1.5 = 33 sec, because we already know that the system is behaving slowly and we give it more time. (I'm not strict on what the actual values would be.)
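For concreteness, the `max` version of that rule in Go, with the 22 sec example plugged in (the constants here are only the ones under discussion, not final values):

```go
last := 22 * time.Second
timeout := time.Duration(math.Max(float64(30*time.Second), float64(last)*1.5))
// timeout == 33s: 30s acts as a floor and stretches when the last check was slow
```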
session/grpc.go
Outdated
defer ticker.Stop()
healthClient := grpc_health_v1.NewHealthClient(cc)

hasFailedBefore := false
Nit: simplify the name to `failedBefore`?

I think this should reset at some point - maybe after 5 successful health checks?
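A minimal sketch of that reset, assuming a hypothetical `consecutiveSuccessful` counter and the threshold of 5 floated in this comment:

```go
if err != nil {
	consecutiveSuccessful = 0
	failedBefore = true
	bklog.G(ctx).WithFields(logFields).Warn("healthcheck failed")
} else {
	consecutiveSuccessful++
	if consecutiveSuccessful >= 5 {
		failedBefore = false // forgive the earlier failure after sustained health
	}
}
```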
Force-pushed from 43acf01 to 5ab99c1
session/grpc.go
Outdated
// This healthcheck can erroneously fail in some instances, such as receiving lots of data in a low-bandwidth scenario or too many concurrent builds.
// So, this healthcheck is purposely long, and can tolerate some failures on purpose.

healthcheckStart := time.Now().UTC()
The `.UTC()` is not necessary here, since you are only checking the time elapsed since `healthcheckStart`.
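Background on why it's droppable (and even slightly counterproductive): `time.Now()` carries a monotonic clock reading that `time.Since` uses, and wall-clock conversions such as `.UTC()` strip it.

```go
start := time.Now()          // keeps the monotonic clock reading
// start := time.Now().UTC() // strips it; elapsed time falls back to the wall clock
// ... run the health check ...
elapsed := time.Since(start) // elapsed-time math doesn't care about UTC vs. local
```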
session/grpc.go
Outdated
defer ticker.Stop()
healthClient := grpc_health_v1.NewHealthClient(cc)

failedBefore := false
consecutiveSucessful := 0
maxHealthcheckDuration := 30 * time.Second
This name feels misleading to me because the healthcheck timeout can exceed 30 seconds. Maybe `defaultHealthcheckTimeout`?
failedBefore := false
consecutiveSucessful := 0
maxHealthcheckDuration := 30 * time.Second
lastHealthcheckDuration := time.Duration(0)
Where does `lastHealthcheckDuration` get updated?
session/grpc.go
Outdated
healthcheckStart := time.Now()

timeout := time.Duration(math.Max(float64(defaultHealthcheckDuration), float64(lastHealthcheckDuration)*1.5))
lastHealthcheckDuration = timeout
This doesn't look correct. I think it should be set to the value you use for `actualDuration` (`time.Since(healthcheckStart)`, after the health check).
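Something like this ordering, as a sketch of what's being asked for (`defaultHealthcheckDuration` and the surrounding names are taken from the diff above):

```go
healthcheckStart := time.Now()

timeout := time.Duration(math.Max(float64(defaultHealthcheckDuration), float64(lastHealthcheckDuration)*1.5))
checkCtx, cancel := context.WithTimeout(ctx, timeout)
_, err := healthClient.Check(checkCtx, &grpc_health_v1.HealthCheckRequest{})
cancel()

// record what the check actually took, not the timeout it was allowed
lastHealthcheckDuration = time.Since(healthcheckStart)
```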
@dchw PTAL the CI failure in validation check
session/grpc.go
Outdated
	consecutiveSucessful = 0
	bklog.G(ctx).WithFields(logFields).Warn("healthcheck failed")
} else {
	consecutiveSucessful++
"successful"
LGTM
LGTM but please squash commits
@dchw ping
Signed-off-by: Corey Larson <corey@earthly.dev> Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
Force-pushed from 34639e8 to b637861
Rebased and squashed
This adds configuration values and logging for the gRPC healthcheck ping. By increasing these values, you can get builds with many or large files to work in a bandwidth-starved environment.
Sample configuration section:
This config is completely optional, and leaving it blank (or omitting it) maintains the old behavior.
It also adds logging: a debug message for each completed healthcheck and a warning for each failure. It also logs when it cancels the context, so if you do see odd cancellations in your use case, they have a chance of being traced back to here. Here is a small log snippet with some failing and recovering happening: