This repository has been archived by the owner on Apr 20, 2023. It is now read-only.

Use a stream-based healthcheck #88

Closed · wants to merge 1 commit

Conversation

dchw

@dchw dchw commented Jul 25, 2022

This should allow us to better tolerate adverse network conditions or high-load scenarios.

@dchw dchw requested review from vladaionescu and alexcb July 25, 2022 22:51
@dchw dchw self-assigned this Jul 25, 2022
(A collapsed inline review comment by the author, anchored on this line:)

if resp.Status != grpc_health_v1.HealthCheckResponse_SERVING {
@alexcb alexcb left a comment

interesting find.

I had a quick look at:

func (c *healthClient) Watch(ctx context.Context, in *HealthCheckRequest, opts ...grpc.CallOption) (Health_WatchClient, error) {
	stream, err := c.cc.NewStream(ctx, &Health_ServiceDesc.Streams[0], "/grpc.health.v1.Health/Watch", opts...)
	if err != nil {
		return nil, err
	}
	x := &healthWatchClient{stream}
	if err := x.ClientStream.SendMsg(in); err != nil {
		return nil, err
	}
	if err := x.ClientStream.CloseSend(); err != nil {
		return nil, err
	}
	return x, nil
}

I'm a bit confused by the CloseSend() call occurring.

Since Watch() is called outside of the for loop, it seems like the first call to resp, err := wc.Recv() will receive data.

does it then continue to stream in data on subsequent calls in the for loop?

Given that this code is generic to buildkit, I think it would be wise to contribute it back upstream -- it'll make maintaining our fork easier, and will give us extra confirmation that this is an appropriate fix.
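The consumption pattern under discussion can be sketched with a stand-in interface (statusStream and the status values below are hypothetical simplifications of the generated Health_WatchClient and the grpc_health_v1 types, used here only to illustrate why the Recv loop keeps producing messages after CloseSend):

```go
package main

import (
	"errors"
	"fmt"
)

// status mirrors grpc_health_v1.HealthCheckResponse_ServingStatus;
// statusStream is a hypothetical stand-in for the generated
// Health_WatchClient, reduced to the one method the loop needs.
// CloseSend only closes the client-to-server direction, so the server
// can keep pushing status updates on the stream.
type status int

const (
	serving    status = 1
	notServing status = 2
)

type statusStream interface {
	Recv() (status, error)
}

// watchUntilUnhealthy shows the pattern: Watch is called once
// (producing the stream), then Recv is called in a loop. Recv blocks
// until the server pushes a new message, i.e. until the serving status
// changes, so no per-attempt timeout is involved.
func watchUntilUnhealthy(s statusStream) error {
	for {
		st, err := s.Recv()
		if err != nil {
			return err // broken stream: treat as unhealthy
		}
		if st != serving {
			return errors.New("service no longer serving")
		}
	}
}

// fakeStream replays a fixed sequence of statuses, simulating a server
// that pushes an initial SERVING message and later a change.
type fakeStream struct{ seq []status }

func (f *fakeStream) Recv() (status, error) {
	if len(f.seq) == 0 {
		return 0, errors.New("EOF")
	}
	st := f.seq[0]
	f.seq = f.seq[1:]
	return st, nil
}

func main() {
	s := &fakeStream{seq: []status{serving, serving, notServing}}
	fmt.Println(watchUntilUnhealthy(s))
}
```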

@dchw
Author

dchw commented Jul 25, 2022

does it then continue to stream in data on subsequent calls in the for loop?

It should? I was mostly following what the docs in my link say:

A client can call the Watch method to perform a streaming health-check. The server will immediately send back a message indicating the current serving status. It will then subsequently send a new message whenever the service's serving status changes.

I haven't been able to get the service to not be SERVING, so... one is enough I guess?

@dchw
Author

dchw commented Jul 25, 2022

I'll ask around in the Slack channel to start attempting to upstream this, so I don't just drive-by PR it into buildkit. In the mid-term, I would love to start testing this in our satellites to gather some empirical proof.

@dchw
Author

dchw commented Jul 25, 2022

I did try pushing the Watch into the loop, and I can get more successful checks. However, they stop once large files begin transferring.

@alexcb

alexcb commented Jul 25, 2022

does it then continue to stream in data on subsequent calls in the for loop?

It should? I was mostly following what the docs in my link say:

A client can call the Watch method to perform a streaming health-check. The server will immediately send back a message indicating the current serving status. It will then subsequently send a new message whenever the service's serving status changes.

That makes it sound like it does!

What was the frequency of the

bklog.G(ctx).Debugf("successful healthcheck recieved")

logs that you received while testing this?

Is it possible to make this value configurable for users who want to keep the current 10*time.Second timeout value?

This code is being executed on the buildkit server (and not client), right? If the timeout value is really large, could we get into a situation where we have too many open connections just waiting to timeout, which could starve network resources? (this is why I'm wondering about having a configuration option)

@dchw
Author

dchw commented Jul 25, 2022

What was the frequency

I got one. Then it was waiting for another. I don't think it ever changed from SERVING earthly-client side, so... to be expected? I did try pushing the .Watch() call into the for loop, and it spammed healthchecks just like the regular RPC call. At that point, we aren't any different than what we had before. Here is what the logs say:

time="2022-07-25T22:39:52Z" level=debug msg="starting healthcheck"
time="2022-07-25T22:39:52Z" level=debug msg="successful healthcheck"
time="2022-07-25T22:39:52Z" level=debug msg="starting healthcheck"

Is it possible to make this value configurable for users who want to keep the current 10*time.Second timeout value?

If the streaming semantics are to be believed, a timeout isn't needed, since you aren't waiting for additional requests. You're just hanging out until something changes.

This code is being executed on the buildkit server

Yes.

If the timeout value is really large, could we get into a situation where we have too many open connections just waiting to timeout, which could starve network resources?

Probably. But right now we can't even build on slow internet; I would consider network resource starvation an improvement over falsely cancelled builds... I guess it needs to be tunable for your use case. Assuming we have a timeout, that is.

@alexcb

alexcb commented Jul 25, 2022

I did try pushing the Watch into the loop, and I can get more successful checks. However, they stop once large files begin transferring.

This makes it sound like the health check is still being starved.

This PR doesn't seem to be that different compared to just increasing the timeout ctx, cancel := context.WithTimeout(ctx, 10*time.Second) to a really really large value.

@dchw
Author

dchw commented Jul 25, 2022

This makes it sound like the health check is still being starved.

But differently. It's a push vs. pull thing here. Large timeouts are still polling the status, whereas with streaming it should be more like a push scenario.

The starvation is because buildkit is handling the (slowly) incoming data. Even with large timeouts, we are spamming requests (at whatever rate we configure), and they all would need an answer. Additionally, we will accumulate healthchecks until we can service them due to the ticker.

Using streaming, the client should just tell us when something is wrong. So there is no tuning of timeout values needed, and even if we are starved... it doesn't matter. And the volume here should be much less since we're just reacting.

@alexcb

alexcb commented Jul 26, 2022

(I recognise that you're chatting with buildkit team regarding this issue, and might be going the route of swapping hardcoded values for config values, but just wanted to respond to a couple points, no response is necessary)

Even with large timeouts, we are spamming requests (at whatever rate we configure), and they all would need an answer. Additionally, we will accumulate healthchecks until we can service them due to the ticker.

The ticker will slow down to avoid this from occurring, here are the docs:

// NewTicker returns a new Ticker containing a channel that will send
// the current time on the channel after each tick. The period of the
// ticks is specified by the duration argument. The ticker will adjust
// the time interval or drop ticks to make up for slow receivers.
// The duration d must be greater than zero; if not, NewTicker will
// panic. Stop the ticker to release associated resources.
func NewTicker(d Duration) *Ticker {

So there is no tuning of timeout values needed, and even if we are starved... it doesn't matter

I'm still unsure about this part: if we are starved, how does the message saying things are unhealthy get through?

@vladaionescu
Member

Closing as superseded by https://github.com/earthly/buildkit/pull/89
