This repository has been archived by the owner on Apr 20, 2023. It is now read-only.

Use a stream-based healthcheck #88

Closed · wants to merge 1 commit

Conversation

dchw

@dchw dchw commented Jul 25, 2022

This should allow us to better tolerate adverse network conditions or high-load scenarios.

@dchw dchw requested review from vladaionescu and alexcb July 25, 2022 22:51
@dchw dchw self-assigned this Jul 25, 2022
(A collapsed inline review comment by the author, anchored on this line:)

if resp.Status != grpc_health_v1.HealthCheckResponse_SERVING {
@alexcb alexcb left a comment

interesting find.

I had a quick look at:

func (c *healthClient) Watch(ctx context.Context, in *HealthCheckRequest, opts ...grpc.CallOption) (Health_WatchClient, error) {
	stream, err := c.cc.NewStream(ctx, &Health_ServiceDesc.Streams[0], "/grpc.health.v1.Health/Watch", opts...)
	if err != nil {
		return nil, err
	}
	x := &healthWatchClient{stream}
	if err := x.ClientStream.SendMsg(in); err != nil {
		return nil, err
	}
	if err := x.ClientStream.CloseSend(); err != nil {
		return nil, err
	}
	return x, nil
}

I'm a bit confused by the CloseSend() call occurring.

Since Watch() is called outside of the for loop, it seems like the first call to resp, err := wc.Recv() will receive data.

does it then continue to stream in data on subsequent calls in the for loop?

Given that this code is generic to buildkit, I think it would be wise to contribute it back upstream -- it'll make maintaining our fork easier, and will give us extra confirmation that this is an appropriate fix.
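The consumption pattern under discussion can be sketched with a stand-in interface (statusStream and the status values below are hypothetical simplifications of the generated Health_WatchClient and the grpc_health_v1 types, used here only to illustrate why the Recv loop keeps producing messages after CloseSend):

```go
package main

import (
	"errors"
	"fmt"
)

// status mirrors grpc_health_v1.HealthCheckResponse_ServingStatus;
// statusStream is a hypothetical stand-in for the generated
// Health_WatchClient, reduced to the one method the loop needs.
// CloseSend only closes the client-to-server direction, so the server
// can keep pushing status updates on the stream.
type status int

const (
	serving    status = 1
	notServing status = 2
)

type statusStream interface {
	Recv() (status, error)
}

// watchUntilUnhealthy shows the pattern: Watch is called once
// (producing the stream), then Recv is called in a loop. Recv blocks
// until the server pushes a new message, i.e. until the serving status
// changes, so no per-attempt timeout is involved.
func watchUntilUnhealthy(s statusStream) error {
	for {
		st, err := s.Recv()
		if err != nil {
			return err // broken stream: treat as unhealthy
		}
		if st != serving {
			return errors.New("service no longer serving")
		}
	}
}

// fakeStream replays a fixed sequence of statuses, simulating a server
// that pushes an initial SERVING message and later a change.
type fakeStream struct{ seq []status }

func (f *fakeStream) Recv() (status, error) {
	if len(f.seq) == 0 {
		return 0, errors.New("EOF")
	}
	st := f.seq[0]
	f.seq = f.seq[1:]
	return st, nil
}

func main() {
	s := &fakeStream{seq: []status{serving, serving, notServing}}
	fmt.Println(watchUntilUnhealthy(s))
}
```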

@dchw
Author

dchw commented Jul 25, 2022

does it then continue to stream in data on subsequent calls in the for loop?

It should? I was mostly following what the docs in my link say:

A client can call the Watch method to perform a streaming health-check. The server will immediately send back a message indicating the current serving status. It will then subsequently send a new message whenever the service's serving status changes.

I haven't been able to get the service to not be SERVING, so... one is enough I guess?

@dchw
Author

dchw commented Jul 25, 2022

I'll ask around in the Slack channel to start attempting to upstream this, so I don't just drive-by PR it into buildkit. In the mid-term, I would love to start testing this in our satellites to gather some empirical proof.

@dchw
Author

dchw commented Jul 25, 2022

I did try pushing the Watch into the loop, and I can get more successful checks. However, they stop once large files begin transferring.

@alexcb

alexcb commented Jul 25, 2022

does it then continue to stream in data on subsequent calls in the for loop?

It should? I was mostly following what the docs in my link say:

A client can call the Watch method to perform a streaming health-check. The server will immediately send back a message indicating the current serving status. It will then subsequently send a new message whenever the service's serving status changes.

That makes it sound like it does!

What was the frequency of the

bklog.G(ctx).Debugf("successful healthcheck recieved")

logs that you received while testing this?

Is it possible to make this value configurable for users who want to keep the current 10*time.Second timeout value?

This code is being executed on the buildkit server (and not client), right? If the timeout value is really large, could we get into a situation where we have too many open connections just waiting to timeout, which could starve network resources? (this is why I'm wondering about having a configuration option)

@dchw
Author

dchw commented Jul 25, 2022

What was the frequency

I got one. Then it was waiting for another. I don't think it ever changed from SERVING earthly-client side, so... to be expected? I did try pushing the .Watch() call into the for loop, and it spammed healthchecks just like the regular RPC call. At that point, we aren't any different than what we had before. Here is what the logs say:

time="2022-07-25T22:39:52Z" level=debug msg="starting healthcheck"
time="2022-07-25T22:39:52Z" level=debug msg="successful healthcheck"
time="2022-07-25T22:39:52Z" level=debug msg="starting healthcheck"

Is it possible to make this value configurable for users who want to keep the current 10*time.Second timeout value?

If the streaming semantics are to be believed, a timeout isn't needed, since you aren't waiting for additional requests. You're just hanging out until something changes.

This code is being executed on the buildkit server

Yes.

If the timeout value is really large, could we get into a situation where we have too many open connections just waiting to timeout, which could starve network resources?

Probably. But right now we can't even build on slow internet; I would consider network resource starvation an improvement over falsely cancelled builds... I guess it needs to be tunable for your use case. Assuming we have a timeout, that is.

@alexcb

alexcb commented Jul 25, 2022

I did try pushing the Watch into the loop, and I can get more successful checks. However, they stop once large files begin transferring.

This makes it sound like the health check is still being starved.

This PR doesn't seem to be that different compared to just increasing the timeout ctx, cancel := context.WithTimeout(ctx, 10*time.Second) to a really really large value.

@dchw
Author

dchw commented Jul 25, 2022

This makes it sound like the health check is still being starved.

But differently. It's a push vs. pull thing here. Large timeouts are still polling the status, whereas with streaming it should be more like a push scenario.

The starvation is because buildkit is handling the (slowly) incoming data. Even with large timeouts, we are spamming requests (at whatever rate we configure), and they all would need an answer. Additionally, we will accumulate healthchecks until we can service them due to the ticker.

Using streaming, the client should just tell us when something is wrong. So there is no tuning of timeout values needed, and even if we are starved... it doesn't matter. And the volume here should be much less since we're just reacting.

@alexcb

alexcb commented Jul 26, 2022

(I recognise that you're chatting with buildkit team regarding this issue, and might be going the route of swapping hardcoded values for config values, but just wanted to respond to a couple points, no response is necessary)

Even with large timeouts, we are spamming requests (at whatever rate we configure), and they all would need an answer. Additionally, we will accumulate healthchecks until we can service them due to the ticker.

The ticker will slow down to avoid this from occurring, here are the docs:

// NewTicker returns a new Ticker containing a channel that will send
// the current time on the channel after each tick. The period of the
// ticks is specified by the duration argument. The ticker will adjust
// the time interval or drop ticks to make up for slow receivers.
// The duration d must be greater than zero; if not, NewTicker will
// panic. Stop the ticker to release associated resources.
func NewTicker(d Duration) *Ticker {

So there is no tuning of timeout values needed, and even if we are starved... it doesn't matter

I'm still unsure about this part: if we are starved, how does the message saying things are unhealthy get through?

@vladaionescu
Member

Closing as superseded by https://github.com/earthly/buildkit/pull/89
