-
Notifications
You must be signed in to change notification settings - Fork 2
Conversation
…ditions, or high load scenarios.
} | ||
if resp.Status != grpc_health_v1.HealthCheckResponse_SERVING { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting find.
I had a quick look at:
func (c *healthClient) Watch(ctx context.Context, in *HealthCheckRequest, opts ...grpc.CallOption) (Health_WatchClient, error) {
stream, err := c.cc.NewStream(ctx, &Health_ServiceDesc.Streams[0], "/grpc.health.v1.Health/Watch", opts...)
if err != nil {
return nil, err
}
x := &healthWatchClient{stream}
if err := x.ClientStream.SendMsg(in); err != nil {
return nil, err
}
if err := x.ClientStream.CloseSend(); err != nil {
return nil, err
}
return x, nil
}
I'm a bit confused by the CloseSend()
call occurring.
since Watch()
is called outside of the for loop, on the first call to resp, err := wc.Recv()
, it seems like it'll receive data.
does it then continue to stream in data on subsequent calls in the for loop?
Given that this code is generic to buildkit, I think it would be wise to contribute it back upstream -- it'll make maintaining our fork easier, and will give us extra confirmation that this is an appropriate fix.
It should? I was mostly following what the docs in my link say:
I haven't been able to get the service to not be |
I'lll ask around in the Slack channel to start attempting to upstream so I don't just driveby PR this into |
I did try pushing the |
That makes it sound like it does! What was the frequency of the
logs that you received while testing this? Is it possible to make this value configurable for users who want to keep the current This code is being executed on the buildkit server (and not client), right? If the timeout value is really large, could we get into a situation where we have too many open connections just waiting to timeout, which could starve network resources? (this is why I'm wondering about having a configuration option) |
I got one. Then it was waiting for another. I don't think it ever changed from
If the streaming semantics are to be believed, a timeout isn't needed, since you aren't waiting for additional requests. You're just hanging out until something changes.
Yes.
Probably. But right now we can't even build on slow internet, I would consider network resource starvation an improvement over falsely cancelled builds... guess it needs to be tunable for your usecase. Assuming we have a timeout, that is. |
This makes it sound like the health check is still being starved. This PR doesn't seem to be that different compared to just increasing the timeout |
But differently. Its a push vs. pull thing here. Large timeouts are still polling the status, whereas with streaming it should be more like a push scenario. The starvation is because Using streaming, the client should just tell us when something is wrong. So there is no tuning of timeout values needed, and even if we are starved... it doesn't matter. And the volume here should be much less since we're just reacting. |
(I recognise that you're chatting with buildkit team regarding this issue, and might be going the route of swapping hardcoded values for config values, but just wanted to respond to a couple points, no response is necessary)
The ticker will slow down to avoid this from occurring, here are the docs:
I'm still unsure about this part: if we are starved, how does the message saying things are unhealthy get through? |
Closing as superseeded by https://github.com/earthly/buildkit/pull/89 |
This should allow us to better tolerate adverse network conditions, or high load scenarios.