A17: Client-Side Health Checking #97

Merged (7 commits) on Oct 9, 2018
325 changes: 325 additions & 0 deletions A17-client-side-health-checking.md
Client-Side Health Checking
----
* Author(s): Mark D. Roth (roth@google.com)
* Approver: a11r
* Status: Draft
* Implemented in:
* Last updated: 2018-08-27
* Discussion at: https://groups.google.com/d/topic/grpc-io/rwcNF4IQLlQ/discussion

## Abstract

This document proposes a design for supporting application-level
health-checking on the client side.

## Background

gRPC has an existing [health-checking
mechanism](https://github.com/grpc/grpc/blob/master/doc/health-checking.md),
which allows server applications to signal that they are not healthy
without actually tearing down connections to clients. This is
used when, for example, a server is itself up but another service that it
depends on is not available.

Currently, this mechanism is used in some [look-aside
load-balancing](https://github.com/grpc/grpc/blob/master/doc/load-balancing.md)
implementations to provide centralized health-checking of backend
servers from the balancer infrastructure. However, there is no
support for using this health-checking mechanism from the client side,
which is needed when not using look-aside load balancing (or when
falling back from look-aside load balancing to directly contacting
backends when the balancers are unreachable).

### Related Proposals

N/A

## Proposal

The gRPC client will be able to be configured to send health-checking RPCs
to each backend that it is connected to. Whenever a backend responds
as unhealthy, the client's LB policy will stop sending requests to that
backend until it reports healthy again.

Note that because the health-checking service requires a service name, the
client will need to be configured with a service name to use. However,
by default, it can use the empty string, which would mean that the
health of all services on a given host/port would be controlled with a
single switch. Semantically, the empty string is used to represent the

**Comment:**

Does this mean the status of the empty string service is independent from the non-empty string services, and there are no implication of their relationships whatsoever?

**Reply (Member Author):**

Correct. In terms of the implementation, the only way that the empty string is special is that at server startup, the empty string is reported as healthy by default, whereas all other services are reported as unknown by default.

overall health of the server, as opposed to the health of any individual
services running on the server.

### Watch-Based Health Checking Protocol

The current health-checking protocol is a unary API, where the client is
expected to periodically poll the server. That was sufficient for
centralized health checking via a look-aside load balancer, where
the health checks come from a small number of clients and where there
is existing infrastructure to periodically poll each client. However,
if we are going to have large numbers of clients issuing health-check
requests, then we will need to convert the health-checking protocol to
a streaming Watch-based API for scalability and bandwidth-usage reasons.

Note that one downside of this approach is that it is conceivable
that the server-side health-checking code somehow fails
to send an update when becoming unhealthy. If the problem is due to
a server that stops polling for I/O, then the problem would be caught
by keepalives, at which point the client would disconnect. But if the
problem is caused by a bug in the health-checking service, then it's
possible that a server could still be responding but has failed to notify
the client that it is unhealthy.

**Comment (Member):**

You could also advertise the server's status via trailer metadata. So active RPCs that are issued while the service is unhealthy could attach the trailer and client could learn of the server's change in disposition that way, as a fallback (in case server failed to deliver update on main health-checking stream). This would not rely on polling, and HTTP/2 header compression should be able to encode the value efficiently if it is included in trailer for every RPC, so overhead would be very low. So if a channel were idle and became unhealthy (and server failed to deliver update), the client would not discover this until its first RPC on that channel.

**Reply (Member Author):**

That could certainly work, but it seems like a lot of overhead in terms of both code complexity (there would be a lot of plumbing involved) and performance (the cost of encoding and decoding the metadata -- it turns out that from a performance perspective, HTTP/2 header compression is actually one our biggest expenses). I'm not sure it's really worthwhile for something that is a very unlikely failure mode to begin with.

**Comment (Member):**

@markdroth, I'm not sure what performance cost you're concerned about since the header would only need to exist when there's a problem. The bigger questions in my mind are 1) how do we know the "scope" of the failure trailer, that is, which services/methods are impacted and 2) how do we know when the service improves?

It does also require issuing an RPC to an unhealthy backend before you learn it is unhealthy, but depending on who you talk to that's more minor.

**Reply (Member Author):**

You may be right that the performance isn't a big deal. But the code complexity still seems significant. I'm just not sure it's worth this much trouble to address something that's extremely unlikely to begin with.


#### API Changes

For the proposed API, see https://github.com/grpc/grpc-proto/pull/33.

We propose to add the following RPC method to the health checking service:

```
rpc Watch(HealthCheckRequest) returns (stream HealthCheckResponse);
```

This is exactly the same as the existing `Check()` method, except that it
is server-side streaming instead of unary. The server will be expected
to send a message immediately upon receiving the message from the client
and will then be expected to send a new message whenever the requested
service's health changes.
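
For illustration only (this is not part of the proposed design), a plain Go client could consume the new method roughly as follows, assuming stubs regenerated from the proto change referenced above. The target address is a placeholder, and `grpc.WithInsecure()` is used purely for brevity.

```
// Sketch: watching the overall server health (empty service name) with
// the proposed streaming Watch() method.
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	conn, err := grpc.Dial("localhost:50051", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	client := healthpb.NewHealthClient(conn)
	stream, err := client.Watch(context.Background(),
		&healthpb.HealthCheckRequest{Service: ""})
	if err != nil {
		log.Fatalf("Watch: %v", err)
	}

	// The server sends one response immediately and another whenever the
	// requested service's health status changes.
	for {
		resp, err := stream.Recv()
		if err != nil {
			log.Fatalf("health watch stream broken: %v", err)
		}
		log.Printf("health status: %v", resp.GetStatus())
	}
}
```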

We will also add a new enum value to the `ServingStatus` enum:

```
SERVICE_UNKNOWN = 3;
```

This serving status will be returned by the `Watch()` method when the
requested service name is initially unknown by the server. Note that the
existing `Check()` unary method fails with status `NOT_FOUND` in this case,
and we are not proposing to change that behavior.

**Comment (Member):**

Why not fail with a NOT_FOUND status? It is good hygiene for server owners to initialize the healthchecking service before the server begins listening on the port, and we should not optimize for the case where the server is listening too early.

**Reply (Member Author):**

I agree that it's good hygiene for servers to initialize the healthchecking status for their services before the server starts listening. However, not all servers will have good hygiene, and clients need to react somehow when they encounter a server that has bad hygiene. If the RPC just fails with NOT_FOUND in this case, then the client has to retry the call using exponential backoff, and if the server sets the service's health status to SERVING while the client is in backoff, it will delay the client from considering the subchannel healthy. I think we provide better behavior for the application by avoiding this potential delay.
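
To make the server side of this concrete, here is a brief, hedged sketch in Go using grpc-go's stock health service (`google.golang.org/grpc/health`), assuming that implementation provides the `Watch()` semantics proposed above. The port and the service name `"foo"` are illustrative.

```
// Sketch: a server registering the standard health service and setting
// statuses before it starts accepting RPCs.
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}

	grpcServer := grpc.NewServer()
	healthServer := health.NewServer()
	healthpb.RegisterHealthServer(grpcServer, healthServer)

	// Good hygiene: set statuses before serving. The empty string covers the
	// overall health of the server; named entries cover individual services.
	// Per this proposal, a service whose status was never set is reported to
	// Watch() callers as SERVICE_UNKNOWN rather than failing with NOT_FOUND.
	healthServer.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	healthServer.SetServingStatus("foo", healthpb.HealthCheckResponse_SERVING)

	// Later, if a dependency becomes unavailable, flip the status; every open
	// Watch() stream for "foo" receives the update immediately:
	//   healthServer.SetServingStatus("foo", healthpb.HealthCheckResponse_NOT_SERVING)

	if err := grpcServer.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```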


### Client Behavior

Client-side health checking will be disabled by default;
service owners will need to explicitly enable it via the [service
config](#service-config-changes) when desired. There will be a channel
argument that can be used on the client to disable health checking even
if it is enabled in the service config.

The health checking client code will be built into the subchannel,

**Comment:**

An alternative approach is that we provide a policy add-on that resides outside of core, and can wrap a policy into a health-check-enabled one. This would not force Java to deal with the protobuf dependency, and also keep the complexity of subchannel implementation down.

/cc @ejona86

**Comment (Member):**

We should also describe what will happen for open source users. Presumably, the functionality will be available but not enabled by default.

**Comment (Member):**

This is addressed at the bottom of the document, but I would recommend moving that up to avoid confusion.

**Reply (Member Author):**

Done.

so that individual LB policies do not need to explicitly add support
for health checking. (However, see also "LB Policies Can Disable Health
Checking When Needed" below.)

Note that when an LB policy sees a subchannel go into state
`TRANSIENT_FAILURE`, it will not be able to tell whether the subchannel
has become disconnected or whether the backend has reported that it
is unhealthy. The LB policy may therefore take some unnecessary but
unharmful steps in response, such as asking the subchannel to connect
(which will be a no-op if it's already connected) or asking the resolver
to re-resolve (which will probably be throttled by the cooldown timer).

**Comment (Contributor):**

A possible solution here is to add another state called "connected" between "connecting" and "ready". The "connected" state tells LB policy to not re-resolve or ask the sub-channel to reconnect. Of course, this will increase the complexity of the state machine and also a re-resolution may anyway be necessary.

**Reply (Member Author):**

These states are part of our public API in all languages; they are directly visible to applications. So to add a new state, we would need to either (a) add the new state to the public API, thus unnecessarily exposing internal implementation details to the application, or (b) use a different set of states when internally propagating the state than when exposing it to the application, which also adds complexity to our code. I think either of those options would be a fair amount of trouble, and that doesn't seem worthwhile, given that the extra work being done here is unharmful.

If this extra work does turn out to be harmful at some point in the future, I think we can find ways to avoid it without having to add a new connectivity state.


#### Client Workflow

When a subchannel first establishes a connection, if health checking
is enabled, the client will immediately start the `Watch()` call to the
backend. The subchannel will stay in state `CONNECTING` until it receives
the first health-check response from the backend (see "Race Conditions
and Subchannel Startup" below).

When the client receives a health-checking response from the backend,
if the response indicates that the backend is healthy, the
subchannel will transition to state `READY`; otherwise, it will transition
to state `TRANSIENT_FAILURE`. Note that this means that when a backend
transitions from unhealthy to healthy, the subchannel's connectivity state
will transition from `TRANSIENT_FAILURE` directly to `READY`, with no stop at
`CONNECTING` in between. This is a new transition that was not [previously
supported](https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md), although it is unlikely to be a problem for applications,
because (a) it won't affect the overall state of the parent channel
unless all subchannels are in the same state at the same time, and (b)
since the connectivity state API can drop states, the application won't
be able to tell that we didn't stop in `CONNECTING` in between anyway.

If the `Watch()` call fails with status `UNIMPLEMENTED`, the client will
act as if health checking is disabled. That is, it will not retry
the health-checking call, but it will treat the channel as healthy
(connectivity state `READY`). However, the client will record a [channel

**Comment (Member):**

This is going to be another point that will be interesting/hard for Java to figure out what to do, as creating trace events is very internal to the transport, including its synchronization.

**Reply (Member Author):**

Noted. I don't think this requires any changes here, though.

trace](https://github.com/grpc/proposal/blob/master/A3-channel-tracing.md)
event indicating that this has happened. It will also log a message at
priority ERROR.

If the `Watch()` call returns any other status, the subchannel will
transition to connectivity state `TRANSIENT_FAILURE` and will retry the
call. To avoid hammering a server that may be experiencing problems,
the client will use exponential backoff between attempts. When the client
receives a message from the server on a given call, the backoff state is
reset, so the next attempt will occur immediately, but any subsequent
attempt will be subject to exponential backoff. When the next attempt
starts, the subchannel will transition to state `CONNECTING`.

When the client subchannel is shutting down or when the backend sends a

**Comment:**

Since the server never closes the stream, will it stop the server from shutting down cleanly? Should the server close all health check streams when shutting down?

**Reply (Member Author):**

I think this is an implementation detail. If the server is shutting down, it will send a GOAWAY, which will cause the client to cancel the stream. If the server has already proactively cancelled the stream, that won't break anything.

GOAWAY, the client will cancel the `Watch()` call. There is no need to
wait for the final status from the server in this case.
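
The client workflow above can be summarized in the following illustrative Go-style sketch. It is not a real implementation: `setState` is a hypothetical stand-in for the subchannel's internal state-reporting hook and `expBackoff` is a hypothetical minimal backoff helper; only the `grpc_health_v1`, `connectivity`, `codes`, and `status` packages are real APIs.

```
// Illustrative sketch of the per-subchannel health-checking loop described
// above. setState and expBackoff are hypothetical placeholders, not gRPC APIs.
package healthcheck

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/connectivity"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/status"
)

// expBackoff is a minimal exponential-backoff helper (hypothetical).
type expBackoff struct{ delay time.Duration }

// Reset is called whenever a message is received on the current call, so the
// next attempt is immediate and only subsequent attempts are backed off.
func (b *expBackoff) Reset() { b.delay = 0 }

func (b *expBackoff) Wait(ctx context.Context) {
	if b.delay == 0 {
		b.delay = time.Second
		return // first attempt after a reset happens immediately
	}
	select {
	case <-time.After(b.delay):
	case <-ctx.Done():
	}
	if b.delay < 2*time.Minute {
		b.delay *= 2
	}
}

// runHealthCheck drives the Watch() call for one subchannel. setState is a
// hypothetical hook for reporting the subchannel's connectivity state.
func runHealthCheck(ctx context.Context, conn *grpc.ClientConn, serviceName string,
	setState func(connectivity.State)) {

	client := healthpb.NewHealthClient(conn)
	var backoff expBackoff

	for ctx.Err() == nil {
		// The subchannel stays in CONNECTING until the first response arrives.
		setState(connectivity.Connecting)
		stream, err := client.Watch(ctx, &healthpb.HealthCheckRequest{Service: serviceName})
		if err == nil {
			for {
				var resp *healthpb.HealthCheckResponse
				resp, err = stream.Recv()
				if err != nil {
					break
				}
				backoff.Reset() // a message was received on this call
				if resp.GetStatus() == healthpb.HealthCheckResponse_SERVING {
					setState(connectivity.Ready)
				} else {
					setState(connectivity.TransientFailure)
				}
			}
		}
		if ctx.Err() != nil {
			return // subchannel shutting down; don't wait for a final status
		}
		if status.Code(err) == codes.Unimplemented {
			// Server doesn't support health checking: treat the subchannel as
			// healthy and do not retry (a real implementation would also log
			// and record a channel trace event here).
			setState(connectivity.Ready)
			return
		}
		// Any other failure: report TRANSIENT_FAILURE and retry with backoff.
		setState(connectivity.TransientFailure)
		backoff.Wait(ctx)
	}
}
```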

#### Race Conditions and Subchannel Startup

Note that because of the inherently asynchronous nature of the network,
whenever a backend transitions from healthy to unhealthy, it may still
receive some number of RPCs that were already in flight from the
client before the client received the notification that the backend is
unhealthy. This race condition lasts approximately the network round
trip time (it includes the one-way trip time for RPCs that were already
sent by the client but not yet received by the server when the server
goes unhealthy, plus the one-way trip time for RPCs sent by the client
between when the server sends the unhealthy notification and when the
client receives it).

When the connection is first established, however, the problem could
affect a larger number of RPCs, because there could be a number of RPCs
queued up waiting for the channel to become connected, which would all
be sent at once. And unlike an already-established connection, the race
condition is avoidable for newly established connections.

To avoid this, the client will wait for the initial health-checking
response before the subchannel goes into state `READY`. However, this
does mean that when health checking is enabled, we require an additional
network RTT before the subchannel can be used. If this becomes a problem
for users that cannot be solved by simply disabling health-checking,
we can consider adding an option in the future to treat the subchannel
as healthy until the initial health-checking response is received.

#### Call Credentials

If there are any call credentials associated with the channel, the client

**Comment (Member):**

FYI: this does not exist in Java.

**Reply (Member Author):**

Noted. I don't think the text needs to be changed, though; Java is just a case where there are no call creds associated with the channel.

will send those credentials with the `Watch()` call. However, we will not
provide a way for the client to explicitly add call credentials for the
`Watch()` call.

### Service Config Changes

We will add the following new message to the [service
config](https://github.com/grpc/grpc/blob/master/doc/service_config.md):

```
message HealthCheckConfig {
  // Service name to use in the health-checking request.
  google.protobuf.StringValue service_name = 1;
}
```

We will then add the following top-level field to the ServiceConfig proto:

**Comment (Contributor):**

Is there a link to ServiceConfig proto?

**Reply (Member Author):**

Added link to service config doc.

The doc shows the config in JSON form, not proto form. I've been meaning to update the docs to publish the proto form as well, but haven't gotten around to it yet. In the interim, I've added an example here that shows how to express this in JSON form.

```
HealthCheckConfig health_check_config = 4;
```

Note that we currently need only this one parameter, but we are putting
it in its own message so that we have the option of adding more parameters
later if needed.

Here is an example of how one would set the health checking service name
to "foo" in the service config in JSON form:

```
"healthCheckConfig": {"serviceName": "foo"}
```
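
As a hedged illustration of how a client application might opt in, the Go snippet below supplies this service config as a default at dial time. It assumes grpc-go's `WithDefaultServiceConfig` dial option and grpc-go's convention of registering the client-side health-check implementation via a blank import of its `health` package; the target address is a placeholder, and `round_robin` is selected explicitly since that is the policy in which client-side health checking will be used.

```
// Sketch: enabling client-side health checking via a default service config.
package main

import (
	"log"

	"google.golang.org/grpc"
	_ "google.golang.org/grpc/health" // registers grpc-go's client health-check function
)

func main() {
	serviceConfig := `{
	  "loadBalancingConfig": [{"round_robin": {}}],
	  "healthCheckConfig": {"serviceName": "foo"}
	}`

	conn, err := grpc.Dial("dns:///my-backend.example.com:50051",
		grpc.WithInsecure(),
		grpc.WithDefaultServiceConfig(serviceConfig))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	// Use conn as usual; backends reporting unhealthy are taken out of rotation.
}
```

The per-channel switch to disable health checking (the channel argument mentioned under "Client Behavior") would be applied at the same place; grpc-go, for example, exposes a `WithDisableHealthCheck` dial option for this purpose.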

### LB Policies Can Disable Health Checking When Needed

There are some cases where an LB policy may want to disable client-side
health-checking. To allow this, LB policies will be able to set the
channel argument mentioned above to inhibit health checking.

This section details how each of our existing LB policies will interact
with health checking.

#### `pick_first`

We do not plan to support health checking with `pick_first`, because it
is not clear what the behavior of `pick_first` would be for unhealthy
channels. The naive approach of treating an unhealthy channel as having
failed and disconnecting would both result in expensive reconnection
attempts and might not actually cause us to select a different backend
after re-resolving.

**Comment (Contributor):**

Please add more details here. Intuitively, it seems like the pick-first policy should skip the address if it is not healthy and pick the next healthy one. The issue of expensive reconnection attempts (subject to exponential backoff) also exists for RR policy, right?

**Reply (Member Author):**

No, this issue does not exist for RR, because RR never disconnects from a backend -- it will always be attempting to connect to all backend addresses in its list. So if a backend connection fails in RR, RR just keeps trying. From the subchannel's point of view, that does not cause any connection churn, regardless of whether the subchannel is actually disconnected or is connected but reporting unhealthy.

In contrast, in PF, if a backend connection fails, PF gives up on it and goes back to its list of addresses to try to find one that will work. That's why this is harder for PF -- it would actually cause us to try to switch to a different address, which would cause a lot of connection churn.


The `pick_first` LB policy will unconditionally set the channel arg to
inhibit health checking.

#### `round_robin`

Unhealthiness in `round_robin` will be handled in the obvious way: the
subchannel will be considered to not be in state `READY`, and picks will
not be assigned to it.

#### `grpclb`

The `grpclb` LB policy will set the channel arg to inhibit health checking
when we are using backend addresses obtained from the balancer, on the
assumption that the balancer is doing centralized health-checking. The arg
will not be set when using fallback addresses, since we want client-side
health checking for those.

### Caveats

Note that health checking will use a stream on every connection.
This stream counts toward the HTTP/2 `MAX_CONCURRENT_STREAMS` setting.
So, for example, any gRPC client or server that sets this to 1 will not
allow any other streams on the connection.

Enabling health checks may also interact with the `MAX_CONNECTION_IDLE`
setting, since the long-lived health-checking stream can keep the
connection from being considered idle.
We do not expect this to be a problem, since we are only implementing
client-side health checking in `round_robin`, which always attempts to
reconnect to every backend after an idle timeout anyway.

## Rationale

We discussed a large number of alternative approaches. The two most
appealing are listed here, along with the reasons why they were not
chosen.

1. Use HTTP/2 SETTINGS type to indicate unhealthiness. Would avoid proto
dependency issues in core. However, would be transport-specific and would
not work for HTTP/2 proxies. Also would not allow signalling different
health status for multiple services on the same port. Would also
require server applications to use different APIs for the centralized
health checking case and the client-side health checking case.

2. Have server stop listening when unhealthy. This would be challenging
to implement in Go due to the structure of the listener API. It would
also not allow signalling different health status for multiple services on
the same port. Would also require server applications to use different
APIs for the centralized health checking case and the client-side health
checking case. Would not allow individual clients to decide whether to
do health checking. In cases where the server's health was flapping up
and down, would cause a lot of overhead from connections being killed
and reestablished.

## Implementation

### C-core

- Implement new `Watch()` method in health checking service
(https://github.com/grpc/grpc/pull/16351).
- Add support for health checking in subchannel code using nanopb
(in progress).

### Java

**Comment (Member):**

Will also need plumbing to allow setting the tracing event.

**Reply (Member Author):**

Added.


- Implement new `Watch()` method in health checking service.
- May need to redesign `IDLE_TIMEOUT`. Currently based on number of RPCs
on each transport (zero vs non-zero). Will need some way to exclude
the health-check RPC from this count.
- GRPC-LB needs to be aware of the setting to disable health checking on
its connections to backends.
- May need yet another runtime registry to find client health checking
implementation. Alternatively may hand-roll the protobuf
serialization.
- Will need plumbing to allow creating a trace event.

### Go

- Implement new `Watch()` method in health checking service.
- Allow RPCs on a subchannel.
- Add mechanism in subconn for the health-check implementation,
optionally invoking it when subchannels are created, and affecting
connectivity state accordingly.
- Implement and register health checking in an optional package (to
enable/configure feature).