Introduce heartbeats (#190)
This PR introduces a client-side heartbeat that can be set via the OpAMP Connection Settings. After further discussion in the last SIG meeting, I worked through the following pros and cons of a client-side versus a server-side heartbeat.

closes #183

## Client Side
A client-side heartbeat would be negotiated between the client and server, with the client's default heartbeat interval set to 30s. On connect, the server can disable heartbeats entirely by setting the field explicitly to 0, or offer a different heartbeat interval to suit its infrastructure's needs. Once connected, the agent starts a timer and every N seconds sends a message that minimally contains its instance id. An HTTP connection will use this interval for its polling as well. Agents that are not directly informed of health changes should also use this interval for component health reporting.

### Pros
1. The agent is able to prevent a proxy from timing out the socket connection
2. The agent's HTTP polling interval is now configurable
3. The server can properly age out and remove dead agents
4. Requires a single successful message for heartbeat processing

### Cons
1. Requires a proto change
2. Extra work for the client
  a. The client will now need to keep track of a heartbeat timer to send this periodic message

## Server Side
A server-side heartbeat would simply be part of opamp-go and would require no changes to the spec. Every N seconds, the server would send an empty message to the client to keep the socket connection active.

### Pros
1. Requires no spec or proto changes
2. Server is in control of the interval

### Cons
1. Server has no way to determine if an Agent is dead
  a. To me, the core value of this change is that the server can rely on receiving a message every N seconds and can take action when it does not
  b. If the client is using an HTTP transport, the server cannot reliably send a heartbeat message to verify the liveness of the agent. Say the server 'requests' a heartbeat from the client, but the client is already dead: the request can only ride on the response to the client's next poll, which never comes.
2. The only way to send an 'empty' ServerToAgent message today is by using the report full state flag. This means the agent's reply is larger than necessary just to keep the connection alive.
  a. We could also add a heartbeat flag as part of the message
3. Requires three successful messages for heartbeat processing
  a. The server would need to successfully send the heartbeat-flagged message over the websocket, the client would then send its heartbeat back in an AgentToServer message, and the server would need to ACK with a ServerToAgent message.

Given the pros and cons above, I prefer an agent heartbeat over a server heartbeat. If we need a proto change anyway to introduce a heartbeat flag, I think the client approach is more effective. Furthermore, this change provides guidance for agents that are not informed of status updates: by setting an explicit heartbeat interval for the client, the server can increase the granularity of the agent's status updates. The client heartbeat approach also matches the design of a conventional dead man's switch, where a sender emits a signal continuously and the receiver reacts only when that signal stops arriving. Flipping that design removes that guarantee and weakens the overall feature.

Finally, the server would also be able to explicitly disconnect misbehaving clients and force them to reconnect with new settings. If the server does not receive a heartbeat within its set window, it could initiate a disconnect to gracefully close the client connection. This approach would not work as well with a server heartbeat, as the server would need to cancel the initial ServerToAgent message.
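
To make the dead man's switch idea concrete, here is a minimal server-side sketch in Python. It is illustrative only and not part of opamp-go: the `LivenessTracker` class, its method names, and the `grace_factor` of two missed intervals are all assumptions made for this example.

```python
import time
from typing import Dict, List, Optional


class LivenessTracker:
    """Hypothetical helper: remembers when each agent was last heard from."""

    def __init__(self, interval_seconds: float, grace_factor: float = 2.0) -> None:
        # An agent counts as dead after interval * grace_factor seconds of
        # silence. The grace factor is an assumption, not part of the proposal.
        self.timeout = interval_seconds * grace_factor
        self.last_seen: Dict[str, float] = {}

    def record_message(self, instance_uid: str, now: Optional[float] = None) -> None:
        # Every message received from the agent refreshes its timestamp.
        self.last_seen[instance_uid] = time.monotonic() if now is None else now

    def expired(self, now: Optional[float] = None) -> List[str]:
        # Agents whose signal stopped arriving within the timeout window.
        current = time.monotonic() if now is None else now
        return sorted(
            uid for uid, seen in self.last_seen.items()
            if current - seen > self.timeout
        )

    def evict(self, now: Optional[float] = None) -> List[str]:
        # Age out dead agents; a real server might also close their sockets.
        dead = self.expired(now)
        for uid in dead:
            del self.last_seen[uid]
        return dead
```

A server would call `record_message` from its message handler and `evict` from a periodic task. Passing `now` explicitly keeps the logic testable without sleeping.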

## References
- [RabbitMQ](https://www.rabbitmq.com/docs/heartbeats#tcp-keepalives)
  - RabbitMQ prefers client heartbeats over server ones, and explicit heartbeats over TCP keepalives
- [MQTT](https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_Toc3901094)
  - Functionally the same as this proposal, they call them keep alives
- [Phoenix Socket Client](https://hexdocs.pm/phoenix/writing_a_channels_client.html#message-format)
  - Works by the client sending a specific heartbeat message
jaronoff97 authored Jul 29, 2024
1 parent fa39d6f commit 58acf6b
Showing 2 changed files with 100 additions and 9 deletions.
21 changes: 20 additions & 1 deletion proto/opamp.proto
@@ -264,6 +264,19 @@ message OpAMPConnectionSettings {
// This field is optional: if omitted the client SHOULD NOT use a client-side certificate.
// This field can be used to perform a client certificate revocation/rotation.
TLSCertificate certificate = 3;

// The Agent MUST periodically send an AgentToServer message if the
// AgentCapabilities_ReportsHeartbeat capability is true. At a minimum the instance_uid
// field MUST be set.
//
// An HTTP Client MUST use the value as polling interval, if heartbeat_interval_seconds is non-zero.
//
// A heartbeat is used to keep the connection active and inform the server that the Agent
// is still alive and active.
//
// If this field has no value or is set to 0, the Agent should not send any heartbeats.
// Status: [Development]
uint64 heartbeat_interval_seconds = 4;
}

// The TelemetryConnectionSettings message is a collection of fields which comprise an
@@ -635,7 +648,13 @@ enum AgentCapabilities {
AgentCapabilities_ReportsHealth = 0x00000800;
// The Agent will report RemoteConfig status via AgentToServer.remote_config_status field.
AgentCapabilities_ReportsRemoteConfig = 0x00001000;

// The Agent can report heartbeats.
// This is specified by the ServerToAgent.OpAMPConnectionSettings.heartbeat_interval_seconds field.
// If this capability is true, but the Server does not set a heartbeat_interval_seconds field, the
// Agent should use its own configured interval, which by default will be 30s. The Server may not
// know the configured interval and should not make assumptions about it.
// Status: [Development]
AgentCapabilities_ReportsHeartbeat = 0x00002000;
// Add new capabilities here, continuing with the least significant unused bit.
}

88 changes: 80 additions & 8 deletions specification.md
@@ -111,6 +111,7 @@ Status: [Beta]
- [OpAMPConnectionSettings.destination_endpoint](#opampconnectionsettingsdestination_endpoint)
- [OpAMPConnectionSettings.headers](#opampconnectionsettingsheaders)
- [OpAMPConnectionSettings.certificate](#opampconnectionsettingscertificate)
- [OpAMPConnectionSettings.heartbeat_interval_seconds](#opampconnectionsettingsheartbeat_interval_seconds)
+ [TelemetryConnectionSettings](#telemetryconnectionsettings)
- [TelemetryConnectionSettings.destination_endpoint](#telemetryconnectionsettingsdestination_endpoint)
- [TelemetryConnectionSettings.headers](#telemetryconnectionsettingsheaders)
@@ -229,6 +230,7 @@ OpAMP supports the following functionality:
[OTLP](https://opentelemetry.io/docs/specs/otlp/)-compatible
backend to monitor Agent's process metrics such as CPU or RAM usage, as well
as Agent-specific metrics such as rate of data processing.
* Agent heartbeating.
* Management of downloadable Agent-specific packages.
* Secure auto-updating capabilities (both upgrading and downgrading of the
Agents).
@@ -357,8 +359,7 @@ The format of each WebSocket message is the following:
```

The unencoded `header` is a 64 bit unsigned integer. In the WebSocket message the 64 bit
unencoded `header` value is encoded into bytes using [Base 128 Varint](https://developers.google.com/protocol-buffers/docs/encoding#varints) format. The
number of the bytes that the encoded `header` uses depends on the value of unencoded
`header` and can be anything between 1 and 10 bytes.
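
As an illustration (not part of the specification text), the Base 128 Varint encoding of the `header` can be sketched in Python as follows. The function names `encode_varint` and `decode_varint` are hypothetical, and the sketch assumes the standard Protobuf varint layout: seven payload bits per byte, least significant group first, with the high bit set on every byte except the last.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode an unsigned 64-bit integer as a Base 128 Varint."""
    if not 0 <= value < 1 << 64:
        raise ValueError("header must fit in an unsigned 64-bit integer")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)  # last byte has the continuation bit clear
            return bytes(out)


def decode_varint(buf: bytes) -> Tuple[int, int]:
    """Decode a varint from the start of buf; return (value, bytes consumed)."""
    value = 0
    for i, byte in enumerate(buf[:10]):  # a 64-bit varint needs at most 10 bytes
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            if value >= 1 << 64:
                raise ValueError("varint overflows 64 bits")
            return value, i + 1
    raise ValueError("truncated or over-long varint")
```

The second element returned by `decode_varint` is the offset at which the `data` bytes begin, which is how a reader can split a WebSocket message into `header` and `data`.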

@@ -369,8 +370,7 @@ compliant with this specification SHOULD check that the value of the `header` is
to 0 and if it is not SHOULD assume that the WebSocket message is malformed.

The `data` field contains the bytes that represent the AgentToServer or ServerToAgent
message encoded in [Protobuf binary wire format](https://developers.google.com/protocol-buffers/docs/encoding).

Note that both `header` and `data` fields contain a variable number of bytes.
The decoding Base 128 Varint algorithm for the `header` knows when to stop based on the
@@ -417,6 +417,13 @@ message may also be sent by the Client in response to the Server making a remote
configuration offer to the Agent and Agent reporting that it accepted the
configuration.

If the Client is capable of sending heartbeats the Client SHOULD set
ReportsHeartbeat capability. If ReportsHeartbeat capability is set the
Client SHOULD send heartbeats periodically. The interval between the
heartbeats SHOULD be 30 seconds, unless a different value is configured
on the Client side or unless a different interval is offered by the Server via
`OpAMPConnectionSettings.heartbeat_interval_seconds` field.

See sections under the [Operation](#operation) section for the details of the
message sequences.

@@ -444,9 +451,13 @@ deliver to the Agent (such as for example a new remote configuration).

The default polling interval when the Agent does not have anything to deliver is 30
seconds. This polling interval SHOULD be configurable on the Client.
If the client has previously received and accepted OpAMP connection settings
then the value of `OpAMPConnectionSettings.heartbeat_interval_seconds`
SHOULD be used as the polling interval.

When using HTTP transport the sequence of messages is exactly the same as it is
when using the WebSocket transport. The only difference is in the timing:

- When the Server wants to send a message to the Agent, the Server needs to wait
for the Client to poll the Server and establish an HTTP request over which the Server's
message can be sent back as an HTTP response.
@@ -579,7 +590,13 @@ enum AgentCapabilities {
ReportsHealth = 0x00000800;
// The Agent will report RemoteConfig status via AgentToServer.remote_config_status field.
ReportsRemoteConfig = 0x00001000;
// The Agent can report heartbeats.
// This is specified by the ServerToAgent.OpAMPConnectionSettings.heartbeat_interval_seconds field.
// If this capability is true, but the Server does not set a heartbeat_interval_seconds field, the
// Agent should use its own configured interval, which by default will be 30s. The Server may not
// know the configured interval and should not make assumptions about it.
// Status: [Development]
ReportsHeartbeat = 0x00002000;
// Add new capabilities here, continuing with the least significant unused bit.
}
```
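
As an illustration (not part of the specification text), a Server could test for the new capability with a plain bitmask check. The constant values below are copied from the enum above. The Python names and the `has_capability` helper are assumptions made for this sketch.

```python
# Bit values copied from the AgentCapabilities enum.
REPORTS_HEALTH = 0x00000800
REPORTS_REMOTE_CONFIG = 0x00001000
REPORTS_HEARTBEAT = 0x00002000


def has_capability(capabilities: int, flag: int) -> bool:
    """Return True when every bit of flag is set in the capabilities bitmask."""
    return (capabilities & flag) == flag


# Example: an Agent that reports health and heartbeats, but not remote config.
agent_capabilities = REPORTS_HEALTH | REPORTS_HEARTBEAT
```

With these values, `has_capability(agent_capabilities, REPORTS_HEARTBEAT)` is true and `has_capability(agent_capabilities, REPORTS_REMOTE_CONFIG)` is false.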
@@ -923,7 +940,7 @@ message ServerToAgentCommand {
```

The ServerToAgentCommand message is sent when the Server wants the Agent to restart.
This message must only contain the command, instance_uid, and capabilities fields. All other fields
will be ignored.

## Operation
@@ -1127,8 +1144,8 @@ runs.
The following attributes SHOULD be included:

- os.type, os.version - to describe where the Agent runs.
- host.\* to describe the host the Agent runs on.
- cloud.\* to describe the cloud where the host is located.
- any other relevant Resource attributes that describe this Agent and the
environment it runs in.
- any user-defined attributes that the end user would like to associate with
@@ -1606,6 +1623,7 @@ connection types.
```

The sequence is the following:

- (1) The Client connects to the Server. The Client SHOULD use regular TLS and validate
the Server's identity. The Agent may also use a bootstrap client certificate that is
already trusted by the Server. (Note: the distribution and installation method of
@@ -1829,6 +1847,7 @@ message OpAMPConnectionSettings {
string destination_endpoint = 1;
Headers headers = 2;
TLSCertificate certificate = 3;
uint64 heartbeat_interval_seconds = 4;
}
```

@@ -1854,6 +1873,56 @@ for this connection.
This field is optional: if omitted the client SHOULD NOT use a client-side certificate.
This field can be used to perform a client certificate revocation/rotation.

##### OpAMPConnectionSettings.heartbeat_interval_seconds

Status: [Development]

If the ReportsHeartbeat capability is true, the Client MUST use the offered heartbeat
interval to periodically send an AgentToServer message. If the capability is true
and the Server sets heartbeat_interval_seconds to 0, Agent heartbeats should be disabled.
At a minimum the `AgentToServer.instance_uid` field MUST be set in the heartbeats.
An HTTP-based client MUST use the heartbeat interval as its polling interval.

Any AgentToServer message where instance_uid field is set is considered a
valid heartbeat. Note that it is not necessary to send a separate AgentToServer
message just for heartbeating purposes if another AgentToServer message
containing other data was just sent. The Agent must count heartbeating interval
from the last AgentToServer message sent.
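
As an illustration (not part of the specification text), the rule that the interval is counted from the last AgentToServer message sent can be sketched as a small timer. The `HeartbeatTimer` class and its method names are hypothetical. Timestamps are passed in explicitly so the logic can be exercised without real clocks.

```python
from typing import Optional


class HeartbeatTimer:
    """Hypothetical helper: decides when the next heartbeat is due."""

    def __init__(self, interval_seconds: float, now: float) -> None:
        self.interval = interval_seconds  # 0 means heartbeats are disabled
        self.last_sent = now

    def message_sent(self, now: float) -> None:
        # Call this for every AgentToServer message, not only heartbeats:
        # any message carrying instance_uid already serves as a heartbeat.
        self.last_sent = now

    def seconds_until_due(self, now: float) -> Optional[float]:
        # None signals that heartbeats are disabled.
        if self.interval <= 0:
            return None
        return max(0.0, self.last_sent + self.interval - now)

    def is_due(self, now: float) -> bool:
        remaining = self.seconds_until_due(now)
        return remaining is not None and remaining == 0.0
```

Because `message_sent` resets the countdown, an Agent that reports status at second 20 of a 30 second interval does not owe a heartbeat until second 50.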

A heartbeat is used to keep a connection active and inform the server that
the Agent is still alive and active. A server could use the heartbeat to make decisions about
the liveness of the connected Agent.

The flow for negotiating a heartbeat is as follows:

```
┌──────────┐                       ┌──────────┐
│          │  (1) Connect          │          │
│          ├──────────────────────►│          │
│          │                       │          │
│          │  (2) Set Heartbeat    │          │
│          │◄──────────────────────┤          │
│          │      Interval         │          │
│          │                       │          │
│  Agent   │  (3) Send Heartbeat   │  Server  │
│          ├──────────────────────►│          │
│          │                       │          │
│          │  ... heartbeat        │          │
│          │      interval         │          │
│          │                       │          │
│          │  (4) Send Heartbeat   │          │
│          ├──────────────────────►│          │
│          │                       │          │
└──────────┘                       └──────────┘
```

1. The agent connects to the server and optionally sets the ReportsHeartbeat capability. If the Agent does not set this capability, the Server should not expect to receive heartbeats.
2. If the Agent sets the ReportsHeartbeat capability, the server MAY respond by setting an interval in the heartbeat_interval_seconds field within the OpAMPConnectionSettings message. The value can either be the desired interval, or `0`, indicating that the client should not send heartbeats. 30s is the recommended default interval.
3. If the Agent sets the ReportsHeartbeat capability AND the server hasn't disabled heartbeats, the Agent MUST send a heartbeat message every period, specified by the interval set by the server or using the agent's configured heartbeat interval.
4. The Agent will continue to send heartbeats on its configured interval while alive.

The Agent can decide not to send heartbeats by not setting the ReportsHeartbeat capability. The Server can decide to not receive heartbeats by responding with a value of `0` seconds in the OpAMPConnectionSettings.heartbeat_interval_seconds field.
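
As an illustration (not part of the specification text), the negotiation rules above can be condensed into one function. The function name and parameters are hypothetical. The sketch assumes that `offered=None` models the case where the Server has not offered OpAMPConnectionSettings at all, since a plain proto3 `uint64` field cannot by itself distinguish "unset" from `0`.

```python
from typing import Optional

DEFAULT_HEARTBEAT_INTERVAL_SECONDS = 30  # recommended default from the text


def resolve_heartbeat_interval(
    reports_heartbeat: bool,
    offered: Optional[int],
    configured: int = DEFAULT_HEARTBEAT_INTERVAL_SECONDS,
) -> Optional[int]:
    """Return the seconds between heartbeats, or None when heartbeats are off."""
    if not reports_heartbeat:
        return None  # the Agent opted out by not setting the capability
    if offered is None:
        return configured  # no offer from the Server: use the Agent's own interval
    if offered == 0:
        return None  # the Server explicitly disabled heartbeats
    return offered  # the Server's offered interval takes precedence
```

For example, `resolve_heartbeat_interval(True, None)` falls back to 30, while `resolve_heartbeat_interval(True, 0)` returns `None`. An HTTP-based client could feed the same result into its polling loop.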

#### TelemetryConnectionSettings

The TelemetryConnectionSettings message is a collection of fields which comprise an
@@ -2941,6 +3010,9 @@ response and MAY optionally set
header to indicate when SHOULD the Client attempt to reconnect. The Client SHOULD
honour the corresponding requirements of HTTP specification.

Note: a Retry-After header SHOULD be used only for the client's attempts to reconnect to the server.
A client should not attempt to send regular [heartbeat](#opampconnectionsettingsheartbeat_interval_seconds) messages while the Agent is reconnecting.

The minimum recommended retry interval is 30 seconds.

## Security