Extend AgentHealth message to accommodate component health #165

The addition of component status reporting to the collector will allow us to have a more nuanced view of collector health by having visibility into the health of individual components. At a recent SIG meeting we discussed the possibility of extending the `AgentHealth` message to include a map that can be used to express the health of multiple components. There are open questions as to how we want to restructure the message and how we can ensure that it remains agent agnostic: do we want to keep the currently defined `AgentHealth` message for overall health and add a map for component health, or do we want to express everything as part of the map?

I'll suggest the following as a starting point, but expect that it is probably too coupled to our ideas around collector health. We can refine this as needed. This also retains the current `AgentHealth` fields as overall health. I suspect dropping the status enum and replacing it with a string would be a good middle ground for making the `ComponentHealth` message more generalizable for non-collector agents. If anyone has thoughts or opinions, please comment.
Discussed in the WG meeting today. I believe @mwear plans to take a look at the following:
Based on the discussion at the last WG meeting, I have experimented with two additional ways to model the `AgentHealth` message.

**Option 1: Enum to String**

```proto
message AgentHealth {
    bool healthy = 1;
    fixed64 start_time_unix_nano = 2;
    string last_error = 3;
    map<string, ComponentHealth> component_health_map = 4;
}

message ComponentHealth {
    bool healthy = 1;
    fixed64 timestamp = 2;
    string status = 3;
    string last_error = 4;
}
```

**Option 2: Recursive Message**

```proto
message AgentHealth {
    bool healthy = 1;
    fixed64 timestamp = 2;
    string status = 3;
    string last_error = 4;
    map<string, AgentHealth> agent_health_map = 5;
}
```

I prefer the recursive representation by far. It allows us to cleanly model collector status in terms of component status. The idea is that we'd like to determine the health of the collector from the health of its pipelines, which in turn can be determined from the health of the components in each pipeline. Below is an example of how this could work using the recursive message.

**Example Component Health**

```go
health := &protobufs.AgentHealth{
Healthy: false,
Timestamp: 1234,
Status: "StatusRecoverableError",
LastError: "Resource Exhausted",
AgentHealthMap: map[string]*protobufs.AgentHealth{
"pipeline:traces/0": {
Healthy: false,
Timestamp: 1234,
Status: "StatusRecoverableError",
LastError: "Resource Exhausted",
AgentHealthMap: map[string]*protobufs.AgentHealth{
"receiver:otlp": {
Healthy: true,
Timestamp: 100,
Status: "StatusOK",
LastError: "",
},
"processor:batch": {
Healthy: true,
Timestamp: 100,
Status: "StatusOK",
LastError: "",
},
"exporter:jaeger": {
Healthy: false,
Timestamp: 1234,
Status: "StatusRecoverableError",
LastError: "Resource Exhausted",
},
},
},
"pipeline:metrics/0": {
Healthy: true,
Timestamp: 200,
Status: "StatusOK",
LastError: "",
AgentHealthMap: map[string]*protobufs.AgentHealth{
"receiver:otlp": {
Healthy: true,
Timestamp: 100,
Status: "StatusOK",
LastError: "",
},
"processor:transform": {
Healthy: true,
Timestamp: 100,
Status: "StatusOK",
LastError: "",
},
"processor:batch": {
Healthy: true,
Timestamp: 100,
Status: "StatusOK",
LastError: "",
},
"exporter:otlp": {
Healthy: true,
Timestamp: 200,
Status: "StatusOK",
LastError: "",
},
},
},
},
}
```

The discussion that follows is specific to how we'd like to use this for OTel. The process described will be semantic conventions for us and will not be compulsory for other uses of the protocol.

As part of the component status reporting work (open-telemetry/opentelemetry-collector#8169), we have introduced rules for aggregating multiple statuses into a single effective status. At each level of aggregation we can use the timestamp and/or error information from the most recent event that is representative of the current aggregated status. For non-error statuses we can use the most recent status event, and for error statuses we can use the most recent event with the matching aggregated error status. To streamline this process, I introduced a helper function for aggregation. In this example, the aggregate status of the `pipeline:traces/0` pipeline, and of the collector overall, is `StatusRecoverableError`, taken from the `exporter:jaeger` component.

The timestamp for the events represents the time of the state change into the current status. This is a slightly different use of the timestamp than in the current `AgentHealth` message, which calls this field `start_time_unix_nano`.
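For illustration, here is a minimal sketch of what such bottom-up aggregation could look like over the recursive message. The struct, the `Aggregate` helper, and the `severity` ranking are assumptions made for this example, not the actual collector implementation:

```go
package health

// ComponentHealth mirrors the recursive message for this sketch.
type ComponentHealth struct {
	Healthy            bool
	Timestamp          uint64
	Status             string
	LastError          string
	ComponentHealthMap map[string]*ComponentHealth
}

// severity ranks statuses so the "worst" child wins; illustrative only.
var severity = map[string]int{
	"StatusOK":               0,
	"StatusRecoverableError": 1,
	"StatusPermanentError":   2,
	"StatusFatalError":       3,
}

// Aggregate recomputes a node's fields from its children, applying
// "worst status wins, ties broken by most recent timestamp", and keeping
// the error info from the event that determined the aggregate status.
func Aggregate(h *ComponentHealth) {
	if len(h.ComponentHealthMap) == 0 {
		return // leaf components report their own status directly
	}
	var agg *ComponentHealth
	for _, child := range h.ComponentHealthMap {
		Aggregate(child)
		if agg == nil ||
			severity[child.Status] > severity[agg.Status] ||
			(child.Status == agg.Status && child.Timestamp > agg.Timestamp) {
			agg = child
		}
	}
	h.Healthy, h.Status = agg.Healthy, agg.Status
	h.Timestamp, h.LastError = agg.Timestamp, agg.LastError
}
```

Applied to the example above, `pipeline:traces/0` and the root would both resolve to `StatusRecoverableError` with the timestamp and error taken from `exporter:jaeger`.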
A recursive approach would also be great for the operator group. We have been discussing distributed collector configuration in Kubernetes, and in that world we would probably want to report status for each component that makes up a collector group. Allowing us to represent the DAG of config we will be applying in Kubernetes via this recursive message would enable just that. For example, if an OTLP exporter CRD shared by multiple collector groups started failing, we would be able to represent it as a top-level key in addition to the OTel collector CRDs, so the bridge could report itself as healthy while the failing CRD beneath it is not (see the sketch below).
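A hedged sketch of what such a report could look like with the recursive message; the CRD naming scheme and all values here are made up for illustration:

```go
// Hypothetical operator-group report: a shared exporter CRD appears as a
// top-level key alongside the collector CRDs that reference it. All key
// names are illustrative.
health := &protobufs.AgentHealth{
	Healthy: false,
	AgentHealthMap: map[string]*protobufs.AgentHealth{
		"exporter-crd:otlp/shared": { // shared by both collector groups
			Healthy:   false,
			Status:    "StatusRecoverableError",
			LastError: "connection refused",
		},
		"collector-crd:gateway": {
			Healthy: true,
			Status:  "StatusOK",
		},
		"collector-crd:daemonset": {
			Healthy: true,
			Status:  "StatusOK",
		},
	},
}
```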
Beyond the operator's needs, in the future an agent would be able to report minimal change sets with this approach by specifying only the keys on the path to the component whose health changed. From Matt's example, if the metrics OTLP exporter suddenly had a problem, something like the following could be reported:
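The snippet from the original comment isn't preserved in this thread; a sketch consistent with Matt's example might look like this, with only the changed path populated:

```go
// Minimal change set: only pipeline:metrics/0 and its failing exporter are
// included; unchanged components are omitted entirely. Timestamp and error
// values are illustrative.
delta := &protobufs.AgentHealth{
	Healthy: false,
	AgentHealthMap: map[string]*protobufs.AgentHealth{
		"pipeline:metrics/0": {
			Healthy: false,
			AgentHealthMap: map[string]*protobufs.AgentHealth{
				"exporter:otlp": {
					Healthy:   false,
					Timestamp: 2345,
					Status:    "StatusRecoverableError",
					LastError: "Resource Exhausted",
				},
			},
		},
	},
}
```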
Doing it this way would allow us to minimize the messages sent, which would be a net benefit for users.
Question: will we have OTel-wide semantic conventions that define the possible values for the `status` field?

A few random comments: …
This is a good question. I don't have a definitive answer, and I think it could go either way. We have a set of statuses we expect for the collector, but they are collector specific. We could say that the values are agent specific, but we could formalize the ones we know for the agents we care about. I think we should discuss this further and decide what to do.

If I understand your other comments correctly, you are suggesting that we keep a version of the `AgentHealth` message and make the component health map recursive. Here are two variations along those lines.

**Recursive Option 2:**

```proto
message AgentHealth {
    bool healthy = 1;
    fixed64 start_time_unix_nano = 2;
    string last_error = 3;
    string status = 4;
    fixed64 timestamp = 5;
    map<string, ComponentHealth> component_health_map = 6;
}

message ComponentHealth {
    bool healthy = 1;
    string status = 2;
    string last_error = 3;
    fixed64 timestamp = 4;
    map<string, ComponentHealth> component_health_map = 5;
}
```

**Recursive Option 3:**

```proto
message AgentHealth {
    bool healthy = 1 [deprecated=true];
    fixed64 start_time_unix_nano = 2;
    string last_error = 3 [deprecated=true];
    map<string, ComponentHealth> component_health_map = 4;
}

message ComponentHealth {
    bool healthy = 1;
    fixed64 timestamp = 2;
    string status = 3;
    string last_error = 4;
    map<string, ComponentHealth> component_health_map = 5;
}
```

Is either of these on the right track?
I would put in a vote for another variation, which is how I interpreted the suggestions @tigrannajaryan made:

```proto
message ComponentHealth {
    bool healthy = 1;
    fixed64 start_time_unix_nano = 2;
    string last_error = 3;
    fixed64 timestamp = 4;
    string status = 5;
    map<string, ComponentHealth> component_health_map = 6;
}
```

While we do lose the use of "agent" anywhere in the message, I think this provides the greatest flexibility. I think it is okay for agents and components of agents to both be called components, and there is validity in setting a start time for every component.

As for semantic conventions for the statuses: while the Collector component statuses are intended for the Collector specifically, they should apply well to any networked system. At a minimum, we could likely establish a similar set of statuses that could be generalized to most agents. Making the `status` field a string should be sufficient for any agents with non-conformant statuses to report them and have them handled by the server as agent-specific.
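For reference, the Collector's component statuses from the status reporting work could seed such a generalized set. A sketch of what the string values might be; treat the exact list as an assumption drawn from the Collector rather than an agreed convention:

```go
// Candidate status strings, mirroring the Collector's component statuses.
// A generalized convention could adopt names like these, while agents with
// non-conformant statuses would still send free-form strings.
const (
	StatusStarting         = "StatusStarting"
	StatusOK               = "StatusOK"
	StatusRecoverableError = "StatusRecoverableError" // transient, expected to recover
	StatusPermanentError   = "StatusPermanentError"   // degraded until operator action
	StatusFatalError       = "StatusFatalError"       // unrecoverable, agent is stopping
	StatusStopping         = "StatusStopping"
	StatusStopped          = "StatusStopped"
)
```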
This makes sense to me. I would imagine for the collector (for the time being) that the start time of its components will be the start time of the collector, but I could see this being useful in a system where components can be restarted.
Yes, this looks better to me. We can probably also rename …. I would also like to understand how the … field is intended to be used.
I can get a PR together to add this message, but I have a couple of questions about the … field. Also, in response to the comment above: …
I would advocate for taking advantage of the protocol's beta status to replace `AgentHealth` with `ComponentHealth` outright. The beta definition linked to in the OpAMP spec seems to indicate it would be permissible to do this; do we have any obligations that would prevent it?
I believe we need to replace `AgentHealth` with `ComponentHealth`. It is backward compatible from the wire perspective, since `ComponentHealth` is a superset of `AgentHealth`. It is not a breaking change even for old Servers: they should be able to receive the new `ComponentHealth` message, ignore the new fields, and interpret the message correctly. Compare:

```proto
message AgentHealth {
    bool healthy = 1;
    fixed64 start_time_unix_nano = 2;
    string last_error = 3;
}
```

with:

```proto
message ComponentHealth {
    bool healthy = 1;
    fixed64 start_time_unix_nano = 2;
    string last_error = 3;
    fixed64 timestamp = 4;
    string status = 5;
    map<string, ComponentHealth> component_health_map = 6;
}
```

It is a breaking change from the symbols perspective (the name of the message changes), but we are allowed to do that in Beta.
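One way to see why this is wire compatible: protobuf parsers skip field numbers they don't recognize, so bytes carrying the new fields parse cleanly against the old message shape. A runnable sketch using Go's `protowire` package, where the hand-rolled encoding and decoding stand in for generated code:

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/encoding/protowire"
)

func main() {
	// Encode a ComponentHealth-shaped payload by hand: field 2 (start time),
	// field 3 (last_error), and field 5 (status, unknown to old parsers).
	// Field 1 (healthy=false) is omitted because false is the default.
	var b []byte
	b = protowire.AppendTag(b, 2, protowire.Fixed64Type)
	b = protowire.AppendFixed64(b, 1234)
	b = protowire.AppendTag(b, 3, protowire.BytesType)
	b = protowire.AppendBytes(b, []byte("Resource Exhausted"))
	b = protowire.AppendTag(b, 5, protowire.BytesType)
	b = protowire.AppendBytes(b, []byte("StatusRecoverableError"))

	// An old AgentHealth parser walks the same bytes and simply skips
	// field numbers it doesn't know (here, field 5).
	for len(b) > 0 {
		num, typ, n := protowire.ConsumeTag(b)
		if n < 0 {
			panic("malformed tag")
		}
		b = b[n:]
		switch {
		case num == 2 && typ == protowire.Fixed64Type:
			v, m := protowire.ConsumeFixed64(b)
			b = b[m:]
			fmt.Println("start_time_unix_nano:", v)
		case num == 3 && typ == protowire.BytesType:
			v, m := protowire.ConsumeBytes(b)
			b = b[m:]
			fmt.Println("last_error:", string(v))
		default: // unknown field: skip its value
			m := protowire.ConsumeFieldValue(num, typ, b)
			b = b[m:]
		}
	}
}
```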
@mwear is there anything else to do here, or can we close the issue since the PR is merged?
The work has been addressed by #168. I'll go ahead and close this issue.