Component status reporting #7682

djaglowski · 2023-05-16T15:00:25Z

Is your feature request related to a problem? Please describe.
We should formalize the notion of component status, so that the state/health of each component in the collector may be understood independently.

Describe the solution you'd like
#6560 is a good starting point.

Beyond establishing a system for publishing and subscribing to component status, we should enumerate the possible states and define a finite state diagram that governs when components transition from one state to another.

seankhliao · 2023-05-28T21:46:39Z

I feel like reporting the resolved/hydrated config should also be part of the status it reports (for zpages / #5223)

mwear · 2023-07-12T15:20:38Z

I can work on this.

@tigrannajaryan

This PR introduces component status reporting. There have been several previous attempts to introduce this functionality, the most recent being #6560. This PR was originally based on #6560, but has evolved based on the feedback received, along with some additional enhancements to improve the ease of use of the `ReportComponentStatus` API.

In earlier discussions (see #8169 (comment)) we decided to model status as a finite state machine with the following statuses: `Starting`, `OK`, `RecoverableError`, `PermanentError`, `FatalError`, `Stopping`, and `Stopped`. A benefit of this design is that `StatusWatcher`s will be notified on changes in status rather than on potentially repetitive reports of the same status.

With the additional statuses and modeling them using a finite state machine, there are more statuses to report. Rather than having each component be responsible for reporting all of the statuses, I automated status reporting where possible. A component's status will automatically be set to `Starting` at startup. If the component's `Start` returns an error, the status will automatically be set to `PermanentError`. A component is expected to report `StatusOK` once it has successfully started, and from there can report changes in status as it runs. It will likely be a common scenario for components to transition between `StatusOK` and `StatusRecoverableError` during their lifetime. In extenuating circumstances they can transition into the terminal states of `PermanentError` and `FatalError` (where a fatal error initiates collector shutdown).

Additionally, during component `Shutdown`, statuses are automatically reported where possible. A component's status is set to `Stopping` when `Shutdown` is initially called. If `Shutdown` returns an error, the status will be set to `PermanentError`; if it does not return an error, the status is set to `Stopped`.

In #6560 `ReportComponentStatus` was implemented on the `Host` interface.
I found that few components use the `Host` interface, and none of them save a handle to it (to be used outside of the `Start` method). I found that many components keep a handle to the `TelemetrySettings` that they are initialized with, and this seemed like a more natural, convenient place for the `ReportComponentStatus` API. I'm ultimately flexible on where this method resides, but feel that `TelemetrySettings` is a more user-friendly place for it.

Regardless of where the `ReportComponentStatus` method resides (`Host` or `TelemetrySettings`), there is a difference in the method signature for the API based on whether it is used from the service or from a component. As the service is not bound to a specific component, it needs to take the `instanceID` of a component as a parameter, whereas the component version of the method already knows the `instanceID`. In #6560 this led to having both `component.Host` and `servicehost.Host` versions of the `Host` interface, to be used at the component or service levels. In this version, we have the same for `TelemetrySettings`: there is a `component.TelemetrySettings` and a `servicetelemetry.Settings`, with the only difference being the method signature of `ReportComponentStatus`.

Lastly, this PR sets up the machinery for reporting component status, and allows extensions to be `StatusWatcher`s, but it does not introduce any `StatusWatcher`s. We expect the OpAMP extension to be a `StatusWatcher` and use data from this system as part of its AgentHealth message (the message is currently being extended to accommodate more component-level details). We also expect there to be a non-OpAMP `StatusWatcher` implementation, likely via the HealthCheck extension (or something similar).
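
The difference between the service-level and component-level signatures described above can be sketched roughly as follows. This is only an illustration: `Status`, `InstanceID`, `ServiceReporter`, `ComponentReporter`, `recorder`, and `bind` are hypothetical names and simplified types, not the collector's actual API.

```go
package main

import "fmt"

// Status and InstanceID are simplified, hypothetical stand-ins for the
// collector's real types.
type Status string
type InstanceID string

// ServiceReporter is the service-level shape: the service is not bound to a
// single component, so the caller must name the instance.
type ServiceReporter interface {
	ReportComponentStatus(id InstanceID, s Status)
}

// ComponentReporter is the component-level shape: the instance is implied.
type ComponentReporter interface {
	ReportComponentStatus(s Status)
}

// recorder is a toy service-level implementation that logs each report.
type recorder struct{ events []string }

func (r *recorder) ReportComponentStatus(id InstanceID, s Status) {
	r.events = append(r.events, fmt.Sprintf("%s=%s", id, s))
}

// boundReporter adapts the service-level method to the component-level one
// by capturing the instance ID once, when the component is created.
type boundReporter struct {
	id  InstanceID
	svc ServiceReporter
}

func (b boundReporter) ReportComponentStatus(s Status) {
	b.svc.ReportComponentStatus(b.id, s)
}

// bind returns a component-level reporter for one instance (assumed helper).
func bind(svc ServiceReporter, id InstanceID) ComponentReporter {
	return boundReporter{id: id, svc: svc}
}

func main() {
	rec := &recorder{}
	comp := bind(rec, "receiver/example")
	comp.ReportComponentStatus("OK") // no instance ID needed here
	fmt.Println(rec.events)
}
```

The point of the adapter is that a component never has to know or pass its own instance ID; the service supplies it once when handing the reporter to the component.
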

**Link to tracking Issue:** #7682

cc: @tigrannajaryan @djaglowski @evan-bradley

---------

Co-authored-by: Tigran Najaryan <tnajaryan@splunk.com>
Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>
Co-authored-by: Evan Bradley <11745660+evan-bradley@users.noreply.github.com>
Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com>
Co-authored-by: Alex Boten <aboten@lightstep.com>
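
For readers following along, here is a minimal sketch of how the finite state machine from the PR description above might look. The transition table is an assumption derived from the prose (it treats `PermanentError`, `FatalError`, and `Stopped` as terminal); the names `fsm`, `transitions`, `StatusNone`, and `errNoChange` are hypothetical, and the sketch is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

type Status int

const (
	StatusNone Status = iota // hypothetical "nothing reported yet" state
	StatusStarting
	StatusOK
	StatusRecoverableError
	StatusPermanentError
	StatusFatalError
	StatusStopping
	StatusStopped
)

// transitions is an assumed table based on the PR description; the real
// implementation may allow a different set of transitions.
var transitions = map[Status][]Status{
	StatusNone:             {StatusStarting},
	StatusStarting:         {StatusOK, StatusRecoverableError, StatusPermanentError, StatusFatalError, StatusStopping},
	StatusOK:               {StatusRecoverableError, StatusPermanentError, StatusFatalError, StatusStopping},
	StatusRecoverableError: {StatusOK, StatusPermanentError, StatusFatalError, StatusStopping},
	StatusStopping:         {StatusPermanentError, StatusStopped},
	// StatusPermanentError, StatusFatalError, StatusStopped: no outgoing edges.
}

var errNoChange = errors.New("status unchanged")

// fsm holds the current status and notifies onChange only on real changes.
type fsm struct {
	current  Status
	onChange func(Status)
}

// Transition moves to next if the table allows it. A repeated report of the
// current status returns errNoChange and does not notify the watcher.
func (m *fsm) Transition(next Status) error {
	if next == m.current {
		return errNoChange
	}
	for _, allowed := range transitions[m.current] {
		if allowed == next {
			m.current = next
			m.onChange(next)
			return nil
		}
	}
	return fmt.Errorf("invalid transition: %d -> %d", m.current, next)
}

func main() {
	var seen []Status
	m := &fsm{onChange: func(s Status) { seen = append(seen, s) }}
	reports := []Status{StatusStarting, StatusOK, StatusOK, StatusRecoverableError,
		StatusOK, StatusStopping, StatusStopped, StatusOK}
	for _, s := range reports {
		if err := m.Transition(s); err != nil {
			fmt.Println("rejected:", err)
		}
	}
	fmt.Println("watcher notifications:", len(seen))
}
```

Rejecting a repeat of the current status is what gives watchers change notifications rather than a stream of identical reports.
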

This is part of the continued component status reporting effort. Currently we have automated status reporting for the following component lifecycle events: `Starting`, `Stopping`, and `Stopped`, as well as definitive errors that occur in the starting or stopping process (e.g. as determined by an error return value). This leaves the responsibility to the component to report runtime status after start and before stop. We'd like to extend the automatic status reporting to report `StatusOK` if `Start` completes without an error.

One complication with this approach is that some components spawn async work (via goroutines) that, depending on the Go scheduler, can report status before `Start` returns. As such, we cannot assume a nil return value from `Start` means the component has started properly. The solution is to detect whether the component has already reported status when `Start` returns. If it has, we will use the component-reported status and will not automatically report status. If it hasn't, and `Start` returns without an error, we can report `StatusOK`. Any subsequent reports from the component (async or otherwise) will transition the component status accordingly.

The tl;dr is that we cannot control the execution of async code (that's up to the Go scheduler), but we can handle the race, report the status based on the execution, and not clobber status reported from within the component during the startup process. That said, for components with async starts, you may see a `StatusOK` before the component-reported status, or just the component-reported status, depending on the actual execution of the code. In both cases, the end status will be the same.

The work in this PR will allow us to simplify #8684 and #8788 and ultimately choose which direction we want to go for runtime status reporting.

**Link to tracking Issue:** #7682

**Testing:** units / manual

---------

Co-authored-by: Alex Boten <aboten@lightstep.com>

djaglowski assigned mwear Jul 12, 2023

mwear mentioned this issue Aug 2, 2023

Component Status Reporting #8169

Merged

mwear mentioned this issue Aug 17, 2023

Extend AgentHealth message to accommodate component health open-telemetry/opamp-spec#165

Closed

This was referenced Oct 13, 2023

Automatic status reporting for exporterhelper and core exporters #8684

Closed

Automated status reporting via typed errors #8709

Closed

mwear mentioned this issue Nov 1, 2023

Manually report status for core exporters #8788

Closed

mwear mentioned this issue Nov 9, 2023

Automate status reporting on start #8836

Merged
