Automated status reporting via typed errors #8709
Closed
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description:
This PR is an alternative to #8684. It's currently a rough draft to get feedback on whether or not this is a good direction to pursue.
This PR adds default status reporting by wrapping the
Start
,Stop
, andConsume*
methods of connectors, exporters, processors, and receivers. The wrappers intercept the errors returned from the underlying methods and report status based on the error type. To give users more control over automated status reporting, it introducescomponenterror
types that match the error status types, however users should rarely need to use them. By default, if an the returned error is nil, StatusOK will be reported. ForStart
andStop
if the error is not acomponenterror
, it will be assumed to be permanent, and StatusPermanentError will be reported. ForConsume*
if an error is not a component error, it will be assumed to be recoverable, and StatusRecoverableError will be reported. Thecomponenterror
types will only be needed to override the default assumptions.This PR also adds support for automatically reporting
StatusOK
, onStart
. Previously this was left to the user due to complications withStart
methods that report status asynchronously (e.g. from a go routine). We can report status for components with synchronousStart
methods, but we handle the async case, likely by allowing them to opt-out of automatic status reporting. The opt-out mechanism is not yet been implemented. One other change aroundStart
is that a component can return an error of typecomponenterror.Recoverable
that will signal a problem starting, but won't halt the collector from starting (which is what it currently does if any error is returned).The list below outlines how automated status reporting will work for each of the wrapped methods.
StatusOK
componenterror
report the corresponding status for the error typecomponenterror
reportStatusRecoverableError
StatusOK
componenterror.RecoverableError
, continue starting other components and reportStatusRecoverableError
componenterror.Permament
orcomponenterror.Fatal
reportStatusPermanentError
orStatusFatalError
respectivelycomponenterror
, reportStatusPermanentError
(this will preserve the current behavior and halt startup)componenterror
report the corresponding statuscomponenterror
, reportStatusPermanentError
From within one of these well defined methods, returning different
componenterror
types can control the automated status reporting, however, this should rarely be needed. From outside these methods, theTelemetrySettings.ReportComponentStatus
API can be used directly (which will be the majority use case).Edit:
I would prefer not to introduce the
componenterror
types as they complicate status reporting to some degree, by giving users two ways to report status. It would be ideal only use the API. If we wanted to assume that errors return fromConsume*
are recoverable, and permanent fromStart
andShutdown
then we can avoid introducing thecomponenterror
types. Components can use the API to mark errors apermanent
orfatal
forConsume*
, but they would not be able to make an error as recoverable fromStart
. While being able to have a recoverable error from start is a nice to have, it's not a capability that we have to day.My concern about
Start
with components that have async start methods may not really play out in the real world. Here is an example of an async start from the zipkin receiver:In this example returning nil will cause automatic status reporting to report
StatusOK
. If the server fails to start it will reportStatusFatalError
using the API. The order of these calls will be indeterminate, however, the end result will end up being the same. It could happen thatStatusOK
is reported first, followed byStatusFatalError
, or it could happen thatStatusFatalError
is reported first, which would make reportingStatusOK
effectively a noop (because you can not transition state fromStatusFatalError
toStatusOK
). In both cases the end status of the component would beStatusFatalError
. The same would be true if the component reportedStatusPermanentError
.If, hypothetically, a component wanted to report
StatusRecoverableError
from within a go routine onStart
there would be a problem, as the recoverable error could be cleared by reportingStatusOK
. For this reason, I think we probably need an escape hatch of sorts, but I don't know that this problem actually exists in any of our code today.Link to tracking Issue: #7682
Testing: < Describe what testing was performed and which tests were added.>
Documentation: < Describe the documentation added.>