[cmd/opampsupervisor]: Supervisor waits for configurable healthchecks to report remote config status #34907

srikanthccv · 2024-08-28T13:53:22Z

Description:

This pull request addresses the remote config status reporting issue discussed in #21079 by introducing the following option to the Agent config:

config_apply_timeout: config update is successful if we receive a healthy status and then observe no failure updates for the entire duration of the timeout period; otherwise, failure is reported.

Link to tracking Issue: #21079

Testing: Added e2e test

Documentation:

srikanthccv · 2024-10-16T06:43:51Z

Looking into unit-test failures

evan-bradley · 2024-10-16T18:37:08Z

@srikanthccv With #35488 we're going to have the ability for the Collector to report its own health through the OpAMP extension. Do you think we could use this to get rid of the successful_health_checks option and rely on config_apply_timeout for success/failure reporting? Something like: if a success is reported, wait for the duration set in config_apply_timeout before reporting success to the server in case that status changes in that time. If a failure is reported, immediately report a failure, and if no status is reported by the end of the timeout period, report a failure.

srikanthccv · 2024-10-17T14:57:20Z

I have a question. Let's consider a scenario involving asynchronous errors resulting from a configuration update. The supervisor has applied the new effective configuration and restarted the agent. The collector initially starts without issues, but shortly after, some component reports an asynchronous error, which causes the collector to shut down. In this case, should we categorize this as a successful or failed configuration update? Would the extension not report any status in this situation?

evan-bradley · 2024-10-17T15:56:05Z

> In this case, should we categorize this as a successful or failed configuration update?

I think this depends on the timeout duration. If the error occurs within the timeout, it would be a failed configuration update. Otherwise, we would have already reported that the config was successfully updated, and we would report the issue separately, since we don't directly know whether it was related to the config update or not.

> Would the extension not report any status in this situation?

The healthcheck extension and OpAMP extension use the same health reporting mechanisms. Currently, both only report errors that cause the Collector to shut down.

srikanthccv · 2024-10-17T22:12:00Z

I went through the service and collector initialization code for a better understanding. IIUC, in the case of an asynchronous error, the extension would initially report a healthy status, but then quickly switch to an unhealthy status during the shutdown process. Can we fully rely on the initial healthy status report? If we do, we might incorrectly flag an update as successful when it's actually failing. What do you think?

srikanthccv · 2024-10-17T22:30:44Z

We should be able to get rid of successful_health_checks. We would consider an update successful only if we receive a healthy status and then observe no failure updates for the entire duration of the timeout period?

evan-bradley · 2024-10-17T22:39:12Z

> We would consider an update successful only if we receive a healthy status and then observe no failure updates for the entire duration of the timeout period?

That was my thought as well. I think we can mitigate the case you outlined if we do this.

srikanthccv · 2024-10-17T22:48:27Z

Great, I'll proceed with implementing the change.

djaglowski · 2024-10-31T14:49:15Z

Please resolve conflicts and we'll get this merged

[cmd/opampsupervisor]: Supervisor waits for configurable healthchecks…

697f96e

… to report remote config status

srikanthccv requested review from evan-bradley, atoulme and tigrannajaryan as code owners August 28, 2024 13:53

srikanthccv requested a review from a team August 28, 2024 13:53

Merge branch 'main' into issue_21079

9259bf6

github-actions bot assigned crobert-1 Aug 28, 2024

github-actions bot added the cmd/opampsupervisor label Aug 28, 2024

github-actions bot requested a review from BinaryFissionGames August 28, 2024 13:53

BinaryFissionGames reviewed Aug 28, 2024

cmd/opampsupervisor/supervisor/config/config.go (outdated, resolved)

cmd/opampsupervisor/supervisor/supervisor.go (outdated, resolved)

cmd/opampsupervisor/e2e_test.go (resolved)

srikanthccv added 6 commits August 31, 2024 02:38

Update agent config validation

a74ef23

Merge branch 'issue_21079' of github.com:srikanthccv/opentelemetry-co…

9f83104

…llector-contrib into issue_21079

Review comments

136b2b6

Frequent checks for subsequent asserts

8e85f7b

Resolve conflicts

3cabc23

Merge branch 'main' into issue_21079

8208ca2

srikanthccv requested a review from BinaryFissionGames September 11, 2024 15:48

BinaryFissionGames reviewed Sep 17, 2024

cmd/opampsupervisor/supervisor/supervisor.go (resolved)

cmd/opampsupervisor/supervisor/config/config.go (outdated, resolved)

cmd/opampsupervisor/e2e_test.go (resolved)

BinaryFissionGames mentioned this pull request Oct 1, 2024

[cmd/opampsupervisor]: Implement PackagesAvailable for upgrading agent #35503

Open

srikanthccv added 4 commits October 8, 2024 19:51

resolve conflicts

fd57ea9

resolve conflicts again

88da95f

Fix tests

04261f3

Merge branch 'issue_21079' of github.com:srikanthccv/opentelemetry-co…

6c7f617

…llector-contrib into issue_21079

srikanthccv requested a review from a team as a code owner October 8, 2024 16:06

Merge branch 'main' into issue_21079

dedd6a0

srikanthccv requested a review from BinaryFissionGames October 8, 2024 16:10

BinaryFissionGames suggested changes Oct 16, 2024

srikanthccv added 3 commits October 16, 2024 11:02

Merge branch 'main' into issue_21079

169d25f

Fix tests

8d56b88

Merge branch 'issue_21079' of github.com:srikanthccv/opentelemetry-co…

6752571

…llector-contrib into issue_21079

srikanthccv added 2 commits October 16, 2024 11:58

Remove unnecessary check

a149e5f

Add CHANGELOG entry

873c072

srikanthccv added 5 commits October 18, 2024 04:19

Merge branch 'main' into issue_21079

b9b4d20

Merge branch 'main' into issue_21079

38016ff

Use agent health from opamp extension for config status report

e317745

Merge branch 'issue_21079' of github.com:srikanthccv/opentelemetry-co…

fde058d

…llector-contrib into issue_21079

go mod tidy

165f1a2

djaglowski approved these changes Oct 22, 2024

evan-bradley reviewed Oct 29, 2024

cmd/opampsupervisor/supervisor/config/config.go (outdated, resolved)

srikanthccv added 5 commits October 31, 2024 01:29

Merge branch 'main' into issue_21079

b7dbd63

Remove health check interval option

2e0908d

Merge branch 'issue_21079' of github.com:srikanthccv/opentelemetry-co…

3a61771

…llector-contrib into issue_21079

Update config_test

4c43fed

Remove removed interval refs

89f143e

evan-bradley approved these changes Oct 31, 2024

Resolve conflicts

6e3f678

djaglowski merged commit 57caf5f into open-telemetry:main Nov 5, 2024
158 checks passed

github-actions bot added this to the next release milestone Nov 5, 2024

srikanthccv deleted the issue_21079 branch November 5, 2024 14:56
