Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cmd/opampsupervisor]: Supervisor waits for configurable healthchecks to report remote config status #34907

Merged
merged 29 commits into from
Nov 5, 2024

Conversation

srikanthccv
Copy link
Member

@srikanthccv srikanthccv commented Aug 28, 2024

Description:

This pull request addresses the remote config status reporting issue discussed in #21079 by introducing the following options to the Agent config:

  1. config_apply_timeout: config update is successful if we receive a healthy status and then observe no failure updates for the entire duration of the timeout period; otherwise, failure is reported.

Link to tracking Issue: #21079

Testing: Added e2e test

Documentation:

@srikanthccv srikanthccv requested a review from a team as a code owner October 8, 2024 16:06
cmd/opampsupervisor/supervisor/config/config_test.go Outdated Show resolved Hide resolved
cmd/opampsupervisor/supervisor/config/config.go Outdated Show resolved Hide resolved
cmd/opampsupervisor/e2e_test.go Outdated Show resolved Hide resolved
cmd/opampsupervisor/supervisor/supervisor.go Outdated Show resolved Hide resolved
@srikanthccv
Copy link
Member Author

Looking into unit-test failures

@evan-bradley
Copy link
Contributor

evan-bradley commented Oct 16, 2024

@srikanthccv With #35488 we're going to have the ability for the Collector to report its own health through the OpAMP extension. Do you think we could use this to get rid of the successful_health_checks option and rely on config_apply_timeout for success/failure reporting? Something like: if a success is reported, wait for the duration set in config_apply_timeout before reporting success to the server in case that status changes in that time. If a failure is reported, immediately report a failure, and if no status is reported by the end of the timeout period, report a failure.

@srikanthccv
Copy link
Member Author

I have a question. Let's consider a scenario involving asynchronous errors resulting from a configuration update. The supervisor has applied the new effective configuration and restarted the agent. The collector initially starts without issues, but shortly after, some component reports an asynchronous error, which causes the collector to shut down. In this case, should we categorize this as a successful or failed configuration update? would extension not report any status in this situation?

@evan-bradley
Copy link
Contributor

In this case, should we categorize this as a successful or failed configuration update?

I think this depends on the timeout duration. If it's within the timeout, it would be a failed configuration update, otherwise we would have already that the config was successfully updated and report the issue separately since we don't directly know whether it was related to the config update or not.

would extension not report any status in this situation?

The healthcheck extension and OpAMP extension use the same health reporting mechanisms. Currently both only report errors that cause the Collector to be shutdown.

@srikanthccv
Copy link
Member Author

srikanthccv commented Oct 17, 2024

Went through the service and collector initialization code for a better understanding. IIUC, In the case of an asynchronous error, the extension would initially report a healthy status, but then quickly switch to an unhealthy status during the shutdown process. Can we fully rely on the initial healthy status report? If we do, we might incorrectly flag an update as successful when it's actually failing. What do you think?

@srikanthccv
Copy link
Member Author

We should be able to get rid of the successful_health_checks. We would consider an update successful only if we receive a healthy status and then observe no failure updates for the entire duration of the timeout period?

@evan-bradley
Copy link
Contributor

We would consider an update successful only if we receive a healthy status and then observe no failure updates for the entire duration of the timeout period?

That was my thought as well. I think we can mitigate the case you outlined if we do this.

@srikanthccv
Copy link
Member Author

Great, I'll proceed with implementing the change.

@djaglowski
Copy link
Member

Please resolve conflicts and we'll get this merged

@djaglowski djaglowski merged commit 57caf5f into open-telemetry:main Nov 5, 2024
158 checks passed
@github-actions github-actions bot added this to the next release milestone Nov 5, 2024
@srikanthccv srikanthccv deleted the issue_21079 branch November 5, 2024 14:56
michael-burt pushed a commit to michael-burt/opentelemetry-collector-contrib that referenced this pull request Nov 7, 2024
… to report remote config status (open-telemetry#34907)

**Description:** 

This pull request addresses the remote config status reporting issue
discussed in open-telemetry#21079 by introducing the following options to the Agent
config:

1. `config_apply_timeout`: config update is successful if we receive a
healthy status and then observe no failure updates for the entire
duration of the timeout period; otherwise, failure is reported.

**Link to tracking Issue:** open-telemetry#21079

**Testing:** Added e2e test

**Documentation:** <Describe the documentation added.>
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this pull request Dec 17, 2024
… to report remote config status (open-telemetry#34907)

**Description:** 

This pull request addresses the remote config status reporting issue
discussed in open-telemetry#21079 by introducing the following options to the Agent
config:

1. `config_apply_timeout`: config update is successful if we receive a
healthy status and then observe no failure updates for the entire
duration of the timeout period; otherwise, failure is reported.

**Link to tracking Issue:** open-telemetry#21079

**Testing:** Added e2e test

**Documentation:** <Describe the documentation added.>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants