Allow the otel collector to live without a server to manage it #33275

pabloem · 2024-05-29T01:43:57Z

Description: An OpAMP supervisor and agent need to be able to work without an OpAMP server to manage them.

Testing: Added passing e2e test. All other e2e tests continue to pass.

Documentation: None added so far.

cmd/opampsupervisor/supervisor/supervisor.go

cmd/opampsupervisor/e2e_test.go

cmd/opampsupervisor/supervisor/supervisor.go

evan-bradley

Overall looks good to me, thanks for adding this and sorry for the review delay.

Could you add an enhancement changelog entry for this with make chlog-new?

evan-bradley · 2024-06-05T20:48:46Z

cmd/opampsupervisor/e2e_test.go

+	s := newSupervisor(t, "accepts_conn", map[string]string{"url": initialServer.addr})
+	defer s.Shutdown()
+
+	time.Sleep(11 * time.Second) // We wait until the supervisor gives up on connecting


Not the cleanest solution, but could we make the timeout a private field on the Supervisor and update it for this test? Adding 10 seconds to the execution times for the test will be a bit rough going forward.

I added a clockimpl attribute. Maybe too bold? : ) - we could not simply change a private field, because we initialize the supervisor and trigger the timer in NewSupervisor.

Sorry, missed this. I see the issue now, since Supervisor is in a separate package, we don't have access to private fields.

Could we then add this to the OpAMPServer config object as something like initial_connection_timeout with a default of 10 seconds? That way we can set it from the config file.

I think the point at which the timer is triggered shouldn't be much of an issue, but if it is, there should at least be a way to delay this test to wait for it.
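A minimal sketch of what such an option could look like, assuming a trimmed-down config struct. `OpAMPServerConfig`, `InitialConnectionTimeout`, `EffectiveInitialConnectionTimeout`, and the endpoint URL are illustrative names, not the supervisor's actual types. Only the 10-second default comes from the discussion above.

```go
package main

import (
	"fmt"
	"time"
)

// OpAMPServerConfig is a hypothetical stand-in for the supervisor's server
// config; only the field proposed in this thread is modelled.
type OpAMPServerConfig struct {
	Endpoint string
	// InitialConnectionTimeout would map to initial_connection_timeout in
	// the config file. Zero means "not set".
	InitialConnectionTimeout time.Duration
}

// defaultInitialConnectionTimeout mirrors the 10-second default suggested above.
const defaultInitialConnectionTimeout = 10 * time.Second

// EffectiveInitialConnectionTimeout returns the configured value, falling
// back to the default when the option is unset or not positive.
func (c OpAMPServerConfig) EffectiveInitialConnectionTimeout() time.Duration {
	if c.InitialConnectionTimeout <= 0 {
		return defaultInitialConnectionTimeout
	}
	return c.InitialConnectionTimeout
}

func main() {
	// The endpoint is a placeholder value for illustration only.
	prod := OpAMPServerConfig{Endpoint: "ws://localhost:4320/v1/opamp"}
	test := OpAMPServerConfig{
		Endpoint:                 prod.Endpoint,
		InitialConnectionTimeout: 250 * time.Millisecond,
	}

	fmt.Println("default:", prod.EffectiveInitialConnectionTimeout())
	fmt.Println("test override:", test.EffectiveInitialConnectionTimeout())
}
```

With something along these lines, an e2e test config could set a short timeout instead of sleeping for the full default.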

When we add a configuration parameter, we're actually changing a user-facing feature of the supervisor, and we risk users depending on it. At the moment, this is a test-related capability and it might be better to keep it that way?

We could add a parameter to NewSupervisor that is similar to initial_connection_timeout instead. WDYT? A ClockImpl has more consequences, but it's a more general solution for other tests...

My personal opinion is that:

A parameter to NewSupervisor is preferable to a config parameter

A ClockImpl or an initial_connection_timeout are... about equally preferable.

WDYT?
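To make the clock-injection option concrete, here is a rough sketch under stated assumptions. The `Clock` interface, `realClock`, `instantClock`, and this `waitForConnection` helper are hypothetical and are not taken from the PR's code. The idea is that the wait asks an injected clock for its timer, so a test can supply a clock that fires immediately.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Clock is a hypothetical abstraction over the single time function the
// connection wait needs.
type Clock interface {
	After(d time.Duration) <-chan time.Time
}

// realClock delegates to the standard library.
type realClock struct{}

func (realClock) After(d time.Duration) <-chan time.Time { return time.After(d) }

// instantClock is a test fake whose timers fire immediately, whatever the
// requested duration.
type instantClock struct{}

func (instantClock) After(time.Duration) <-chan time.Time {
	ch := make(chan time.Time, 1)
	ch <- time.Time{}
	return ch
}

var errConnectTimeout = errors.New("timed out waiting for OpAMP connection")

// waitForConnection blocks until connected is closed or the clock's timer fires.
func waitForConnection(clock Clock, connected <-chan struct{}, timeout time.Duration) error {
	select {
	case <-connected:
		return nil
	case <-clock.After(timeout):
		return errConnectTimeout
	}
}

func main() {
	// No server ever connects, but the fake clock means we do not wait 10s.
	neverConnected := make(chan struct{})
	err := waitForConnection(instantClock{}, neverConnected, 10*time.Second)
	fmt.Println("result with fake clock:", err)
}
```

The trade-off raised in the thread still applies: a clock abstraction touches more code than a single timeout parameter, but other timing-dependent tests could reuse it.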

When we add a configuration parameter, we're actually changing a user-facing feature of the supervisor, and we risk users depending on it. At the moment, this is a test-related capability and it might be better to keep it that way?

I agree we shouldn't expose test-related configuration options to users, thanks for calling that out. My intention was actually for this to be a real option available to users. I think it makes sense that someone may want to tune the amount of time the Supervisor waits before it starts a Collector when it's unable to connect to the OpAMP server; you may want to fail fast and get something going, or may want to wait longer because network conditions can be unreliable. Since we're adding the ability for the Collector to be started if the server can't be reached, I think this PR would also be a decent place to add that option. Do you think that's reasonable?

cmd/opampsupervisor/e2e_test.go

cmd/opampsupervisor/supervisor/supervisor.go

Co-authored-by: Evan Bradley <11745660+evan-bradley@users.noreply.github.com>

pabloem · 2024-06-10T14:30:56Z

@evan-bradley PTAL : D

cmd/opampsupervisor/supervisor/supervisor.go

evan-bradley · 2024-06-10T20:23:21Z

cmd/opampsupervisor/e2e_test.go

+	defer s.Shutdown()
+	time.Sleep(2 * time.Second) // We wait until the supervisor gives up on connecting
+	require.False(t, connectedToServer.Load(), "Collector connected to server before server was started")
+	require.True(t, s.GetAgentDescription() != nil, "Agent description was not received, so agent may not have started.")


Instead of exposing a public method, could we check that the Collector reports as healthy? You can see an example in TestSupervisorRestartCommand. It's a bit more code, but will allow us to forgo the additional method and I think is a bit more of a direct test that the Collector is live.

The code in that test relies on our own OpAMP server receiving a connection from the agent. In our test, we don't have a server initially, right? The only server is implemented inside the supervisor. That's why I added the public method - to expose state from the internal server implemented by the supervisor.

I agree that a new public method feels like too much - but I don't know another reasonable way to get that internal state. Thoughts?
On the other hand, the Supervisor component is not meant to be a library, so we could have a lower bar for allowing changes to its public API.

Thanks, that's right, we don't have a connection to our server here like we do in the other tests.

The alternative I would see here is that we start the Collector with a config where we can contact it. Could we start a Collector with a basic pipeline to verify it starts then? If we take advantage of the Collector's persistent state functionality, we could generate a config like in TestSupervisorStartsCollectorWithRemoteConfig that reads from input/output files to verify it starts. I think this would also help verify what the Collector is running when it starts without the server; it's not 100% clear to me right now based on the current test what the Collector is running when it is started. What do you think?

I agree that a new public method feels like too much - but I don't know another reasonable way to get that internal state. Thoughts?
On the other hand, the Supervisor component is not meant to be a library, so we could have a lower bar for allowing changes to its public API.

Long-term the Supervisor is intended to be a library, but I think it's okay to bend the rules a little while it's in development. If there's some blocking issue with the approach above I would be alright doing this for now and coming back to it later.
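One way to check that the Collector is live through its output file, without a fixed sleep, is to poll for the file with a deadline. The sketch below uses hypothetical helpers (`waitUntil`, `fileHasContent`) and a temporary file that stands in for the Collector's output; it does not start a real Collector. In a real test, testify's `require.Eventually` offers the same polling pattern.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// waitUntil polls cond every interval until it returns true or the timeout
// elapses. It reports whether cond became true in time.
func waitUntil(cond func() bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// fileHasContent returns a condition that is true once path exists and is
// non-empty, e.g. once a file exporter has written its first record.
func fileHasContent(path string) func() bool {
	return func() bool {
		info, err := os.Stat(path)
		return err == nil && info.Size() > 0
	}
}

func main() {
	dir, err := os.MkdirTemp("", "collector-out")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	out := filepath.Join(dir, "output.log")

	// Stand-in for a Collector pipeline that writes its output after startup.
	go func() {
		time.Sleep(50 * time.Millisecond)
		_ = os.WriteFile(out, []byte("log line\n"), 0o600)
	}()

	started := waitUntil(fileHasContent(out), 2*time.Second, 10*time.Millisecond)
	fmt.Println("collector output observed:", started)
}
```

Polling returns as soon as the output appears, so the test only pays the full timeout when it fails.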

github-actions · 2024-06-26T05:19:53Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

BinaryFissionGames · 2024-07-09T17:16:40Z

cmd/opampsupervisor/supervisor/supervisor.go

 	if connErr := s.waitForOpAMPConnection(); connErr != nil {
-		return nil, fmt.Errorf("failed to connect to the OpAMP server: %w", connErr)
+		logger.Debug("failed to connect to the OpAMP server.", zap.Error(connErr))


Do we even need waitForOpAMPConnection anymore? It looks like we don't do anything special other than potentially wait 10 seconds. It seems like we could just remove this function entirely and have the same result.
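The behavior change in this hunk can be sketched in isolation. The `supervisor` struct, its fields, and `start` below are simplified stand-ins rather than the real implementation. A failed wait is logged and startup continues, which is also why the reviewer asks whether the wait still earns its place.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errNoConnection = errors.New("no OpAMP connection within timeout")

// supervisor is a simplified stand-in with just enough state to show the flow.
type supervisor struct {
	connected    chan struct{} // closed once the OpAMP server accepts the connection
	timeout      time.Duration
	agentStarted bool
	logf         func(format string, args ...any)
}

// waitForOpAMPConnection blocks until the connection is established or the
// timeout elapses.
func (s *supervisor) waitForOpAMPConnection() error {
	timer := time.NewTimer(s.timeout)
	defer timer.Stop()
	select {
	case <-s.connected:
		return nil
	case <-timer.C:
		return errNoConnection
	}
}

// start mirrors the changed call site: a failed wait is logged, not returned,
// and the agent is started either way.
func (s *supervisor) start() error {
	if err := s.waitForOpAMPConnection(); err != nil {
		s.logf("failed to connect to the OpAMP server: %v", err)
	}
	s.agentStarted = true // stand-in for launching the Collector
	return nil
}

func main() {
	s := &supervisor{
		connected: make(chan struct{}), // never closed: server unreachable
		timeout:   100 * time.Millisecond,
		logf:      func(f string, a ...any) { fmt.Printf(f+"\n", a...) },
	}
	err := s.start()
	fmt.Println("start error:", err, "agent started:", s.agentStarted)
}
```

In this sketch the only remaining effect of the wait is the delay before the agent starts, which matches the reviewer's observation.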

BinaryFissionGames · 2024-07-09T20:12:55Z

@pabloem are you still working on this? This fix is great, would love to get this in.

djaglowski · 2024-07-17T13:38:09Z

Since there hasn't been any progress on this in over a month, do you want to put up another PR @BinaryFissionGames?

BinaryFissionGames · 2024-07-17T13:45:52Z

Yeah, I'd be happy to!

github-actions · 2024-08-01T05:20:13Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

cforce · 2024-08-01T13:23:04Z

@atoulme @tigrannajaryan Can someone please review 🙏

BinaryFissionGames · 2024-08-01T14:08:36Z

@cforce
We have a successor to the PR here that you can follow: #34159

…vailable (#34159)

**Description:**
* If the OpAMP server can't be contacted, the supervisor should still run.
* This PR also fixes #33799 (it removes the channel that was blocked on, which prevented the reconnect).

This PR supersedes #33275.

**Link to tracking Issue:** Fixes #33408, #33799

**Testing:**
* Added an e2e test for the behavior
* Manually tested against BindPlane OP

evan-bradley · 2024-08-01T18:14:58Z

Closing in favor of #34159

Allow the otel collector to live without a server to manage it

9ad2f4b

pabloem requested review from evan-bradley, atoulme and tigrannajaryan as code owners May 29, 2024 01:43

pabloem requested a review from a team May 29, 2024 01:43

github-actions bot assigned songy23 May 29, 2024

github-actions bot added the cmd/opampsupervisor label May 29, 2024

jaronoff97 reviewed May 29, 2024

jaronoff97 reviewed May 29, 2024

Addressing comments on test

b39fc0e

jaronoff97 approved these changes May 29, 2024

evan-bradley reviewed Jun 5, 2024

pabloem and others added 2 commits June 6, 2024 10:07

Update cmd/opampsupervisor/supervisor/supervisor.go

c5be33a

Co-authored-by: Evan Bradley <11745660+evan-bradley@users.noreply.github.com>

Addressing comments

9f59795

evan-bradley reviewed Jun 10, 2024

github-actions bot added the Stale label Jun 26, 2024

cforce mentioned this pull request Jun 26, 2024

supervisor does no retry to connect to opamp server forever #33408

Closed

BinaryFissionGames reviewed Jul 9, 2024

github-actions bot removed the Stale label Jul 10, 2024

BinaryFissionGames mentioned this pull request Jul 18, 2024

[cmd/opampsupervisor]: Don't fail to start if the OpAMP server is unavailable #34159

Merged

github-actions bot added the Stale label Aug 1, 2024

evan-bradley closed this Aug 1, 2024
