Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow the otel collector to live without a server to manage it #33275

Closed
wants to merge 4 commits into from

Conversation

pabloem
Copy link
Contributor

@pabloem pabloem commented May 29, 2024

Description: An OpAMP supervisor+agent need to be able to work without an opamp server to manage them.

Testing: Added passing e2e test. All other e2e tests continue to pass.

Documentation: None added so far.

Copy link
Contributor

@evan-bradley evan-bradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me, thanks for adding this and sorry for the review delay.

Could you add an enhancement changelog entry for this with make chlog-new?

s := newSupervisor(t, "accepts_conn", map[string]string{"url": initialServer.addr})
defer s.Shutdown()

time.Sleep(11 * time.Second) // We wait until the supervisor gives up on connecting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the cleanest solution, but could we make the timeout a private field on the Supervisor and update it for this test? Adding 10 seconds to the execution times for the test will be a bit rough going forward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a clockimpl attribute. Maybe too bold? : ) - we could not change a private field that we change because we initialize the supervisor and trigger the timer in NewSupervisor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, missed this. I see the issue now, since Supervisor is in a separate package, we don't have access to private fields.

Could we then add this to the OpAMPServer config object as something like initial_connection_timeout with a default of 10 seconds? That way we can set it from the config file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the point at which the timer is triggered shouldn't be much of an issue, but if it is there should be at least be a way to delay this test to wait for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we add a configuration parameter, we're actually changing a user-facing feature of the supervisor, and we risk users depending on it. At the moment, this is a test-related capability and it might be better to keep it that way?

We could add a parameter to NewSupervisor that is similar to initial_connection_timeout instead. WDYT? A ClockImpl has more consequences, but it's a more general solution for other tests...

My personal opinion is that:

  • A parameter to NewSupervisor is preferrable from a config parameter
  • A ClockImpl or an initial_connection_timeout are... about equally preferrable.

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we add a configuration parameter, we're actually changing a user-facing feature of the supervisor, and we risk users depending on it. At the moment, this is a test-related capability and it might be better to keep it that way?

I agree we shouldn't expose test-related configuration options to users, thanks for calling that out. My intention was actually for this to be a real option available to users. I think it makes sense that someone may want to tune the amount of time the Supervisor waits before it starts a Collector when it's unable to connect to the OpAMP server; you may want to fail fast and get something going, or may want to wait longer because network conditions can be unreliable. Since we're adding the ability for the Collector to be started if the server can't be reached, I think this PR would also be a decent place to add that option. Do you think that's reasonable?

cmd/opampsupervisor/e2e_test.go Show resolved Hide resolved
cmd/opampsupervisor/supervisor/supervisor.go Show resolved Hide resolved
pabloem and others added 2 commits June 6, 2024 10:07
Co-authored-by: Evan Bradley <11745660+evan-bradley@users.noreply.github.com>
@pabloem
Copy link
Contributor Author

pabloem commented Jun 10, 2024

@evan-bradley PTAL : D

cmd/opampsupervisor/supervisor/supervisor.go Show resolved Hide resolved
defer s.Shutdown()
time.Sleep(2 * time.Second) // We wait until the supervisor gives up on connecting
require.False(t, connectedToServer.Load(), "Collector connected to server before server was started")
require.True(t, s.GetAgentDescription() != nil, "Agent description was not received, so agent may not have started.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of exposing a public method, could we check that the Collector reports as healthy? You can see an example in TestSupervisorRestartCommand. It's a bit more code, but will allow us to forgo the additional method and I think is a bit more of a direct test that the Collector is live.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code in that test relies on our own opamp server receiving a connection from the agent. in our test, we don't have a server initially, right? The only server is implemented inside the supervisor. That's why I added the public method - to expose state from the internal server implemented by the supervisor.

I agree that a new public method feels like too much - but I don't know another reasonable way to get that internal state. Thoughts?
on the other hand, the Supervisor component is not meant to be a library, so we could have a lower bar for allowing changes to its public api.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's right, we don't have a connection to our server here like we do in the other tests.

The alternative I would see here is that we start the Collector with a config where we can contact it. Could we start a Collector with a basic pipeline to verify it starts then? If we take advantage of the Collector's persistent state functionality, we could generate a config like in TestSupervisorStartsCollectorWithRemoteConfig that reads from input/output files to verify it starts. I think this would also help verify what the Collector is running when it starts without the server; it's not 100% clear to me right now based on the current test what the Collector is running when it is started. What do you think?

I agree that a new public method feels like too much - but I don't know another reasonable way to get that internal state. Thoughts?
on the other hand, the Supervisor component is not meant to be a library, so we could have a lower bar for allowing changes to its public api.

Long-term the Supervisor is intended to be a library, but I think it's okay to bend the rules a little while it's in development. If there's some blocking issue with the approach above I would be alright doing this for now and coming back to it later.

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

Comment on lines 194 to +195
if connErr := s.waitForOpAMPConnection(); connErr != nil {
return nil, fmt.Errorf("failed to connect to the OpAMP server: %w", connErr)
logger.Debug("failed to connect to the OpAMP server.", zap.Error(connErr))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even need waitForOpAMPConnection anymore? It looks like we don't do anything special other than potentially wait 10 seconds. It seems like we could just remove this function entirely and have the same result.

@BinaryFissionGames
Copy link
Contributor

@pabloem are you still working on this? This fix is great, would love to get this in.

@github-actions github-actions bot removed the Stale label Jul 10, 2024
@djaglowski
Copy link
Member

Since there hasn't been any progress on this in over a month, do you want to put up another PR @BinaryFissionGames?

@BinaryFissionGames
Copy link
Contributor

Yeah, I'd be happy to!

Copy link
Contributor

github-actions bot commented Aug 1, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Aug 1, 2024
@cforce
Copy link

cforce commented Aug 1, 2024

@atoulme @tigrannajaryan Can someone please review 🙏

@BinaryFissionGames
Copy link
Contributor

@cforce
We have a successor to the PR here that you can follow: #34159

evan-bradley pushed a commit that referenced this pull request Aug 1, 2024
…vailable (#34159)

**Description:** <Describe what has changed.>
* If the OpAMP server can't be contacted, the supervisor should still be
run
* This PR also fixes #33799 (as it removes the channel that is blocked
on, prevent the reconnect)

This PR supercedes #33275

**Link to tracking Issue:** Fixes #33408, #33799

**Testing:**
* Added an e2e test for the behavior
* Manually tested against BindPlane OP
@evan-bradley
Copy link
Contributor

Closing in favor of #34159

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants