-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow the otel collector to live without a server to manage it #33275
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good to me, thanks for adding this and sorry for the review delay.
Could you add an enhancement changelog entry for this with make chlog-new
?
cmd/opampsupervisor/e2e_test.go
Outdated
s := newSupervisor(t, "accepts_conn", map[string]string{"url": initialServer.addr}) | ||
defer s.Shutdown() | ||
|
||
time.Sleep(11 * time.Second) // We wait until the supervisor gives up on connecting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not the cleanest solution, but could we make the timeout a private field on the Supervisor and update it for this test? Adding 10 seconds to the execution times for the test will be a bit rough going forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a clockimpl attribute. Maybe too bold? : ) - we could not change a private field that we change because we initialize the supervisor and trigger the timer in NewSupervisor
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, missed this. I see the issue now, since Supervisor
is in a separate package, we don't have access to private fields.
Could we then add this to the OpAMPServer
config object as something like initial_connection_timeout
with a default of 10 seconds? That way we can set it from the config file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the point at which the timer is triggered shouldn't be much of an issue, but if it is there should be at least be a way to delay this test to wait for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we add a configuration parameter, we're actually changing a user-facing feature of the supervisor, and we risk users depending on it. At the moment, this is a test-related capability and it might be better to keep it that way?
We could add a parameter to NewSupervisor
that is similar to initial_connection_timeout
instead. WDYT? A ClockImpl has more consequences, but it's a more general solution for other tests...
My personal opinion is that:
- A parameter to
NewSupervisor
is preferrable from a config parameter - A ClockImpl or an initial_connection_timeout are... about equally preferrable.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we add a configuration parameter, we're actually changing a user-facing feature of the supervisor, and we risk users depending on it. At the moment, this is a test-related capability and it might be better to keep it that way?
I agree we shouldn't expose test-related configuration options to users, thanks for calling that out. My intention was actually for this to be a real option available to users. I think it makes sense that someone may want to tune the amount of time the Supervisor waits before it starts a Collector when it's unable to connect to the OpAMP server; you may want to fail fast and get something going, or may want to wait longer because network conditions can be unreliable. Since we're adding the ability for the Collector to be started if the server can't be reached, I think this PR would also be a decent place to add that option. Do you think that's reasonable?
Co-authored-by: Evan Bradley <11745660+evan-bradley@users.noreply.github.com>
@evan-bradley PTAL : D |
defer s.Shutdown() | ||
time.Sleep(2 * time.Second) // We wait until the supervisor gives up on connecting | ||
require.False(t, connectedToServer.Load(), "Collector connected to server before server was started") | ||
require.True(t, s.GetAgentDescription() != nil, "Agent description was not received, so agent may not have started.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of exposing a public method, could we check that the Collector reports as healthy? You can see an example in TestSupervisorRestartCommand
. It's a bit more code, but will allow us to forgo the additional method and I think is a bit more of a direct test that the Collector is live.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code in that test relies on our own opamp server receiving a connection from the agent. in our test, we don't have a server initially, right? The only server is implemented inside the supervisor. That's why I added the public method - to expose state from the internal server implemented by the supervisor.
I agree that a new public method feels like too much - but I don't know another reasonable way to get that internal state. Thoughts?
on the other hand, the Supervisor component is not meant to be a library, so we could have a lower bar for allowing changes to its public api.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's right, we don't have a connection to our server here like we do in the other tests.
The alternative I would see here is that we start the Collector with a config where we can contact it. Could we start a Collector with a basic pipeline to verify it starts then? If we take advantage of the Collector's persistent state functionality, we could generate a config like in TestSupervisorStartsCollectorWithRemoteConfig
that reads from input/output files to verify it starts. I think this would also help verify what the Collector is running when it starts without the server; it's not 100% clear to me right now based on the current test what the Collector is running when it is started. What do you think?
I agree that a new public method feels like too much - but I don't know another reasonable way to get that internal state. Thoughts?
on the other hand, the Supervisor component is not meant to be a library, so we could have a lower bar for allowing changes to its public api.
Long-term the Supervisor is intended to be a library, but I think it's okay to bend the rules a little while it's in development. If there's some blocking issue with the approach above I would be alright doing this for now and coming back to it later.
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
if connErr := s.waitForOpAMPConnection(); connErr != nil { | ||
return nil, fmt.Errorf("failed to connect to the OpAMP server: %w", connErr) | ||
logger.Debug("failed to connect to the OpAMP server.", zap.Error(connErr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we even need waitForOpAMPConnection
anymore? It looks like we don't do anything special other than potentially wait 10 seconds. It seems like we could just remove this function entirely and have the same result.
@pabloem are you still working on this? This fix is great, would love to get this in. |
Since there hasn't been any progress on this in over a month, do you want to put up another PR @BinaryFissionGames? |
Yeah, I'd be happy to! |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
@atoulme @tigrannajaryan Can someone please review 🙏 |
…vailable (#34159) **Description:** <Describe what has changed.> * If the OpAMP server can't be contacted, the supervisor should still be run * This PR also fixes #33799 (as it removes the channel that is blocked on, prevent the reconnect) This PR supercedes #33275 **Link to tracking Issue:** Fixes #33408, #33799 **Testing:** * Added an e2e test for the behavior * Manually tested against BindPlane OP
Closing in favor of #34159 |
Description: An OpAMP supervisor+agent need to be able to work without an opamp server to manage them.
Testing: Added passing e2e test. All other e2e tests continue to pass.
Documentation: None added so far.