[Agent] Implement a HealthCheck function in every Beat #17737
Comments
Pinging @elastic/ingest-management (Team:Ingest Management)
Pinging @elastic/integrations-services (Team:Services)
cc @exekias you might be interested too, concerning how we could define for Metricbeat that the process is healthy.
@michalpristas Can you look at defining an interface for the HealthCheck? I think we can create a PR with an interface and a default implementation that always returns "Everything is fine". Let's base it on what we have defined in the contract in https://github.com/elastic/elastic-agent-client/blob/master/elastic-agent-client.proto
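A minimal sketch of what such an interface and default implementation could look like; the `HealthCheck`, `Result`, and `Status` names below are illustrative, and the Status values only loosely mirror the proto contract:

```go
package healthcheck

// Status is a simplified process state, loosely mirroring the states
// defined in elastic-agent-client.proto (the exact values there may differ).
type Status int

const (
	Healthy Status = iota
	Degraded
	Failed
)

// Result is what a single health check reports.
type Result struct {
	Status  Status
	Message string
}

// HealthCheck is implemented by a beat (or one of its components)
// to report its health.
type HealthCheck interface {
	Check() Result
}

// AlwaysHealthy is the default implementation: it always reports that
// everything is fine, until a beat provides something smarter.
type AlwaysHealthy struct{}

func (AlwaysHealthy) Check() Result {
	return Result{Status: Healthy, Message: "Everything is fine"}
}
```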
i will think about something we can inject into our client in a similar manner to how handlers are injected into http
@michalpristas Maybe it should also have
@ph do you mean you can stop the health check from being performed, or stop as in stopping the beat itself?
@ph @michalpristas #18973 includes a graceful stop of the running beat. At the moment it also includes based
So it appears we do have something in place here.
@blakerouse Yes, exactly. I would like to parallelize that effort to the platform teams. I suppose each beat might have a different code path to determine that state, so how could we move that responsibility outside of our code base?
what i had in mind is that the beat itself will register a handful of
we could also expose healthcheck.ReportFatal, which would be a sort of panic in the sense that it will trigger immediate reporting of the error to fleet using the healthcheck mechanism and stop the caller using the provided Stop function.
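Reusing the `Status` values from the sketch above, `ReportFatal` could look roughly like this; the `Reporter` and `stop` hooks are assumed to be wired up by the beat or the elastic-agent client, they are not an existing API:

```go
package healthcheck

// Reporter pushes a status to Fleet through the healthcheck mechanism;
// how it is wired up is left to the elastic-agent client.
type Reporter func(status Status, msg string)

// ReportFatal behaves a bit like panic: it immediately reports the error
// to Fleet and then stops the caller via the provided stop function.
func ReportFatal(report Reporter, stop func(), err error) {
	report(Failed, err.Error())
	stop()
}
```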
Having a simple callback registered, we need to hook this up with outputs, inputs, and modules. One configuration can lead to multiple go-routines, with some go-routines being fine and others not. Plus failures are temporary, so we need some mechanism to change/clear the error state again. In outputs we can have multiple active workers participating in load-balancing. We can introduce a check in the pipeline such that each worker can report progress like
Failing and Fatal states might also report the error. In Filebeat, per-input state likely needs to be implemented per input type. The reason is that each input configuration can actually spawn multiple go-routines collecting logs. I assume this is still the case with agent. For metricbeat we can implement the state in the 'framework' in a few/most cases I think. Although we spawn multiple go-routines, it is metricbeat who keeps track of those. Semantics would be similar to outputs. Merging 'state' from inputs and outputs would be straightforward; a rough sketch of such a merge follows below.
There is a high chance that the Beat ends up in this state because of misconfiguration (e.g. wrong host or credentials are configured in the output, or only the system module is configured without running the agent as root). Do we really want to get into a 'loop' where we restart Beats over and over again? Or is 'Failed' reserved for internal errors only, e.g. in spooling to disk or the registry file?
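A rough sketch of that bottom-up merge, reusing the `Status`/`Result` types from the earlier sketch; this only illustrates the idea and is not existing libbeat code:

```go
package healthcheck

import "sync"

// Merger tracks the status of many go-routines (output workers, inputs,
// metricset fetchers) and reduces them to one status for the whole beat.
type Merger struct {
	mu     sync.Mutex
	states map[string]Result // keyed by worker/input name
}

func NewMerger() *Merger {
	return &Merger{states: make(map[string]Result)}
}

// Set records the state of one unit. Failures are temporary, so calling
// Set again with a Healthy result clears the previous error.
func (m *Merger) Set(name string, r Result) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.states[name] = r
}

// Overall returns the worst status currently reported by any unit,
// prefixed with the name of the unit that reported it.
func (m *Merger) Overall() Result {
	m.mu.Lock()
	defer m.mu.Unlock()
	worst := Result{Status: Healthy, Message: "Everything is fine"}
	for name, r := range m.states {
		if r.Status > worst.Status {
			worst = Result{Status: r.Status, Message: name + ": " + r.Message}
		}
	}
	return worst
}
```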
I think this will go into degraded, and a spooling/registry error could go into Failed.
Yes. Being that Agent is not actively polling for the health status in the new gRPC communication, that would allow each beat to just set the status using the
@blakerouse ++ that would make the feedback immediate, I like the idea. That would require that the beat pass around the
The manager should be an interface so we can wrap it and merge states from different sub-systems, but this is what I had in mind and originally tried for filebeat inputs. The most important is
@urso When you said simple, if we have 1 out of 3 inputs that fails, do we assume Degraded for the whole process?
Yeah, the manager is already an interface. We would just add the
Using it would be a matter of doing something like
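A sketch of what that usage could look like, assuming the method added to the manager is named something like `UpdateStatus(status, message)`; the names, signature, and status constants below are assumptions, not the actual libbeat API:

```go
package example

import "fmt"

// Status and Manager are cut-down stand-ins for libbeat's management
// interface with the proposed status method added.
type Status int

const (
	Running Status = iota
	Degraded
	Failed
)

type Manager interface {
	// UpdateStatus sets the beat's current status plus an optional
	// human-readable message and pushes it to Agent immediately.
	UpdateStatus(status Status, msg string)
}

// Inside an output worker: report a Degraded state as soon as publishing
// starts failing, instead of waiting for Agent to ask.
func onPublishError(mgr Manager, err error) {
	mgr.UpdateStatus(Degraded, fmt.Sprintf("failed to publish events: %v", err))
}

// Once publishing recovers, clear the error state again.
func onPublishRecovered(mgr Manager) {
	mgr.UpdateStatus(Running, "")
}
```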
what i don't like about calling status directly is that you can overwrite it very easily with concurrent calls. what i had in mind is something like this (sketched below): it would not require passing anything around; the only thing a beats developer needs to do is determine what the check looks like and register it inside its package. the elastic agent client would start a checking loop in the background when initiated, also providing its Status func as a way of setting the status of the entire beat based on the result of all healthchecks. it would use the correct state, either healthy, degraded or fatal. to raise a fatal, the callback would just need to do
edit: also as mentioned above in a previous comment, this proposal misses
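Roughly, the registration-based variant could look like the following, reusing the `Status`/`Result` types from the first sketch; all names here are assumptions:

```go
package healthcheck

import (
	"sync"
	"time"
)

// Check is what a beat package registers; the client's background loop
// calls it periodically.
type Check func() Result

var (
	mu     sync.Mutex
	checks []Check
)

// Register is called from inside a beat package (an input, output, module)
// to add its own check; nothing needs to be passed around.
func Register(c Check) {
	mu.Lock()
	defer mu.Unlock()
	checks = append(checks, c)
}

// Run would be started by the elastic-agent client when it is initiated.
// It evaluates all registered checks on an interval and reports the merged
// result through the client's status function.
func Run(interval time.Duration, status func(Status, string), done <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			worst := Result{Status: Healthy, Message: "Everything is fine"}
			mu.Lock()
			for _, c := range checks {
				if r := c(); r.Status > worst.Status {
					worst = r
				}
			}
			mu.Unlock()
			status(worst.Status, worst.Message)
		}
	}
}
```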
I agree that concurrent calls could overwrite a previous status, but that would be up to the beat to handle how they will implement the updating of that status. Using Doing it in a more generic way with The
in general I agree with you; my concern, and the idea behind the proposal, was mainly about the work we generate for developers.
Let's try to move the problem a bit. Let's say that we look at it in terms of the beats' behavior and possible actions. I see the following.
Now, I think number 3 is actually the more complex one and also the more frequent, because it may or may not be linked to a configuration change. But we should still consider the beat as being in a Degraded state. This makes me think we should, as much as possible, delegate that state calculation complexity to the process, as we do with endpoint. We should discuss it in our sync. WDYT? cc @ruflin
Both changes described seem to have multiple 'status' values per Beat (either multiple callbacks to be registered, or a status report with detail strings). At which granularity do we want to report the 'health' status?
@urso I don't think we need really high granularity for the first version; I think it's a work in progress. Concerning the implementation, I've chatted over zoom with @blakerouse and @michalpristas independently, and I think it shed some light on the implementation. After thinking about it, I believe that from the Elastic Agent perspective we should keep it simple and let the beats developers decide how they want to calculate, deduplicate, or track the current state of the beats. @blakerouse's proposal to only provide a status function like the following should be enough for v1.
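A hedged reconstruction of the kind of status function being proposed; the `UpdateStatus` name, the status values, and the no-op default below are assumptions rather than the final API:

```go
package management

// Status values a beat can report to Agent.
type Status int

const (
	Unknown Status = iota
	Healthy
	Degraded
	Failed
)

// StatusReporter is the whole v1 surface: a single status function.
// How the status is computed, deduplicated, or tracked stays entirely
// inside the beat.
type StatusReporter interface {
	UpdateStatus(status Status, msg string)
}

// nilReporter is a do-nothing default so beats that are not running
// under Agent can still call UpdateStatus unconditionally.
type nilReporter struct{}

func (nilReporter) UpdateStatus(Status, string) {}
```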
If libbeat developers want to have a callback-based workflow, they would be free to implement @michalpristas' proposal on top of it. This would indeed move the complexity of the state calculation, code structure, and decisions out of the Elastic Agent code and into the Beats themselves. I think this is a good thing. To hand this over to the platform team we can create a PR with a minimal implementation:
I also want to add something: by only providing a method, this has the side effect of making Agent have the same expectations of beats and endpoint, and would consolidate the expertise in a single place. Note, I am not expecting us to have the perfect way of tracking how a beat is behaving; I see this as a step forward that will become better and better with the refactoring done in beats, and having this will drive that move toward something even better.
I'm fine with the approach. Having an interface, we will merge status messages bottom-up. Be prepared to receive a big multiline message :)
sounds like a solid plan 👍 |
great, thanks! |
After we upgrade a process we want to be able to understand if the process is healthy or not. I don't think for the first release we need something really complicated, like trying to interrogate the outputs.
I think it could be as simple as: the process is started, logging works, the web API is started, and it's ready to execute the configuration. Maybe it's even a flag that is set when the beat has completely started and is about to run the configuration (see the sketch below).
I would like to start having a conversation with the Services team to define what it could be.
Also, we would like to be able to retrieve that state through our manager implementation.
cc @urso @andresrc
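A very rough sketch of that "completely started" flag and how a manager implementation could expose it; the names are assumed and this only illustrates the idea:

```go
package example

import "sync/atomic"

// started is the flag that gets set once the process is up: logging works,
// the web API is listening, and the configuration is about to run.
var started atomic.Bool

// MarkStarted is called by the beat once startup has fully completed.
func MarkStarted() { started.Store(true) }

// Healthy is what the manager implementation would expose to Agent when
// asked whether the process is healthy.
func Healthy() bool { return started.Load() }
```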