
[Agent] Implement a HealthCheck function in every Beat. #17737

Closed
ph opened this issue Apr 15, 2020 · 29 comments · Fixed by #19114
Labels: discuss, Ingest Management:beta1, Team:Services (Deprecated)

Comments

ph (Contributor) commented Apr 15, 2020

After we upgrade a process, we want to be able to tell whether the process is healthy or not. I don't think we need anything really complicated for the first release, like trying to interrogate the outputs.

I think it could be as simple as: the process is started, logging works, the web API is started, and it's ready to execute configuration. Maybe it's even a flag that is set when the Beat has completely started and is about to run the configuration.

I would like to start a conversation with the Services team to define what this could be.

Also, we would like to be able to retrieve that state through our manager implementation.

cc @urso @andresrc

ph added the discuss, Team:Services (Deprecated), and Team:Ingest Management labels Apr 15, 2020
elasticmachine (Collaborator) commented:

Pinging @elastic/ingest-management (Team:Ingest Management)

elasticmachine (Collaborator) commented:

Pinging @elastic/integrations-services (Team:Services)

ph (Contributor, Author) commented Apr 16, 2020

cc @exekias, you might be interested too, concerning how we could define for Metricbeat that the process is healthy.

ph (Contributor, Author) commented Jun 2, 2020

@michalpristas Can you look at defining an interface for the HealthCheck? I think we can create a PR with an interface and a default implementation where we always return "Everything is fine". Let's base it on what we have defined in the contract in https://github.com/elastic/elastic-agent-client/blob/master/elastic-agent-client.proto
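
As a starting point, here is a minimal sketch of what such an interface plus an "everything is fine" default could look like. All names here are placeholders for illustration, not the actual contract from elastic-agent-client.proto:

// HealthStatus is a placeholder enum for the states in the
// elastic-agent-client contract.
type HealthStatus int

const (
    StatusHealthy HealthStatus = iota
    StatusDegraded
    StatusFailed
)

// HealthChecker is a hypothetical interface a Beat would implement to
// report whether it is ready to run its configuration.
type HealthChecker interface {
    HealthCheck() (HealthStatus, string)
}

// defaultHealthChecker is the default implementation: it always
// reports healthy until a Beat provides something smarter.
type defaultHealthChecker struct{}

func (defaultHealthChecker) HealthCheck() (HealthStatus, string) {
    return StatusHealthy, "everything is fine"
}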

michalpristas (Contributor) commented:

I will think about something we can inject into our client, in a similar manner to how handlers are injected into http.

ph (Contributor, Author) commented Jun 4, 2020

@michalpristas Maybe it should also have a Stop() function to gracefully stop the process? cc @blakerouse

michalpristas (Contributor) commented:

@ph Do you mean stopping the health check from being performed, or Stop as in stopping the Beat itself?
What about the Register call returning a fn which removes the registered health check from the list?

blakerouse (Contributor) commented:

@ph @michalpristas #18973 includes graceful stop of the running Beat: beat.Stop is called when Agent requests it to stop. This is also tested with the configurable test application.

At the moment it also includes a basic Configuring -> Healthy transition, but nothing more. We need to determine what it means for each Beat to be:

Healthy, Degraded, or Failed (which will cause it to be killed and restarted).

ph (Contributor, Author) commented Jun 4, 2020

So it appears we do have something in place here.

Healthy, Degraded, or Failed (which will cause it to be killed and restarted).

@blakerouse Yes, exactly. I would like to parallelize that effort across the platform teams. I suppose each Beat might have a different code path to determine that state, so how could we move that responsibility outside of our code base?

michalpristas (Contributor) commented:

What I had in mind is that the Beat itself registers a handful of func() error checks which are called periodically by the client. If no error is returned we're fine; otherwise we set the status and message to the failures of these functions (one or more).
This way we provide the machinery and each Beat defines its own way of determining health.

michalpristas (Contributor) commented:

We could also expose healthcheck.ReportFatal, which would be a sort of panic in the sense that it triggers immediate reporting of the error to Fleet using the healthcheck-checking mechanism and stops the caller using the provided Stop function.
This method would work outside of the periodic loop and should be used only for fatal failures.

urso commented Jun 5, 2020

Once a simple callback is registered, we need to hook this up with outputs, inputs, and modules.

One configuration can lead to multiple go-routines, with some go-routines being fine and others not. Plus, failures can be temporary, so we need some mechanism to change/clear the error state again.

In outputs we can have multiple active workers participating in load-balancing. We can introduce a check in the pipeline such that each worker can report progress like Init, Running, Failing, Reconnect, Fatal. The output's state would become:

  • Healthy: if all workers are in the Init or Running state (Beats connect lazily, so we can't tell if a connection is healthy without an event).
  • Degraded: if at least one worker is in the Failing, Reconnect, or Fatal state.
  • Failed: if all workers are in the Failing, Reconnect, or Fatal state.

The Failing and Fatal states might also report the error.
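
A small Go sketch of that merge rule, with hypothetical WorkerState and OutputStatus types (none of these names come from libbeat):

type WorkerState int

const (
    WorkerInit WorkerState = iota
    WorkerRunning
    WorkerFailing
    WorkerReconnect
    WorkerFatal
)

type OutputStatus int

const (
    OutputHealthy OutputStatus = iota
    OutputDegraded
    OutputFailed
)

// mergeOutputStatus applies the rule above: all workers fine -> Healthy,
// some failing -> Degraded, all failing -> Failed.
func mergeOutputStatus(workers []WorkerState) OutputStatus {
    failing := 0
    for _, w := range workers {
        if w == WorkerFailing || w == WorkerReconnect || w == WorkerFatal {
            failing++
        }
    }
    switch {
    case failing == 0:
        return OutputHealthy
    case failing == len(workers):
        return OutputFailed
    default:
        return OutputDegraded
    }
}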

In Filebeat, per-input state likely needs to be implemented per input type. The reason is that each input configuration can actually spawn multiple go-routines collecting logs. I assume this is still the case with Agent.
Just reporting an error for an input if one worker fails might be a little 'drastic'. This is why an enum of simple states like Healthy, Degraded, Fatal might be better.

For Metricbeat we can implement the state in the 'framework' in a few/most cases, I think. Although we spawn multiple go-routines, it is Metricbeat that keeps track of those. Semantics would be similar to outputs.

Merging 'state' from inputs and outputs would be straightforward.

Failed (which will cause it to be killed and restarted)

There is a high chance that the Beat ends up in this state because of misconfiguration (e.g. the wrong host or credentials are configured in the output, or the system module is configured without running the agent as root). Do we really want to get into a 'loop' where we restart Beats over and over again? Or is 'Failed' reserved only for internal errors, e.g. in spooling to disk or the registry file?

ph (Contributor, Author) commented Jun 5, 2020

There is a high chance that the Beat ends up in this state because of misconfiguration (e.g. the wrong host or credentials are configured in the output, or the system module is configured without running the agent as root)

I think this would go into Degraded, and spooling/registry errors could go into Failed.

blakerouse (Contributor) commented:

Yes, Failed is a very bad state. It should only be set in cases where the Beat should be restarted.

Since Agent is not actively polling for the health status in the new gRPC communication, and it's the elastic-agent-client that pushes status updates as soon as the client changes its status, we should implement this in a way that is responsive (e.g. as soon as communication to an output is lost, the status is set to Degraded).

Since ConfigManager is directly on the type Beat struct, it would probably be good to rename that to Manager, then expose a new Status(management.StatusType, string) on the manager.

That would allow each Beat to just set the status using the Manager of the Beat. Then the FleetConfigManager can push that Status through the elastic-agent-client to Elastic Agent.
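
In code, the proposal could look roughly like the sketch below, with StatusType being the simple enum shown a few comments further down; the real interface in libbeat/management may differ:

// Manager is the proposed rename of ConfigManager, extended with a
// status-reporting hook.
type Manager interface {
    // Status sets the current state of the Beat together with a
    // human-readable message describing why.
    Status(status StatusType, msg string)
}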

ph removed their assignment Jun 5, 2020
ph (Contributor, Author) commented Jun 5, 2020

@blakerouse ++ that would give immediate feedback, I like the idea. It would require that Beats pass around the Manager, but I think that's OK too. WDYT @urso?

urso commented Jun 5, 2020

The manager should be an interface so we can wrap it and merge states from different sub-systems, but this is what I had in mind and originally tried for Filebeat inputs. The most important thing is for StatusType to be simple, as this is where I originally failed when I tried to merge states within one input that spawns multiple harvesters.

ph (Contributor, Author) commented Jun 5, 2020

@urso When you say simple: if we have 1 out of 3 inputs failing, do we assume Degraded for the whole process?

blakerouse (Contributor) commented:

Yeah, the manager is already an interface, management.ConfigManager.

We would just add the Status function to the interface. The StatusType would also be very simple.

type StatusType int

const (
    Starting StatusType = iota
    Healthy
    Degraded
    Failed
    Stopping
)

Using it would be a matter of doing something like

beat.Manager.Status(management.Degraded, "2 of 3 modules healthy; 'ceph' is failing")

michalpristas (Contributor) commented Jun 5, 2020

What I don't like about calling Status directly is that you can overwrite it very easily with concurrent calls. What I had in mind is something like this:
elastic/elastic-agent-client#10

This would not require passing anything around. Wherever a health check is needed, it would just call healthcheck.Register(b.checkFn), which adds it to the list of checks.

The only thing a Beats developer needs to do is determine what the check looks like and register it inside their package.

The elastic-agent-client would start a checking loop in the background when initiated, also providing its Status func as a way of setting the status of the entire Beat based on the result of all health checks. It would use the correct state: either Healthy, Degraded, or Fatal.

To raise a fatal error, a callback would just need to do:
return healthcheck.NewFatal(err)

Edit: as mentioned in a previous comment, this proposal is missing healthcheck.ReportFatal, which would avoid waiting for the loop, trigger the Status change immediately, and stop any further health checking, as the application will be restarted anyway. But in this case too we don't need to carry instances around.
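
A compact sketch of that callback-based flow, in the spirit of the default http mux. The healthcheck package and the loop wiring are hypothetical; only the Register and NewFatal names come from the comments above, and ReportFatal (which would bypass the loop) is omitted:

package healthcheck

import "sync"

// Fatal wraps an error to signal that the Beat should be stopped and
// restarted rather than just marked degraded.
type Fatal struct{ Err error }

func (f Fatal) Error() string { return f.Err.Error() }

// NewFatal marks err as fatal so the checking loop can escalate it.
func NewFatal(err error) error { return Fatal{Err: err} }

var (
    mu     sync.Mutex
    checks []func() error
)

// Register adds a check that the background loop calls periodically,
// similar to how handlers are registered on the default http mux.
func Register(check func() error) {
    mu.Lock()
    defer mu.Unlock()
    checks = append(checks, check)
}

// run executes all registered checks once; the elastic-agent-client
// loop would map no errors to Healthy, some errors to Degraded, and
// any Fatal error to Failed.
func run() []error {
    mu.Lock()
    defer mu.Unlock()
    var errs []error
    for _, check := range checks {
        if err := check(); err != nil {
            errs = append(errs, err)
        }
    }
    return errs
}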

blakerouse (Contributor) commented:

I agree that concurrent calls could overwrite a previous status, but it would be up to the Beat to handle how it implements the updating of that status.

Using healthcheck.Register will result in a delayed status update based on the ticker, which will slow down the feedback loop. I also personally do not like globals; it feels weird to share the healthcheck across multiple clients. If we were to go with the healthcheck.Register path, it would be better to implement it on client.Client in elastic-agent-client instead of using a global.

Doing it in a more generic way with Status() would also allow the status to be printed to the log for each Beat, so Status is not just something used when the Beat runs under Agent. It would also allow management.ConfigManager to handle passing the status to elastic-agent-client when using the Fleet manager.

The beat.Beat is already passed into Beater.Run, so b.Manager will be right there, and it would only need to be passed into the different managers of the inputs and outputs.
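
For illustration, a hedged sketch of how a Fleet-aware manager could both log the status and forward it. statusReporter stands in for whatever elastic-agent-client actually exposes (the real API may differ), and StatusType is the enum from the earlier snippet:

import "log"

// statusReporter is a stand-in for the elastic-agent-client call that
// pushes status updates to Elastic Agent.
type statusReporter interface {
    ReportStatus(status StatusType, msg string) error
}

// fleetManager is a hypothetical ConfigManager implementation used when
// the Beat runs under Fleet.
type fleetManager struct {
    client statusReporter
}

func (m *fleetManager) Status(status StatusType, msg string) {
    // Log the transition so the status is also visible when the Beat
    // is not run under Agent.
    log.Printf("status change: %v: %s", status, msg)
    if m.client != nil {
        if err := m.client.ReportStatus(status, msg); err != nil {
            log.Printf("failed to report status to agent: %v", err)
        }
    }
}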

michalpristas (Contributor) commented:

In general I agree with you; my concern and the idea behind the proposal were mainly about the work we generate for developers.
The more weight we put on developers' shoulders, the longer we will wait for good checks. So I would rather provide a framework which gives them something very cheap, with the ability to use direct calls and self-management if they decide.
Normally I would be against globals, but I like the usability of the http package: it provides a bunch of register calls which are handled by a global muxer behind the scenes, the basic flow works out of the box, and more complex behavior is possible as well if needed.

ph (Contributor, Author) commented Jun 5, 2020

Let's try to move the problem a bit. Let's say we look at it in terms of the Beat's behavior and possible actions. I see the following:

  1. Immediate error feedback: this is mostly related to creating inputs or outputs with configuration.
  2. Transient error: Elasticsearch is unreachable. The Beat could decide to notify us right away or retry a few times.
  3. Fatal: I am trying to start the Beat and the disk spool is not readable. We should bail out.

Now, I think number 3 is actually the more complex one and also the more frequent, because it may or may not be linked to a configuration change. But we should still consider the Beat as being in a Degraded state. This makes me think we should delegate that state-calculation complexity to the process as much as possible, as we do with Endpoint. We should discuss it in our sync. WDYT? cc @ruflin

urso commented Jun 8, 2020

Both described approaches seem to have multiple 'statuses' per Beat (either multiple callbacks to be registered, or a status report with detail strings). At which granularity do we want to report the 'health' status?

ph (Contributor, Author) commented Jun 9, 2020

@urso I don't think we need really high granularity for the first version; I think it's a work in progress.

Concerning the implementation, I've chatted over Zoom with @blakerouse and @michalpristas independently, and I think it shed some light on the implementation.

After thinking about it, I believe that from the Elastic Agent perspective we should keep it simple and let the Beats developers decide how they want to calculate, deduplicate, and track the current state of the Beats.

@blakerouse's proposal to only provide a status function like the following should be enough for v1:

beat.Manager.Status(management.Degraded, "2 of 3 modules healthy; 'ceph' is failing")

If libbeat developers want to have a callback-based workflow, they would be free to implement @michalpristas's proposal on top of it. This would indeed move the complexity of the state calculation, code structure, and decisions out of the Elastic Agent code into the Beats themselves. I think this is a good thing.

To hand this over to the platform team, we can create a PR with a minimal implementation:

  • Give them the states that we support.
  • Write in the method godoc what the behavior is for the different states: restart, nothing, etc.
  • Write in the method godoc examples of problems that should be raised as Degraded, Failed, etc. (see the sketch below).
  • Implement a happy path in the code that returns Healthy.
  • Offer guidance to the platform team on improving state detection and making sure it is propagated correctly.
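
As an illustration of the godoc items above, the states could be documented roughly like this; the constants mirror the enum proposed earlier and the wording is a sketch, not the final documentation:

// StatusType describes the overall state a Beat reports to Elastic Agent.
type StatusType int

const (
    // Starting: the Beat is booting and has not applied its configuration
    // yet; Agent takes no action.
    Starting StatusType = iota
    // Healthy: configuration is applied and inputs/outputs are running.
    Healthy
    // Degraded: the Beat keeps running but part of it is failing, e.g. a
    // misconfigured output or one module out of three erroring; Agent
    // takes no action.
    Degraded
    // Failed: unrecoverable internal error, e.g. the disk spool or the
    // registry file cannot be read; Agent will kill and restart the Beat.
    Failed
    // Stopping: the Beat received a stop request and is shutting down.
    Stopping
)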

ph (Contributor, Author) commented Jun 9, 2020

I also want to add something: by only providing a method, this has the side effect of giving Agent the same expectations of Beats and Endpoint, and it would consolidate the expertise in a single place.

Note, I am not expecting us to have the perfect way of tracking how a Beat is behaving. I see this as a step forward that will get better and better with the refactoring done in Beats, and having this will drive that move further.

urso commented Jun 9, 2020

I'm fine with the approach. With an interface we will merge status messages bottom-up; be prepared to receive a big multiline message :)
For the message we might also consider []string, a Stringer-like interface (to build a tree of messages), or removing it (Beats logs should already have the details). But the latter can be provided by wrapping the "simple" RPC callback with a richer interface if we have to.

michalpristas (Contributor) commented:

sounds like a solid plan 👍

ph (Contributor, Author) commented Jun 10, 2020

@andresrc This issue will be closed for us soon. I've created #19115 as a follow-up for the Services team to refine the implementation, if that's OK with you.

andresrc (Contributor) commented:

great, thanks!
