
[Agent] Implement a HealthCheck function in every Beat. #17737

Closed
ph opened this issue Apr 15, 2020 · 29 comments · Fixed by #19114
Labels: discuss, Ingest Management:beta1, Team:Services (Deprecated)

Comments

ph (Contributor) commented Apr 15, 2020

After we upgrade a process, we want to be able to tell whether the process is healthy or not. I don't think we need anything really complicated for the first release, like trying to interrogate the outputs.

I think it could be as simple as: the process is started, logging works, the web API is started, and it's ready to execute configuration. Maybe it's even a flag that is set when the Beat has completely started and is about to run the configuration.

I would like to start a conversation with the Services team to define what this could be.

Also, we would like to be able to retrieve that state through our manager implementation.

cc @urso @andresrc

ph added the discuss, Team:Services (Deprecated), and Team:Ingest Management labels Apr 15, 2020
elasticmachine (Collaborator) commented:

Pinging @elastic/ingest-management (Team:Ingest Management)

elasticmachine (Collaborator) commented:

Pinging @elastic/integrations-services (Team:Services)

ph (Contributor, Author) commented Apr 16, 2020

cc @exekias, you might be interested too, concerning how we could define for Metricbeat that the process is healthy.

ph (Contributor, Author) commented Jun 2, 2020

@michalpristas Can you look at defining an interface for the HealthCheck? I think we can create a PR with an interface and a default implementation where we always return "Everything is fine". Let's base it on what we have defined in the contract in https://github.com/elastic/elastic-agent-client/blob/master/elastic-agent-client.proto
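
As a starting point, here is a minimal sketch of what such an interface plus an "everything is fine" default could look like. All names here are placeholders for illustration, not the actual contract from elastic-agent-client.proto:

// HealthStatus is a placeholder enum for the states in the
// elastic-agent-client contract.
type HealthStatus int

const (
    StatusHealthy HealthStatus = iota
    StatusDegraded
    StatusFailed
)

// HealthChecker is a hypothetical interface a Beat would implement to
// report whether it is ready to run its configuration.
type HealthChecker interface {
    HealthCheck() (HealthStatus, string)
}

// defaultHealthChecker is the default implementation: it always
// reports healthy until a Beat provides something smarter.
type defaultHealthChecker struct{}

func (defaultHealthChecker) HealthCheck() (HealthStatus, string) {
    return StatusHealthy, "everything is fine"
}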

michalpristas (Contributor) commented:

I will think about something we can inject into our client, in a similar manner to how handlers are injected into http.

ph (Contributor, Author) commented Jun 4, 2020

@michalpristas Maybe it should also have a Stop() function to gracefully stop the process? cc @blakerouse

michalpristas (Contributor) commented:

@ph Do you mean stopping the health check from being performed, or Stop as in stopping the Beat itself?
What about the Register call returning a fn which removes the registered health check from the list?

blakerouse (Contributor) commented:

@ph @michalpristas #18973 includes graceful stop of the running Beat: beat.Stop is called when Agent requests it to stop. This is also tested with the configurable test application.

At the moment it also includes a basic Configuring -> Healthy transition, but nothing more. We need to determine what it means for each Beat to be:

Healthy, Degraded, or Failed (which will cause it to be killed and restarted).

ph (Contributor, Author) commented Jun 4, 2020

So it appears we do have something in place here.

Healthy, Degraded, or Failed (which will cause it to be killed and restarted).

@blakerouse Yes, exactly. I would like to parallelize that effort across the platform teams. I suppose each Beat might have a different code path to determine that state, so how could we move that responsibility outside of our code base?

michalpristas (Contributor) commented:

What I had in mind is that the Beat itself registers a handful of func() error checks which are called periodically by the client. If no error is returned we're fine; otherwise we set the status and message to the failures of these functions (one or more).
This way we provide the machinery and each Beat defines its own way of determining health.

michalpristas (Contributor) commented:

We could also expose healthcheck.ReportFatal, which would be a sort of panic in the sense that it triggers immediate reporting of the error to Fleet using the healthcheck-checking mechanism and stops the caller using the provided Stop function.
This method would work outside of the periodic loop and should be used only for fatal failures.

urso commented Jun 5, 2020

Once a simple callback is registered, we need to hook this up with outputs, inputs, and modules.

One configuration can lead to multiple go-routines, with some go-routines being fine and others not. Plus, failures can be temporary, so we need some mechanism to change/clear the error state again.

In outputs we can have multiple active workers participating in load-balancing. We can introduce a check in the pipeline such that each worker can report progress like Init, Running, Failing, Reconnect, Fatal. The output's state would become:

  • Healthy: if all workers are in the Init or Running state (Beats connect lazily, so we can't tell if a connection is healthy without an event).
  • Degraded: if at least one worker is in the Failing, Reconnect, or Fatal state.
  • Failed: if all workers are in the Failing, Reconnect, or Fatal state.

The Failing and Fatal states might also report the error.
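
A small Go sketch of that merge rule, with hypothetical WorkerState and OutputStatus types (none of these names come from libbeat):

type WorkerState int

const (
    WorkerInit WorkerState = iota
    WorkerRunning
    WorkerFailing
    WorkerReconnect
    WorkerFatal
)

type OutputStatus int

const (
    OutputHealthy OutputStatus = iota
    OutputDegraded
    OutputFailed
)

// mergeOutputStatus applies the rule above: all workers fine -> Healthy,
// some failing -> Degraded, all failing -> Failed.
func mergeOutputStatus(workers []WorkerState) OutputStatus {
    failing := 0
    for _, w := range workers {
        if w == WorkerFailing || w == WorkerReconnect || w == WorkerFatal {
            failing++
        }
    }
    switch {
    case failing == 0:
        return OutputHealthy
    case failing == len(workers):
        return OutputFailed
    default:
        return OutputDegraded
    }
}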

In Filebeat, per-input state likely needs to be implemented per input type. The reason is that each input configuration can actually spawn multiple go-routines collecting logs. I assume this is still the case with Agent.
Just reporting an error for an input if one worker fails might be a little 'drastic'. This is why an enum of simple states like Healthy, Degraded, Fatal might be better.

For Metricbeat we can implement the state in the 'framework' in a few/most cases, I think. Although we spawn multiple go-routines, it is Metricbeat that keeps track of those. Semantics would be similar to outputs.

Merging 'state' from inputs and outputs would be straightforward.

Failed (which will cause it to be killed and restarted)

There is a high chance that the Beat ends up in this state because of misconfiguration (e.g. the wrong host or credentials are configured in the output, or the system module is configured without running the agent as root). Do we really want to get into a 'loop' where we restart Beats over and over again? Or is 'Failed' reserved only for internal errors, e.g. in spooling to disk or the registry file?

ph (Contributor, Author) commented Jun 5, 2020

There is a high chance that the Beat ends up in this state because of misconfiguration (e.g. the wrong host or credentials are configured in the output, or the system module is configured without running the agent as root)

I think this would go into Degraded, and spooling/registry errors could go into Failed.

blakerouse (Contributor) commented:

Yes, Failed is a very bad state. It should only be set in cases where the Beat should be restarted.

Since Agent is not actively polling for the health status in the new gRPC communication, and it's the elastic-agent-client that pushes status updates as soon as the client changes its status, we should implement this in a way that is responsive (e.g. as soon as communication to an output is lost, the status is set to Degraded).

Since ConfigManager is directly on the type Beat struct, it would probably be good to rename that to Manager, then expose a new Status(management.StatusType, string) on the manager.

That would allow each Beat to just set the status using the Manager of the Beat. Then the FleetConfigManager can push that Status through the elastic-agent-client to Elastic Agent.
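
In code, the proposal could look roughly like the sketch below, with StatusType being the simple enum shown a few comments further down; the real interface in libbeat/management may differ:

// Manager is the proposed rename of ConfigManager, extended with a
// status-reporting hook.
type Manager interface {
    // Status sets the current state of the Beat together with a
    // human-readable message describing why.
    Status(status StatusType, msg string)
}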

ph removed their assignment Jun 5, 2020
ph (Contributor, Author) commented Jun 5, 2020

@blakerouse ++ that would give immediate feedback, I like the idea. It would require that Beats pass around the Manager, but I think that's OK too. WDYT @urso?

urso commented Jun 5, 2020

The manager should be an interface so we can wrap it and merge states from different sub-systems, but this is what I had in mind and originally tried for Filebeat inputs. The most important thing is for StatusType to be simple, as this is where I originally failed when I tried to merge states within one input that spawns multiple harvesters.

ph (Contributor, Author) commented Jun 5, 2020

@urso When you say simple: if we have 1 out of 3 inputs failing, do we assume Degraded for the whole process?

blakerouse (Contributor) commented:

Yeah, the manager is already an interface, management.ConfigManager.

We would just add the Status function to the interface. The StatusType would also be very simple.

type StatusType int

const (
    Starting StatusType = iota
    Healthy
    Degraded
    Failed
    Stopping
)

Using it would be a matter of doing something like

beat.Manager.Status(management.Degraded, "2 of 3 modules healthy; 'ceph' is failing")

michalpristas (Contributor) commented Jun 5, 2020

What I don't like about calling Status directly is that you can overwrite it very easily with concurrent calls. What I had in mind is something like this:
elastic/elastic-agent-client#10

This would not require passing anything around. Wherever a health check is needed, it would just call healthcheck.Register(b.checkFn), which adds it to the list of checks.

The only thing a Beats developer needs to do is determine what the check looks like and register it inside their package.

The elastic-agent-client would start a checking loop in the background when initiated, also providing its Status func as a way of setting the status of the entire Beat based on the result of all health checks. It would use the correct state: either Healthy, Degraded, or Fatal.

To raise a fatal error, a callback would just need to do:
return healthcheck.NewFatal(err)

Edit: as mentioned in a previous comment, this proposal is missing healthcheck.ReportFatal, which would avoid waiting for the loop, trigger the Status change immediately, and stop any further health checking, as the application will be restarted anyway. But in this case too we don't need to carry instances around.
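
A compact sketch of that callback-based flow, in the spirit of the default http mux. The healthcheck package and the loop wiring are hypothetical; only the Register and NewFatal names come from the comments above, and ReportFatal (which would bypass the loop) is omitted:

package healthcheck

import "sync"

// Fatal wraps an error to signal that the Beat should be stopped and
// restarted rather than just marked degraded.
type Fatal struct{ Err error }

func (f Fatal) Error() string { return f.Err.Error() }

// NewFatal marks err as fatal so the checking loop can escalate it.
func NewFatal(err error) error { return Fatal{Err: err} }

var (
    mu     sync.Mutex
    checks []func() error
)

// Register adds a check that the background loop calls periodically,
// similar to how handlers are registered on the default http mux.
func Register(check func() error) {
    mu.Lock()
    defer mu.Unlock()
    checks = append(checks, check)
}

// run executes all registered checks once; the elastic-agent-client
// loop would map no errors to Healthy, some errors to Degraded, and
// any Fatal error to Failed.
func run() []error {
    mu.Lock()
    defer mu.Unlock()
    var errs []error
    for _, check := range checks {
        if err := check(); err != nil {
            errs = append(errs, err)
        }
    }
    return errs
}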

blakerouse (Contributor) commented:

I agree that concurrent calls could overwrite a previous status, but it would be up to the Beat to handle how it implements the updating of that status.

Using healthcheck.Register will result in a delayed status update based on the ticker, which will slow down the feedback loop. I also personally do not like globals; it feels weird to share the healthcheck across multiple clients. If we were to go with the healthcheck.Register path, it would be better to implement it on client.Client in elastic-agent-client instead of using a global.

Doing it in a more generic way with Status() would also allow the status to be printed to the log for each Beat, so Status is not just something used when the Beat runs under Agent. It would also allow management.ConfigManager to handle passing the status to elastic-agent-client when using the Fleet manager.

The beat.Beat is already passed into Beater.Run, so b.Manager will be right there, and it would only need to be passed into the different managers of the inputs and outputs.
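
For illustration, a hedged sketch of how a Fleet-aware manager could both log the status and forward it. statusReporter stands in for whatever elastic-agent-client actually exposes (the real API may differ), and StatusType is the enum from the earlier snippet:

import "log"

// statusReporter is a stand-in for the elastic-agent-client call that
// pushes status updates to Elastic Agent.
type statusReporter interface {
    ReportStatus(status StatusType, msg string) error
}

// fleetManager is a hypothetical ConfigManager implementation used when
// the Beat runs under Fleet.
type fleetManager struct {
    client statusReporter
}

func (m *fleetManager) Status(status StatusType, msg string) {
    // Log the transition so the status is also visible when the Beat
    // is not run under Agent.
    log.Printf("status change: %v: %s", status, msg)
    if m.client != nil {
        if err := m.client.ReportStatus(status, msg); err != nil {
            log.Printf("failed to report status to agent: %v", err)
        }
    }
}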

michalpristas (Contributor) commented:

In general I agree with you; my concern and the idea behind the proposal were mainly about the work we generate for developers.
The more weight we put on developers' shoulders, the longer we will wait for good checks. So I would rather provide a framework which gives them something very cheap, with the ability to use direct calls and self-management if they decide.
Normally I would be against globals, but I like the usability of the http package: it provides a bunch of register calls which are handled by a global muxer behind the scenes, the basic flow works out of the box, and more complex behavior is possible as well if needed.

ph (Contributor, Author) commented Jun 5, 2020

Let's try to move the problem a bit. Let's say we look at it in terms of the Beat's behavior and possible actions. I see the following:

  1. Immediate error feedback: this is mostly related to creating inputs or outputs with configuration.
  2. Transient error: Elasticsearch is unreachable. The Beat could decide to notify us right away or retry a few times.
  3. Fatal: I am trying to start the Beat and the disk spool is not readable. We should bail out.

Now, I think number 3 is actually the more complex one and also the more frequent, because it may or may not be linked to a configuration change. But we should still consider the Beat as being in a Degraded state. This makes me think we should delegate that state-calculation complexity to the process as much as possible, as we do with Endpoint. We should discuss it in our sync. WDYT? cc @ruflin

urso commented Jun 8, 2020

Both described approaches seem to have multiple 'statuses' per Beat (either multiple callbacks to be registered, or a status report with detail strings). At which granularity do we want to report the 'health' status?

ph (Contributor, Author) commented Jun 9, 2020

@urso I don't think we need really high granularity for the first version; I think it's a work in progress.

Concerning the implementation, I've chatted over Zoom with @blakerouse and @michalpristas independently, and I think it shed some light on the implementation.

After thinking about it, I believe that from the Elastic Agent perspective we should keep it simple and let the Beats developers decide how they want to calculate, deduplicate, and track the current state of the Beats.

@blakerouse's proposal to only provide a status function like the following should be enough for v1:

beat.Manager.Status(management.Degraded, "2 of 3 modules healthy; 'ceph' is failing")

If libbeat developers want to have a callback-based workflow, they would be free to implement @michalpristas's proposal on top of it. This would indeed move the complexity of the state calculation, code structure, and decisions out of the Elastic Agent code into the Beats themselves. I think this is a good thing.

To hand this over to the platform team, we can create a PR with a minimal implementation:

  • Give them the states that we support.
  • Write in the method godoc what the behavior is for the different states: restart, nothing, etc.
  • Write in the method godoc examples of problems that should be raised as Degraded, Failed, etc. (see the sketch below).
  • Implement a happy path in the code that returns Healthy.
  • Offer guidance to the platform team on improving state detection and making sure it is propagated correctly.
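
As an illustration of the godoc items above, the states could be documented roughly like this; the constants mirror the enum proposed earlier and the wording is a sketch, not the final documentation:

// StatusType describes the overall state a Beat reports to Elastic Agent.
type StatusType int

const (
    // Starting: the Beat is booting and has not applied its configuration
    // yet; Agent takes no action.
    Starting StatusType = iota
    // Healthy: configuration is applied and inputs/outputs are running.
    Healthy
    // Degraded: the Beat keeps running but part of it is failing, e.g. a
    // misconfigured output or one module out of three erroring; Agent
    // takes no action.
    Degraded
    // Failed: unrecoverable internal error, e.g. the disk spool or the
    // registry file cannot be read; Agent will kill and restart the Beat.
    Failed
    // Stopping: the Beat received a stop request and is shutting down.
    Stopping
)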

ph (Contributor, Author) commented Jun 9, 2020

I also want to add something: by only providing a method, this has the side effect of giving Agent the same expectations of Beats and Endpoint, and it would consolidate the expertise in a single place.

Note, I am not expecting us to have the perfect way of tracking how a Beat is behaving. I see this as a step forward that will get better and better with the refactoring done in Beats, and having this will drive that move further.

urso commented Jun 9, 2020

I'm fine with the approach. With an interface we will merge status messages bottom-up; be prepared to receive a big multiline message :)
For the message we might also consider []string, a Stringer-like interface (to build a tree of messages), or removing it (Beats logs should already have the details). But the latter can be provided by wrapping the "simple" RPC callback with a richer interface if we have to.

michalpristas (Contributor) commented:

sounds like a solid plan 👍

ph (Contributor, Author) commented Jun 10, 2020

@andresrc This issue will be closed for us soon. I've created #19115 as a follow-up for the Services team to refine the implementation, if that's OK with you.

andresrc (Contributor) commented:

great, thanks!
