
Che Monitoring #10329

Closed
10 of 21 tasks
yarivlifchuk opened this issue Jul 8, 2018 · 2 comments
Labels
kind/epic A long-lived, PM-driven feature request. Must include a checklist of items that must be completed.

Comments

yarivlifchuk commented Jul 8, 2018

Summary

We propose a system monitoring mechanism that, at the cluster and Pod level, does not require changes to existing Che code. Application monitoring of the Che agents, however, requires the following changes:

  1. Add special HTTP monitoring requests (telemetry), or use the logs and convert them into monitoring metrics by adding a special tag to each record.
  2. Have each agent provide a health check command for monitoring and register its health check configuration policy with the agent manager.
  3. Add a health check agent manager within the Pod for monitoring.
  4. Use custom environment params that are added to the records of the Che agents for customized purposes, e.g. the user’s tenant (customer) id.
  5. Have relevant agents provide a critical external health check command that will be used by the Kubelet livenessProbe to restart the Pod. In addition, add the agent health check configuration as a livenessProbe to the Pod configuration file.

Description

Complementary Che epics:
Tracing - #10298, #10288
Logging - #10290

Background

Monitoring Che Workspace (aka WS) agents is required to anticipate problems and discover bottlenecks in a production environment.
K8S monitoring can be categorized as follows:

Cluster metrics (System Monitor):
  1. Node resource utilization (CPU, memory, disk, network traffic, ...).
  2. Number of available nodes.
  3. Running Pods.
Pods Metrics (System Monitor):
  1. K8S metrics – number of Pod instances vs. expected, in-progress deployments, health checks.
  2. Container metrics – container cpu, network, memory usage, r/w iops.
Application metrics (Application Monitor):
  1. Health check and other customized metrics.

https://logz.io/blog/kubernetes-monitoring

Prometheus solution

There are many possible combinations of node- and cluster-level agents that could comprise a monitoring pipeline. The most popular in K8S is Prometheus, which is part of the CNCF.
It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true.
Prometheus comes with its own dashboard, which is suitable for running ad-hoc queries or quick debugging, but for the best experience it is recommended to integrate it with a visualization backend such as Grafana.
https://www.weave.works/technologies/monitoring-kubernetes-with-prometheus

Prometheus Architecture

Prometheus has a cluster-level agent and a node-level agent (node exporter).
The node exporter is installed as a DaemonSet and gathers machine-level metrics in addition to the metrics exposed by cAdvisor for each container.
The Prometheus server is installed per cluster. It scrapes and stores time series data from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally, runs rules over this data, and generates alerts.
https://prometheus.io/docs/introduction/overview/#architecture

Pushgateway

The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus. The Pushgateway is installed per cluster.
In order to expose metrics of Che agents and running applications, the application needs to send an HTTP POST/PUT with the metric object to the Pushgateway URL.
https://github.com/prometheus/pushgateway
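
For illustration, a minimal sketch of such a push using the Python prometheus_client library; the Pushgateway address, job name, and metric are assumptions, not part of this proposal:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Hypothetical metric: last time a WS agent health check succeeded.
registry = CollectorRegistry()
g = Gauge('ws_agent_last_success_unixtime',
          'Last time the WS agent health check succeeded',
          registry=registry)
g.set_to_current_time()

# Push to the per-cluster Pushgateway (address is an assumption);
# under the hood this issues the HTTP request described above.
push_to_gateway('pushgateway.example:9091', job='ws-agent', registry=registry)
```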

Application Health Checking

Application health checking is required to detect agents that are non-functional from the application's perspective even though the Pod and Node are considered healthy, e.g. due to a deadlock.

External Application Health Check & Recovery

K8S addresses this problem by supporting user-implemented application health checks that are performed by the Kubelet to ensure that the application is operating correctly.
K8S application health check types:

  1. HTTP health check – call a web hook. An HTTP status between 200 and 399 is considered success, anything else failure.
  2. Container Exec – execute a command inside the container. Exit status 0 is considered success, anything else failure.
  3. TCP Socket – open a socket to the container. If the connection is established the container is considered healthy, otherwise it is a failure.

The Kubelet can react to two kinds of probes:

  1. LivenessProbe – if the Kubelet discovers a failure, the container is restarted.
  2. ReadinessProbe – if the Kubelet discovers a failure, the Pod IP is removed from the services for a period.

The container health checks are configured in the livenessProbe/readinessProbe section of the container config.

This can be used as an external health check for critical services.
That way, a system outside of the application itself is responsible for monitoring the application and taking action to fix it.

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
https://kubernetes.io/docs/tutorials/k8s201/#application-health-checking
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
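
For illustration, a minimal sketch of such a livenessProbe (Container Exec type) in the Pod configuration file; the container name, image, and health check command are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ws-agent-pod
spec:
  containers:
  - name: ws-agent                    # hypothetical Che agent container
    image: example/ws-agent:latest    # hypothetical image
    livenessProbe:
      exec:
        # Hypothetical critical health check command provided by the agent;
        # exit status 0 means healthy, anything else triggers a restart.
        command: ["/bin/sh", "-c", "/opt/ws-agent/healthcheck.sh"]
      initialDelaySeconds: 15
      periodSeconds: 20
      failureThreshold: 3
```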

Application Health Check Monitoring

While the Kubelet uses the health check response for a restart action or for removing the Pod's IP, it does not provide a monitoring tool for the different container health checks.

Running agent health check monitoring with requests originating from outside the Pod is not scalable and can create network load; the checks should therefore originate within the Pod.

Each agent should provide a health check command for monitoring. To perform the health checks there should be a dedicated agent (the health check agent manager) that triggers the health check commands at every interval.
Each agent needs to register with the health check agent manager and configure its health check policy.

The agent manager can expose the results in one of the following ways:

  1. Expose them via the cAdvisor endpoint (still in alpha; see below).
  2. Send Prometheus metrics to the Pushgateway Pod.
  3. Send dedicated logs that will be monitored – recommended.

cAdvisor solution – since K8S 1.2, a new feature (still in alpha) allows cAdvisor to collect custom metrics from applications running in containers, provided these metrics are natively exposed in the Prometheus format.
https://github.com/google/cadvisor/blob/master/docs/application_metrics.md
Exposing to cAdvisor is not recommended, as the feature is still in alpha and would add additional dependencies on other components.

Sending Prometheus metrics is less recommended, as it adds complexity by requiring the Pushgateway component.

Using logging [see #10290] for application monitoring is preferred as the more homogeneous option: it uses the existing logging system and results can be correlated with additional information supplied by it. In this case the Pushgateway is not required.
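
As a sketch of this recommended option, a health check result could be emitted as a log record carrying a dedicated monitoring tag together with custom environment params such as the tenant id; the tag and field names below are assumptions, not an agreed format:

```python
import json
import logging
import os
import time

logger = logging.getLogger("ws-agent")

def log_health_metric(agent, healthy):
    # The "monitoring" tag marks the record for the log-based metrics
    # pipeline; "tenant_id" is an example of a custom environment param.
    record = {
        "tag": "monitoring",                       # hypothetical tag
        "metric": "agent_health",
        "agent": agent,
        "healthy": healthy,
        "tenant_id": os.environ.get("TENANT_ID"),  # hypothetical env param
        "timestamp": int(time.time()),
    }
    logger.info(json.dumps(record))
```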

Health check agent manager

The health check agent manager can be implemented as:

  1. An independent agent within the container (see the sketch below).
  2. A HEALTHCHECK instruction within the Dockerfile.
    Docker provides a HEALTHCHECK instruction that checks the container's health by running a command inside the container at a fixed time interval.
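
For the first option, a minimal sketch of an independent manager that lets agents register a health check command and interval, runs the checks periodically, and reports the results via the tagged logs sketched above; all names and the registration mechanism are assumptions:

```python
import subprocess
import threading

class HealthCheckManager:
    """Hypothetical in-Pod manager that runs registered agent health checks."""

    def __init__(self):
        self._checks = []  # (agent name, command, interval in seconds)

    def register(self, agent, command, interval_s):
        # Each agent registers its health check command and policy here.
        self._checks.append((agent, command, interval_s))

    def _run_check(self, agent, command, interval_s):
        # Exit status 0 is considered healthy, mirroring the Container Exec
        # semantics of the Kubelet probes.
        try:
            result = subprocess.run(command, capture_output=True, timeout=10)
            healthy = result.returncode == 0
        except subprocess.TimeoutExpired:
            healthy = False
        log_health_metric(agent, healthy)  # tagged log, see the sketch above
        # Re-schedule the next run according to the agent's policy.
        threading.Timer(interval_s, self._run_check,
                        (agent, command, interval_s)).start()

    def start(self):
        for agent, command, interval_s in self._checks:
            self._run_check(agent, command, interval_s)

# Hypothetical registration by an agent:
manager = HealthCheckManager()
manager.register("exec-agent", ["/opt/exec-agent/healthcheck.sh"], 30)
manager.start()
```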

The proposed solution for monitoring application health checks should also be applied to single central components (e.g. the WS Master) for a homogeneous solution.

Implementation recommendation

  1. System monitoring of the K8S cluster and nodes based on Prometheus.
  2. Application monitoring of WS agents within the container should follow:
    • Sending metrics
      Send the metrics by adding logs to the WS agent with a specific tag that
      indicates that the log is used for monitoring.
    • Custom environment params
      Added to the records of Che agents for customized purposes, e.g. the user’s tenant (customer) id.
    • Internal health check
      Each agent provides a health check command for monitoring.
      In addition, each agent should register with the health check agent manager with a health check
      configuration policy.
    • Health check agent manager
      An agent within the Pod that can be implemented either as an independent agent
      or as a HEALTHCHECK instruction within the Dockerfile (should be further investigated).
    • External health check
      Relevant agents provide a critical health check command to be used by the Kubelet livenessProbe to
      restart the Pod. In addition, the agent should add its health check configuration policy to the
      livenessProbe part of the Pod configuration file.

Implementation

fche commented Aug 1, 2018

FWIW, keeping at least one form of the metrics available as a http-pollable prometheus-exporter url would be pretty future-proof, even if the cAdvisor machinery were to go away.

skabashnyuk commented
Closing. I think we implemented the core part.
