
Che Monitoring #10329

Closed
10 of 21 tasks
yarivlifchuk opened this issue Jul 8, 2018 · 2 comments
Labels
kind/epic A long-lived, PM-driven feature request. Must include a checklist of items that must be completed.

Comments

yarivlifchuk commented Jul 8, 2018

Summary

We propose a system monitoring mechanism that, at the cluster and Pod level, does not require changes to existing Che code. Application monitoring of the Che agents, however, requires the following changes:

  1. Add special HTTP monitoring requests (telemetry), or use the logs and convert them into monitoring metrics by adding a special tag to each record.
  2. Have each agent provide a health check command for monitoring and register its health check configuration policy with the agent manager.
  3. Add a health check agent manager within the Pod for monitoring.
  4. Use custom environment params that are added to the records of the Che agents for customized purposes, e.g. the user’s tenant (customer) id.
  5. Have relevant agents provide a critical external health check command that will be used by the Kubelet livenessProbe to restart the Pod. In addition, add the agent health check configuration as a livenessProbe to the Pod configuration file.

Description

Complementary Che epics:
Tracing - #10298, #10288
Logging - #10290

Background

Monitoring Che Workspace (aka WS) agents is required to anticipate problems and discover bottlenecks in a production environment.
K8S monitoring can be categorized as follows:

Cluster metrics (System Monitor):
  1. Node resource utilization (CPU, memory, disk, network traffic, ...).
  2. Number of available nodes.
  3. Running Pods.
Pods Metrics (System Monitor):
  1. K8S metrics – number of Pod instances vs. expected, in-progress deployments, health checks.
  2. Container metrics – container cpu, network, memory usage, r/w iops.
Application metrics (Application Monitor):
  1. Health check and other customized metrics.

https://logz.io/blog/kubernetes-monitoring

Prometheus solution

There are many possible combinations of node- and cluster-level agents that could comprise a monitoring pipeline. The most popular in K8S is Prometheus, which is part of the CNCF.
It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true.
Prometheus comes with its own dashboard, which is suitable for running ad-hoc queries or quick debugging, but for the best experience it is recommended to integrate it with a visualization backend such as Grafana.
https://www.weave.works/technologies/monitoring-kubernetes-with-prometheus

Prometheus Architecture

Prometheus has a cluster-level agent and a node-level agent (node exporter).
The node exporter is installed as a DaemonSet and gathers machine-level metrics in addition to the metrics exposed by cAdvisor for each container.
The Prometheus server is installed per cluster. It scrapes and stores time series data from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally, runs rules over this data, and generates alerts.
https://prometheus.io/docs/introduction/overview/#architecture

Pushgateway

The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus. The Pushgateway is installed per cluster.
In order to expose metrics of Che agents and running applications, the application needs to send an HTTP POST/PUT with the metric object to the Pushgateway URL.
https://github.com/prometheus/pushgateway
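
For illustration, a minimal sketch of such a push using the Python prometheus_client library; the Pushgateway address, job name, and metric are assumptions, not part of this proposal:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Hypothetical metric: last time a WS agent health check succeeded.
registry = CollectorRegistry()
g = Gauge('ws_agent_last_success_unixtime',
          'Last time the WS agent health check succeeded',
          registry=registry)
g.set_to_current_time()

# Push to the per-cluster Pushgateway (address is an assumption);
# under the hood this issues the HTTP request described above.
push_to_gateway('pushgateway.example:9091', job='ws-agent', registry=registry)
```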

Application Health Checking

Application health checking is required to detect agents that are non-functional from the application's perspective even though the Pod and Node are considered healthy, e.g. due to a deadlock.

External Application Health Check & Recovery

K8S addresses this problem by supporting user-implemented application health checks that are performed by the Kubelet to ensure that the application is operating correctly.
K8S application health check types:

  1. HTTP health check – call a web hook. An HTTP status between 200 and 399 is considered success, anything else failure.
  2. Container Exec – execute a command inside the container. Exit status 0 is considered success, anything else failure.
  3. TCP Socket – open a socket to the container. If the connection is established the container is considered healthy, otherwise it is a failure.

The Kubelet can react to two kinds of probes:

  1. LivenessProbe – if the Kubelet discovers a failure, the container is restarted.
  2. ReadinessProbe – if the Kubelet discovers a failure, the Pod IP is removed from the services for a period.

The container health checks are configured in the livenessProbe/readinessProbe section of the container config.

This can be used as an external health check for critical services.
That way, a system outside of the application itself is responsible for monitoring the application and taking action to fix it.

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
https://kubernetes.io/docs/tutorials/k8s201/#application-health-checking
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
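
For illustration, a minimal sketch of such a livenessProbe (Container Exec type) in the Pod configuration file; the container name, image, and health check command are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ws-agent-pod
spec:
  containers:
  - name: ws-agent                    # hypothetical Che agent container
    image: example/ws-agent:latest    # hypothetical image
    livenessProbe:
      exec:
        # Hypothetical critical health check command provided by the agent;
        # exit status 0 means healthy, anything else triggers a restart.
        command: ["/bin/sh", "-c", "/opt/ws-agent/healthcheck.sh"]
      initialDelaySeconds: 15
      periodSeconds: 20
      failureThreshold: 3
```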

Application Health Check Monitoring

While the Kubelet uses the health check response for a restart action or for removing the Pod's IP, it does not provide a monitoring tool for the different container health checks.

Running agent health check monitoring with requests originating from outside the Pod is not scalable and can create network load; the checks should therefore originate within the Pod.

Each agent should provide a health check command for monitoring. To perform the health checks there should be a dedicated agent (the health check agent manager) that triggers the health check commands at every interval.
Each agent needs to register with the health check agent manager and configure its health check policy.

The agent manager can expose the results in one of the following ways:

  1. Expose them via the cAdvisor endpoint (still in alpha; see below).
  2. Send Prometheus metrics to the Pushgateway Pod.
  3. Send dedicated logs that will be monitored – recommended.

cAdvisor solution – since K8S 1.2, a new feature (still in alpha) allows cAdvisor to collect custom metrics from applications running in containers, provided these metrics are natively exposed in the Prometheus format.
https://github.com/google/cadvisor/blob/master/docs/application_metrics.md
Exposing to cAdvisor is not recommended, as the feature is still in alpha and would add additional dependencies on other components.

Sending Prometheus metrics is less recommended, as it adds complexity by requiring the Pushgateway component.

Using logging [see #10290] for application monitoring is preferred as the more homogeneous option: it uses the existing logging system and results can be correlated with additional information supplied by it. In this case the Pushgateway is not required.
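
As a sketch of this recommended option, a health check result could be emitted as a log record carrying a dedicated monitoring tag together with custom environment params such as the tenant id; the tag and field names below are assumptions, not an agreed format:

```python
import json
import logging
import os
import time

logger = logging.getLogger("ws-agent")

def log_health_metric(agent, healthy):
    # The "monitoring" tag marks the record for the log-based metrics
    # pipeline; "tenant_id" is an example of a custom environment param.
    record = {
        "tag": "monitoring",                       # hypothetical tag
        "metric": "agent_health",
        "agent": agent,
        "healthy": healthy,
        "tenant_id": os.environ.get("TENANT_ID"),  # hypothetical env param
        "timestamp": int(time.time()),
    }
    logger.info(json.dumps(record))
```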

Health check agent manager

The health check agent manager can be implemented as:

  1. An independent agent within the container (see the sketch below).
  2. A HEALTHCHECK instruction within the Dockerfile.
    Docker provides a HEALTHCHECK instruction that checks the container's health by running a command inside the container at a fixed time interval.
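
For the first option, a minimal sketch of an independent manager that lets agents register a health check command and interval, runs the checks periodically, and reports the results via the tagged logs sketched above; all names and the registration mechanism are assumptions:

```python
import subprocess
import threading

class HealthCheckManager:
    """Hypothetical in-Pod manager that runs registered agent health checks."""

    def __init__(self):
        self._checks = []  # (agent name, command, interval in seconds)

    def register(self, agent, command, interval_s):
        # Each agent registers its health check command and policy here.
        self._checks.append((agent, command, interval_s))

    def _run_check(self, agent, command, interval_s):
        # Exit status 0 is considered healthy, mirroring the Container Exec
        # semantics of the Kubelet probes.
        try:
            result = subprocess.run(command, capture_output=True, timeout=10)
            healthy = result.returncode == 0
        except subprocess.TimeoutExpired:
            healthy = False
        log_health_metric(agent, healthy)  # tagged log, see the sketch above
        # Re-schedule the next run according to the agent's policy.
        threading.Timer(interval_s, self._run_check,
                        (agent, command, interval_s)).start()

    def start(self):
        for agent, command, interval_s in self._checks:
            self._run_check(agent, command, interval_s)

# Hypothetical registration by an agent:
manager = HealthCheckManager()
manager.register("exec-agent", ["/opt/exec-agent/healthcheck.sh"], 30)
manager.start()
```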

The proposed solution for monitoring application health checks should also be applied to single central components (e.g. the WS Master) for a homogeneous solution.

Implementation recommendation

  1. System monitoring of the K8S cluster and nodes based on Prometheus.
  2. Application monitoring of WS agents within the container should follow:
    • Sending metrics
      Send the metrics by adding logs to the WS agent with a specific tag that
      indicates that the log is used for monitoring.
    • Custom environment params
      Added to the records of Che agents for customized purposes, e.g. the user’s tenant (customer) id.
    • Internal health check
      Each agent provides a health check command for monitoring.
      In addition, each agent should register with the health check agent manager with a health check
      configuration policy.
    • Health check agent manager
      An agent within the Pod that can be implemented either as an independent agent
      or as a HEALTHCHECK instruction within the Dockerfile (should be further investigated).
    • External health check
      Relevant agents provide a critical health check command to be used by the Kubelet livenessProbe to
      restart the Pod. In addition, the agent should add its health check configuration policy to the
      livenessProbe part of the Pod configuration file.

Implementation

fche commented Aug 1, 2018

FWIW, keeping at least one form of the metrics available as a http-pollable prometheus-exporter url would be pretty future-proof, even if the cAdvisor machinery were to go away.

skabashnyuk commented
Closing. I think we implemented the core part.
