Observability Transport #310

Closed
DinaBelova opened this issue Sep 16, 2024 · 6 comments

Collaborator

DinaBelova commented Sep 16, 2024

Goals

The observability solution needs to support the scalable transport and consumption of large quantities of data collected from highly distributed sources. The data must then be processed and stored by multiple backend solutions, allowing different ways to report on and analyze the collected data.

The transmission solution needs to support the following key capabilities:

  • Scale effectively as the number of managed clusters and instrumentation scales
  • Recover gracefully from failure of the transmission system
  • Recover gracefully from connectivity failure
  • Manage and reduce bandwidth consumption, allowing for queuing and compression of data
  • Use open standard interfaces
  • Aggregate data within a region or user-designated domain and then forward the data to a central location
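
As a rough illustration of the queuing, compression, and failure-recovery capabilities listed above, here is a minimal Python sketch. It is not the transport this epic proposes, and every name in it (`BufferedSender`, `send`, `max_queue`, and so on) is hypothetical. It assumes records are JSON-serializable dicts and that a connectivity failure shows up as a `ConnectionError`.

```python
import gzip
import itertools
import json
import time
from collections import deque
from typing import Callable, Deque


class BufferedSender:
    """Hypothetical sketch: queue records, ship gzip-compressed batches, retry on failure."""

    def __init__(self, send: Callable[[bytes], None], max_queue: int = 1000,
                 batch_size: int = 100, max_retries: int = 3,
                 base_delay: float = 0.5) -> None:
        self._send = send  # assumed to raise ConnectionError when the link is down
        # Bounded queue: when full, the oldest record is dropped to cap memory use.
        self._queue: Deque[dict] = deque(maxlen=max_queue)
        self._batch_size = batch_size
        self._max_retries = max_retries
        self._base_delay = base_delay

    def enqueue(self, record: dict) -> None:
        self._queue.append(record)

    def pending(self) -> int:
        return len(self._queue)

    def flush(self) -> int:
        """Send queued records in batches and return how many were delivered."""
        delivered = 0
        while self._queue:
            batch = list(itertools.islice(self._queue, self._batch_size))
            payload = gzip.compress(json.dumps(batch).encode("utf-8"))
            if not self._send_with_retry(payload):
                break  # leave the records queued for a later flush
            for _ in batch:
                self._queue.popleft()
            delivered += len(batch)
        return delivered

    def _send_with_retry(self, payload: bytes) -> bool:
        """Try up to max_retries times with exponential backoff between attempts."""
        for attempt in range(self._max_retries):
            try:
                self._send(payload)
                return True
            except ConnectionError:
                if attempt < self._max_retries - 1:
                    time.sleep(self._base_delay * (2 ** attempt))
        return False
```

Records are removed from the queue only after a successful send, so a failed flush loses nothing beyond what the bounded queue itself evicts. A production transport would more likely rely on the queuing, retry, and compression features of an existing collector than on hand-written code like this.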

Major deliverables

  • Scalable solution to transport data collected by the instrumentation solution (link tbd)
  • Ability to collect and aggregate data in a region or within a user-designated domain
  • Ability to forward aggregated data to a central location

Acceptance criteria

  • The deployment of the transmission solution is scaled with the deployment of the instrumentation
  • The centralized components of the system can be deployed on a cluster of the platform lead's choosing

Assumptions
The instrumentation of the cluster will be performed using a solution that is pluggable and can support different approaches to the transport of data

User stories
As a platform lead, I want to be able to collect data from the cluster instrumentation in a scalable way so that I can move large quantities of observability data as efficiently as possible to a central location

@DinaBelova DinaBelova added the epic Large body of work, can be broken down into individual issues label Sep 16, 2024

pbasov commented Sep 16, 2024

Build a template that deploys otel-collector-k8s into a child cluster and starts sending metrics, logs, and traces to the mothership over OTLP.
https://github.com/open-telemetry/opentelemetry-helm-charts/tree/main/charts/opentelemetry-kube-stack

  • enable node-exporter
  • enable presets: logsCollection, kubernetesAttributes, kubeletMetrics, kubernetesEvents, clusterMetrics

The mothership should have an OTLP collector to receive the data and to collect local metrics as well. Configure processors to attribute all logs to specific clusters.
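
To make the cluster-attribution step concrete, here is a small Python sketch of the logic such a processor would apply. In the actual setup this would presumably be collector processor configuration rather than Python. The record shape (a dict with an `attributes` map) and the function name `attribute_to_cluster` are assumptions made for illustration; `k8s.cluster.name` is the OpenTelemetry semantic-convention attribute for a cluster name.

```python
from typing import Dict, Iterable, List

# OpenTelemetry semantic-convention key for the cluster a record came from.
CLUSTER_ATTR = "k8s.cluster.name"


def attribute_to_cluster(records: Iterable[Dict], cluster_name: str,
                         overwrite: bool = False) -> List[Dict]:
    """Return copies of the records tagged with the originating cluster name.

    Hypothetical helper: an existing cluster attribute is kept unless
    overwrite is True, and the input records are never mutated.
    """
    tagged = []
    for record in records:
        attrs = dict(record.get("attributes", {}))
        if overwrite or CLUSTER_ATTR not in attrs:
            attrs[CLUSTER_ATTR] = cluster_name
        tagged.append({**record, "attributes": attrs})
    return tagged
```

Keeping an attribute that is already present means a gateway does not relabel logs that a child-cluster agent has already tagged. Whether that is the right default depends on where in the pipeline the tagging is meant to happen.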

Deploy LGTM stack manually with helm on the mothership at first, pipe metrics to storage.

@DinaBelova DinaBelova changed the title [placeholder] observability instrumentation [placeholder] observability transport Sep 16, 2024
@DinaBelova DinaBelova changed the title [placeholder] observability transport Observability transport Sep 16, 2024
@DinaBelova DinaBelova changed the title Observability transport Observability Transport Sep 16, 2024

pbasov commented Sep 24, 2024


pbasov commented Sep 24, 2024

  • Clickhouse operator config
  • OTEL collector operator config
  • Clickhouse deployments
  • OTEL collector configs for agent and gateway deployments
  • CHProxy config
  • Test hosted control plane templates
  • Build child cluster template chart


pbasov commented Sep 26, 2024

Developing this on our GPU cluster, PoC commit:
https://github.com/Mirantis/ai-research/commit/6baba7465e151ffc7b1b44f2e207115e417451a0


pbasov commented Oct 15, 2024

@DinaBelova
Collaborator Author

closing this one, will continue work in #488 and #489

Projects
Status: Done