Add usage metrics (Hodometer) docs (#343)
* Move operational metrics docs under general metrics page

* Capitalise proper nouns & acronyms/initialisms

* Use one sentence per line for operational metrics

* Add empty docs page for usage metrics

* Add list of metrics to docs from Hodometer README

* Add intro & privacy sections to Hodometer docs

* Add link to table of metrics in Hodometer docs

* Add section on metrics levels in Hodometer docs

* Change list of metrics level to table

* Use metrics levels name corresponding to allowed values in table

* Add table of Hodometer env vars to usage metrics docs

* Add sed script adding Helm templating for Hodometer deployment

* Split sed script over multiple lines for legibility

* Simplify sed script for maintainability

Inserting before and appending after a multi-line string is simpler than replacing parts of it.
The output is now shared across branches, regardless of whether or not the deployment is for Hodometer.
A further (arguable) benefit is that YAML document markers are per-document, rather than potentially
creating an empty document; this would be peculiar for anyone viewing Helm template output.

* Refactor Helm template paths to Make vars for concision & consistency

* Add enabled field for Hodometer in Helm values template

* Add Helm templating for Hodometer installation toggle in Makefile

* Regenerate Helm charts

* Update sed script to apply to all resources with hodometer in the name

* Regenerate Helm charts

* Add Helm/k8s configuration section for Hodometer

* Add tab for Helm install of Hodometer for k8s

* Add tab for raw YAML install of Hodometer for k8s

* Shorten section heading for clarity in usage metrics docs

* Add sentence on configuring k8s raw YAML manifests in usage metrics docs

* Add docs section on Compose configuration for Hodometer

* Add section on performance & resource requirements of Hodometer

* Add section on using extra publish URLs with Hodometer

* Use monospace for Hodometer metric names in README table

* Add brief descriptions of op & usage metrics on metrics overview page

* Hide TOC on metrics page as links are given above

* Add architecture diagram for Hodometer

* Add architecture section to usage metrics docs

* Fix typo
agrski authored Jul 6, 2022
1 parent 0ef567f commit 633f1b6
Showing 10 changed files with 257 additions and 52 deletions.
47 changes: 15 additions & 32 deletions docs/source/contents/metrics/index.md
@@ -1,38 +1,21 @@
# Metrics

While the system is running we collect metrics via prometheus that allow users to observe differnet aspects of SCv2
with regards to throughout, latency, memory, cpu etc.
There are two kinds of metrics present in Seldon Core v2:
* [operational metrics](./operational.md)
* [usage metrics](./usage.md)

This is in addition to the standard kubernetes like metrics that are scraped by prometheus.
Operational metrics describe the performance of components in the system.
Examples of common operational considerations are memory consumption and CPU usage, request latency and throughput, and cache utilisation rates.
Generally speaking, these are the metrics system administrators, operations teams, and engineers will be interested in.

There is a grafana dashboard (referenced below) that provides an overview of the system.

## List of SCv2 metrics

The list of SCv2 metrics that we are compiling is:

```{literalinclude} ../../../../scheduler/pkg/metrics/prometheus.go
:language: golang
:start-after: // start list of metrics
:end-before: // end list of metrics
```

Many of these metrics are model level counters and gauges. We also aggregate some of these metrics to speed up the display of graphs.

This is experimental and these metrics are bound to change to reflect the trends we want to capture as we get more information about the usage of the system.

## Grafana dashboard

We have a prebuilt grafana dashboard that makes use of many of the metrics that we expose.

![kafka](dashboard.png)

### Local Use

Grafana and Prometheus are available when you run Seldon locally. You will be able to connect to the Grafana dashboard at `http://localhost:3000`. Prometheus will be available at `http://localhost:9090`.

### Kubernetes Installation

Download the dashboard from [SCv2 dashboard](https://github.com/SeldonIO/seldon-core-v2/blob/master/prometheus/dashboards/Seldon%20Core%20Model%20Mesh%20Monitoring.json) and import it in grafana, making sure that the data source is pointing to the correct prometheus store. Find more information on how to import the dashboard [here](https://grafana.com/docs/grafana/latest/dashboards/export-import/)
Usage metrics describe the system at a higher and less dynamic level.
Examples include the number of deployed servers and models, and component versions.
These are not typically metrics that engineers need insight into, but may be relevant to platform providers and operations teams.

```{toctree}
:maxdepth: 1
:hidden:
operational.md
usage.md
```
41 changes: 41 additions & 0 deletions docs/source/contents/metrics/operational.md
@@ -0,0 +1,41 @@
# Operational Metrics

While the system is running we collect metrics via Prometheus that allow users to observe different aspects of SCv2 with regard to throughput, latency, memory, CPU etc.

This is in addition to the standard Kubernetes metrics that are scraped by Prometheus.

There is a Grafana dashboard (referenced below) that provides an overview of the system.

## List of SCv2 metrics

The list of SCv2 metrics that we are compiling is:

```{literalinclude} ../../../../scheduler/pkg/metrics/prometheus.go
:language: golang
:start-after: // start list of metrics
:end-before: // end list of metrics
```

Many of these metrics are model level counters and gauges.
We also aggregate some of these metrics to speed up the display of graphs.

This is experimental and these metrics are bound to change to reflect the trends we want to capture as we get more information about the usage of the system.

## Grafana dashboard

We have a prebuilt Grafana dashboard that makes use of many of the metrics that we expose.

![kafka](dashboard.png)

### Local Use

Grafana and Prometheus are available when you run Seldon locally.
You will be able to connect to the Grafana dashboard at `http://localhost:3000`.
Prometheus will be available at `http://localhost:9090`.

### Kubernetes Installation

Download the dashboard from [SCv2 dashboard](https://github.com/SeldonIO/seldon-core-v2/blob/master/prometheus/dashboards/Seldon%20Core%20Model%20Mesh%20Monitoring.json) and import it in Grafana, making sure that the data source is pointing to the correct Prometheus store.
Find more information on how to import the dashboard [here](https://grafana.com/docs/grafana/latest/dashboards/export-import/).


145 changes: 145 additions & 0 deletions docs/source/contents/metrics/usage.md
@@ -0,0 +1,145 @@
# Usage Metrics

There are various interesting system metrics about how Seldon Core v2 is used.
These metrics can be recorded **anonymously** and sent to Seldon by a lightweight, optional, stand-alone component called Hodometer.

When provided, these metrics will be used to understand the adoption of Seldon Core v2 and how people interact with it.
For example, it helps to know how many clusters Seldon Core v2 is running on, whether it is used in Kubernetes or for local development, and how many people are benefitting from features like multi-model serving.

## Architecture

![Hodometer architecture](./hodometer-architecture.png)

Hodometer is not an integral part of Seldon Core v2, but rather an independent component which connects to the public APIs of the Seldon Core v2 scheduler.
If deployed in Kubernetes, it will also try to request some basic information from the Kubernetes API.

Recorded metrics are sent to Seldon and, optionally, to any [additional endpoints](#extra-publish-urls) you define.

## Privacy

Hodometer was explicitly designed with privacy of user information and transparency of implementation in mind.

It does not record any sensitive or identifying information.
For example, it has no knowledge of IP addresses, model names, or user information.

All information sent to Seldon is anonymised with a completely random cluster identifier.

Hodometer supports [different information levels](#metrics-levels), so you have full control over what metrics are provided to Seldon, if any.

For transparency, the implementation is fully open-source and designed to be easy to read.
The full source code is available [here](https://github.com/seldonio/seldon-core-v2/tree/master/hodometer), with metrics defined in code [here](https://github.com/seldonio/seldon-core-v2/tree/master/hodometer/pkg/hodometer/metrics.go).
See [below](#list-of-metrics) for an equivalent table of metrics.

## Performance

Metrics are collected as periodic snapshots a few times per day.
They are lightweight to collect, coming mostly from the Seldon Core v2 scheduler, and are heavily aggregated.
As such, they should have minimal impact on CPU, memory, and network consumption.

Hodometer does not store anything it records, so it does not need any persistent storage.
As a result, it should not be considered a replacement for tools like Prometheus.

## Configuration

### Metrics levels

Hodometer supports 3 different metrics levels:

| Level | Description |
| --- | --- |
| Cluster | Basic information about the Seldon Core v2 installation |
| Resource | High-level information about which Seldon Core v2 resources are used |
| Feature | More detailed information about how resources are used and whether or not certain feature flags are enabled |

Alternatively, usage metrics can be completely disabled.
To do so, simply remove any existing deployment of Hodometer or disable it in the installation for your environment, discussed below.
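
As a rough illustration of how the levels relate to the metrics listed later on this page, the sketch below filters a handful of metric names by level.
It assumes that levels are cumulative (`feature` includes `resource`, which includes `cluster`); that ordering is an assumption made for illustration and is not stated above.
`metrics_for_level` and `METRIC_LEVELS` are hypothetical names and are not part of Hodometer.

```python
# Levels ordered from least to most detailed.
# ASSUMPTION: each level also includes the metrics of the levels before it.
LEVELS = ["cluster", "resource", "feature"]

# A small subset of the metrics table, mapping metric name -> level.
METRIC_LEVELS = {
    "cluster_id": "cluster",
    "node_count": "cluster",
    "model_count": "resource",
    "server_count": "resource",
    "gpu_enabled_count": "feature",
}


def metrics_for_level(level):
    """Return the metric names that would be recorded at the given level."""
    if level not in LEVELS:
        raise ValueError(f"unknown metrics level: {level!r}")
    # Everything up to and including the requested level is allowed.
    allowed = LEVELS[: LEVELS.index(level) + 1]
    return [name for name, lvl in METRIC_LEVELS.items() if lvl in allowed]
```

Under this assumption, choosing `cluster` would record only the two cluster-level entries in the mapping, while `feature` would record all five.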

### Options

The following environment variables control the behaviour of Hodometer, regardless of the environment it is installed in.

| Environment variable | Format | Example | Description |
| --- | --- | --- | --- |
| `METRICS_LEVEL` | string | feature | Level of detail for recorded metrics; one of `feature`, `resource`, or `cluster` |
| `EXTRA_PUBLISH_URLS` | comma-separated list of URLs | http://my-endpoint-1:8000,http://my-endpoint-2:8000 | Additional endpoints to publish metrics to |
| `SCHEDULER_HOST` | string | seldon-scheduler | Hostname for Seldon Core v2 scheduler |
| `SCHEDULER_PORT` | integer | 9004 | Port for Seldon Core v2 scheduler |
| `LOG_LEVEL` | string | info | Level of detail for application logs |
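
To make the formats in the table concrete, here is a minimal sketch of how a program might read and validate these variables.
It is not Hodometer's own parsing code.
The fallback values are simply the examples from the table and are assumed for illustration, not confirmed defaults; `HodometerConfig` and `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import List

VALID_LEVELS = ("cluster", "resource", "feature")


@dataclass
class HodometerConfig:
    metrics_level: str
    extra_publish_urls: List[str]
    scheduler_host: str
    scheduler_port: int
    log_level: str


def load_config(env=None):
    """Build a config from environment variables (or a dict standing in for them)."""
    env = os.environ if env is None else env

    # ASSUMPTION: fallback values below are the table's examples, not real defaults.
    # Lower-casing the level is a convenience of this sketch.
    level = env.get("METRICS_LEVEL", "feature").lower()
    if level not in VALID_LEVELS:
        raise ValueError(f"METRICS_LEVEL must be one of {VALID_LEVELS}, got {level!r}")

    # Comma-separated list; empty entries and surrounding whitespace are dropped.
    raw_urls = env.get("EXTRA_PUBLISH_URLS", "")
    urls = [url.strip() for url in raw_urls.split(",") if url.strip()]

    return HodometerConfig(
        metrics_level=level,
        extra_publish_urls=urls,
        scheduler_host=env.get("SCHEDULER_HOST", "seldon-scheduler"),
        scheduler_port=int(env.get("SCHEDULER_PORT", "9004")),
        log_level=env.get("LOG_LEVEL", "info"),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to exercise in tests.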

### Kubernetes

Hodometer is installed as a separate deployment, by default in the same namespace as the rest of the Seldon components.

`````{tabs}
````{group-tab} Helm
If you install Seldon Core v2 by [Helm chart](../getting-started/kubernetes-installation/helm.md), there are values corresponding to the key environment variables discussed [above](#options).
These Helm values and their equivalents are provided below:
| Helm value | Environment variable |
| --- | --- |
| `hodometer.metricsLevel` | `METRICS_LEVEL` |
| `hodometer.extraPublishUrls` | `EXTRA_PUBLISH_URLS` |
| `hodometer.logLevel` | `LOG_LEVEL` |
If you do not want usage metrics to be recorded, you can disable Hodometer via the `hodometer.enabled` Helm value.
The following command disables collection of usage metrics in fresh installations and also serves to remove Hodometer from an existing installation:
```bash
helm upgrade --install seldon-core-v2 k8s/helm-charts/seldon-core-v2-setup \
--namespace seldon-mesh \
--set hodometer.enabled=false
```
```{note}
It is good practice to set Helm values in a values file.
These can be applied by using the `-f <filename>` switch when running Helm.
```
````
````{group-tab} YAML
The [raw YAML](../getting-started/kubernetes-installation/raw.md) approach to installing Seldon Core v2 provides an opinionated, pre-configured set of manifests.
Hodometer is automatically enabled with this approach.
You can disable Hodometer by manually removing the appropriate resources before applying the manifests.
If you have an existing installation, you will also need to delete the deployment and, optionally, any of the RBAC resources.
As there is no templating with the raw YAML manifests, you would need to set configuration environment variables manually before deploying them to a cluster.
````
`````

### Docker Compose

The [Compose setup](../getting-started/docker-installation/index.md) provides a pre-configured and opinionated, yet still flexible, approach to using Seldon Core v2.

Hodometer is defined as a service called `hodometer` in the Docker Compose manifest.
It is automatically enabled when running as per the installation instructions.

You can disable Hodometer in Docker Compose by removing the corresponding service from the base manifest.
Alternatively, you can gate it behind a [profile](https://docs.docker.com/compose/profiles/).
If the service is already running, you can stop it directly using `docker-compose stop ...`.

Configuration can be provided by environment variables when running `make` or directly invoking `docker-compose`.
The available variables are defined in the Docker Compose environment file, prefixed with `HODOMETER_`.

### Extra publish URLs

Hodometer can be instructed to publish metrics not only to Seldon, but also to any extra endpoints you specify.
This is controlled by the `EXTRA_PUBLISH_URLS` environment variable, which expects a comma-separated list of HTTP-compatible URLs.

You might choose to use this for your own usage monitoring.
For example, you could capture these metrics and expose them to Prometheus or another monitoring system using your own service.

Metrics are recorded in MixPanel-compatible format, which employs a highly flexible JSON schema.

For an example of how to define your own metrics listener, see the [`receiver` Go package](https://github.com/SeldonIO/seldon-core-v2/tree/master/hodometer/pkg/receiver) in the `hodometer` sub-project.
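
For readers who do not work in Go, a comparable listener can be sketched with Python's standard library.
This is a minimal sketch under stated assumptions: that metrics arrive as a JSON body in an HTTP `POST` request, and that keeping them in memory is acceptable.
Neither detail is confirmed above, and `MetricsHandler`, `received_events`, and `start_listener` are hypothetical names.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory store of every payload received so far (illustrative only).
received_events = []


class MetricsHandler(BaseHTTPRequestHandler):
    # ASSUMPTION: metrics are delivered as a JSON body via HTTP POST.
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            payload = json.loads(body)
        except ValueError:
            # Covers malformed JSON and undecodable bytes.
            self.send_response(400)
            self.end_headers()
            return
        received_events.append(payload)
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        # Silence the default per-request logging to stderr.
        pass


def start_listener(port=0):
    """Start the listener in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

The address of such a listener would then be added to `EXTRA_PUBLISH_URLS`; call `server.shutdown()` on the returned object to stop it.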

## List of metrics

```{include} ../../../../hodometer/README.md
:start-after: <!-- start list metrics -->
:end-before: <!-- end list metrics -->
```
38 changes: 21 additions & 17 deletions hodometer/README.md
@@ -30,25 +30,29 @@ There are three levels of metrics that can be enabled:

The set of metrics available at each level is:

<!-- start list metrics -->

| Metric name | Level | Format | Notes |
| --- | --- | --- | --- |
| cluster_id | cluster | UUID | A random identifier for this cluster for de-duplication |
| seldon_core_version | cluster | Version number | E.g. 1.2.3 |
| is_global_installation | cluster | Boolean | Whether installation is global or namespaced |
| is_kubernetes | cluster | Boolean | Whether or not the installation is in Kubernetes |
| kubernetes_version | cluster | Version number | Kubernetes server version, if inside Kubernetes |
| node_count | cluster | Integer | Number of nodes in the cluster, if inside Kubernetes |
| model_count | resource | Integer | Number of `Model` resources |
| pipeline_count | resource | Integer | Number of `Pipeline` resources |
| experiment_count | resource | Integer | Number of `Experiment` resources |
| server_count | resource | Integer | Number of `Server` resources |
| server_replica_count | resource | Integer | Total number of `Server` resource replicas |
| multimodel_enabled_count | feature | Integer | Number of `Server` resources with multi-model serving enabled |
| overcommit_enabled_count | feature | Integer | Number of `Server` resources with overcommitting enabled |
| gpu_enabled_count | feature | Integer | Number of `Server` resources with GPUs attached |
| inference_server_name | feature | String | Name of inference server, e.g. MLServer or Triton |
| server_cpu_cores_sum | feature | Float | Total of CPU limits across all `Server` resource replicas, in cores |
| server_memory_gb_sum | feature | Float | Total of memory limits across all `Server` resource replicas, in GiB |
| `cluster_id` | cluster | UUID | A random identifier for this cluster for de-duplication |
| `seldon_core_version` | cluster | Version number | E.g. 1.2.3 |
| `is_global_installation` | cluster | Boolean | Whether installation is global or namespaced |
| `is_kubernetes` | cluster | Boolean | Whether or not the installation is in Kubernetes |
| `kubernetes_version` | cluster | Version number | Kubernetes server version, if inside Kubernetes |
| `node_count` | cluster | Integer | Number of nodes in the cluster, if inside Kubernetes |
| `model_count` | resource | Integer | Number of `Model` resources |
| `pipeline_count` | resource | Integer | Number of `Pipeline` resources |
| `experiment_count` | resource | Integer | Number of `Experiment` resources |
| `server_count` | resource | Integer | Number of `Server` resources |
| `server_replica_count` | resource | Integer | Total number of `Server` resource replicas |
| `multimodel_enabled_count` | feature | Integer | Number of `Server` resources with multi-model serving enabled |
| `overcommit_enabled_count` | feature | Integer | Number of `Server` resources with overcommitting enabled |
| `gpu_enabled_count` | feature | Integer | Number of `Server` resources with GPUs attached |
| `inference_server_name` | feature | String | Name of inference server, e.g. MLServer or Triton |
| `server_cpu_cores_sum` | feature | Float | Total of CPU limits across all `Server` resource replicas, in cores |
| `server_memory_gb_sum` | feature | Float | Total of memory limits across all `Server` resource replicas, in GiB |

<!-- end list metrics -->

## Privacy

12 changes: 9 additions & 3 deletions k8s/Makefile
@@ -38,12 +38,18 @@ create-yaml:
kustomize build kustomize/components/ > yaml/seldon-v2-components.yaml
kustomize build kustomize/servers/ > yaml/seldon-v2-servers.yaml

HELM_CRD_BASE := helm-charts/seldon-core-v2-crds/templates
HELM_COMPONENTS_BASE := helm-charts/seldon-core-v2-setup/templates

.PHONY: create-helm-charts
create-helm-charts:
sed "s/#TAG_VERSION_PLACEHOLDER#/${CUSTOM_IMAGE_TAG}/g" helm-charts/seldon-core-v2-setup/values.yaml.template > helm-charts/seldon-core-v2-setup/values.yaml
kustomize build kustomize/helm-crds/ > helm-charts/seldon-core-v2-crds/templates/seldon-v2-crds.yaml
kustomize build kustomize/helm-components/ > helm-charts/seldon-core-v2-setup/templates/seldon-v2-components.yaml
kustomize build kustomize/helm-servers/ > helm-charts/seldon-core-v2-setup/templates/seldon-v2-servers.yaml
kustomize build kustomize/helm-crds/ > ${HELM_CRD_BASE}/seldon-v2-crds.yaml
kustomize build kustomize/helm-components/ > ${HELM_COMPONENTS_BASE}/seldon-v2-components.yaml
kustomize build kustomize/helm-servers/ > ${HELM_COMPONENTS_BASE}/seldon-v2-servers.yaml
sed -n --file=add-helm-toggle-hodometer.sed ${HELM_COMPONENTS_BASE}/seldon-v2-components.yaml \
> ${HELM_COMPONENTS_BASE}/.components.yaml
mv ${HELM_COMPONENTS_BASE}/.components.yaml ${HELM_COMPONENTS_BASE}/seldon-v2-components.yaml

.PHONY: set-chart-version
set-chart-version:
15 changes: 15 additions & 0 deletions k8s/add-helm-toggle-hodometer.sed
@@ -0,0 +1,15 @@
/^---/ {
x;
/name: .*hodometer/ {
i \{\{ if .Values.hodometer.enabled \}\}
a \{\{ end \}\}
};
1!p;
d
};
H;
${
g;
p;
}
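
For readers unfamiliar with sed's hold space, the effect of the script above can be approximated in Python.
This is a simplified sketch, not a byte-for-byte port: it treats each `---` line as the start of a new YAML document and wraps any document containing a line that matches `name: .*hodometer` in the Helm conditional.
Unlike the sed script, which appears to print the final document without checking it, the sketch checks every document.
The function names are hypothetical.

```python
import re

IF_LINE = "{{ if .Values.hodometer.enabled }}"
END_LINE = "{{ end }}"
NAME_PATTERN = re.compile(r"name: .*hodometer")


def split_documents(manifest):
    """Group lines into YAML documents; a '---' line opens a new document."""
    documents, current = [], []
    for line in manifest.splitlines():
        if line.startswith("---") and current:
            documents.append(current)
            current = []
        current.append(line)
    if current:
        documents.append(current)
    return documents


def add_hodometer_toggle(manifest):
    """Wrap every document that names a hodometer resource in a Helm if/end pair."""
    output = []
    for document in split_documents(manifest):
        if any(NAME_PATTERN.search(line) for line in document):
            # The '---' marker stays inside the conditional, so disabling
            # Hodometer does not leave an empty document behind.
            output += [IF_LINE, *document, END_LINE]
        else:
            output += document
    return "\n".join(output) + "\n"
```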

@@ -1,8 +1,11 @@
{{ if .Values.hodometer.enabled }}

apiVersion: v1
kind: ServiceAccount
metadata:
name: hodometer
namespace: seldon-mesh
{{ end }}
---
apiVersion: v1
kind: ServiceAccount
@@ -95,6 +98,7 @@ rules:
- get
- list
- watch
{{ if .Values.hodometer.enabled }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
@@ -108,6 +112,7 @@ rules:
- nodes
verbs:
- list
{{ end }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
@@ -370,6 +375,7 @@ subjects:
- kind: ServiceAccount
name: seldon-scheduler
namespace: seldon-mesh
{{ if .Values.hodometer.enabled }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
@@ -384,6 +390,7 @@ subjects:
- kind: ServiceAccount
name: hodometer
namespace: seldon-mesh
{{ end }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
@@ -675,6 +682,7 @@ spec:
- containerPort: 9003
name: envoy-admin
terminationGracePeriodSeconds: 5
{{ if .Values.hodometer.enabled }}
---
apiVersion: apps/v1
kind: Deployment
@@ -715,6 +723,7 @@ spec:
runAsUser: 8888
serviceAccountName: hodometer
terminationGracePeriodSeconds: 5
{{ end }}
---
apiVersion: apps/v1
kind: Deployment
1 change: 1 addition & 0 deletions k8s/helm-charts/seldon-core-v2-setup/values.yaml
@@ -4,6 +4,7 @@ opentelemetry:
ratio: 1

hodometer:
enabled: true
image:
pullPolicy: IfNotPresent
registry: docker.io