Add HPA support to tei, teireranking, tgi services
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
byako committed Aug 22, 2024
1 parent c565909 commit ecedfb4
Showing 22 changed files with 1,219 additions and 17 deletions.
43 changes: 38 additions & 5 deletions helm-charts/chatqna/README.md
@@ -34,6 +38,38 @@ helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --

1. Make sure your `MODELDIR` exists on the node where your workload is scheduled so you can cache the downloaded model for the next use. Otherwise, set `global.modelUseHostPath` to 'null' if you don't want to cache the model.

## HorizontalPodAutoscaler (HPA) support

The `horizontalPodAutoscaler` option enables HPA scaling for the TGI and TEI inferencing deployments:
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

Autoscaling is based on custom application metrics provided through [Prometheus](https://prometheus.io/).

### Pre-conditions

If the cluster does not yet run the [Prometheus operator](https://github.com/prometheus-operator/kube-prometheus),
it should be installed before enabling HPA, e.g. by using:
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
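
For example, a minimal sketch of installing that chart into the `monitoring` namespace (the release
name `prometheus` is only a placeholder, adjust the values to your cluster):

```console
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# installs Prometheus operator, Prometheus, exporters etc. into the "monitoring" namespace
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
```

Depending on the chosen install method, `PrometheusAdapter` (which serves the custom metrics used
here) may need to be installed separately, e.g. from the `prometheus-community/prometheus-adapter` chart.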

`horizontalPodAutoscaler` is enabled in the top-level Helm chart (e.g. `chatqna`), so that the
relevant custom metric queries are configured for PrometheusAdapter, as sketched below.
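
As a sketch, HPA support could then be enabled at install time with a values override (other flags
from the install example above, such as the HF token setting, are omitted here):

```console
helm install chatqna chatqna --set global.horizontalPodAutoscaler.enabled=true
```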

### Gotchas

Why HPA is opt-in:

- Enabling the chart's `horizontalPodAutoscaler` option will _overwrite_ the cluster's current
  `PrometheusAdapter` configuration with its own custom metrics configuration.
  Take a copy of the existing one before install if that matters:
  `kubectl -n monitoring get cm/adapter-config -o yaml > adapter-config.yaml`
- `PrometheusAdapter` needs to be restarted after install for it to read the new configuration:
  `ns=monitoring; kubectl -n $ns delete $(kubectl -n $ns get pod --selector app.kubernetes.io/name=prometheus-adapter -o name)`
- By default, Prometheus adds [k8s RBAC rules](https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/prometheus-roleBindingSpecificNamespaces.yaml)
  for accessing metrics from the `default`, `kube-system` and `monitoring` namespaces. If Helm is
  asked to install OPEA services into some other namespace, those rules need to be updated accordingly.
- The provided HPA rules are examples for Xeon; for efficient scaling they need to be fine-tuned for
  the given setup (underlying HW, used models, OPEA version, etc.).

## Verify

To verify the installation, run the command `kubectl get pod` to make sure all pods are running.
@@ -83,8 +115,9 @@ Access `http://localhost:5174` to play with the ChatQnA workload through UI.

## Values

| Key | Type | Default | Description |
| ---------------- | ------ | ----------------------------- | ------------------------------------------------------------------------ |
| image.repository | string | `"opea/chatqna"` | |
| service.port | string | `"8888"` | |
| tgi.LLM_MODEL_ID | string | `"Intel/neural-chat-7b-v3-3"` | Models id from https://huggingface.co/, or predownloaded model directory |
| Key | Type | Default | Description |
| -------------------------------------- | ------ | ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| image.repository | string | `"opea/chatqna"` | |
| service.port | string | `"8888"` | |
| tgi.LLM_MODEL_ID                       | string | `"Intel/neural-chat-7b-v3-3"` | Model ID from https://huggingface.co/, or a pre-downloaded model directory                                                                                                                                                               |
| global.horizontalPodAutoscaler.enabled | bool   | false                         | Enable HPA autoscaling for the TGI and TEI service deployments, based on metrics they provide. See #pre-conditions and #gotchas before enabling! (If one of them should not be scaled, set that service's `maxReplicas` to `1`)          |
53 changes: 53 additions & 0 deletions helm-charts/chatqna/templates/customMetrics.yaml
@@ -0,0 +1,53 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

{{- if .Values.global.horizontalPodAutoscaler.enabled }}
apiVersion: v1
data:
  config.yaml: |
    rules:
    - seriesQuery: '{__name__="tgi_request_inference_duration_sum",service="{{ include "tgi.fullname" .Subcharts.tgi }}"}'
      # Average request latency from TGI histograms, over 1 min
      # (0.001 is added to the divider to make sure there's always a valid value)
      metricsQuery: 'rate(tgi_request_inference_duration_sum{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(tgi_request_inference_duration_count{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>}[1m]))'
      name:
        matches: ^tgi_request_inference_duration_sum
        as: "tgi_request_latency"
      resources:
        # HPA needs both namespace + suitable object resource for its query paths:
        # /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/*/tgi_request_latency
        # (pod is not a suitable object type for matching as each instance has a different name)
        overrides:
          namespace:
            resource: namespace
          service:
            resource: service
    - seriesQuery: '{__name__="te_request_inference_duration_sum",service="{{ include "teirerank.fullname" .Subcharts.teirerank }}"}'
      # Average request latency from TEI histograms, over 1 min
      metricsQuery: 'rate(te_request_inference_duration_sum{service="{{ include "teirerank.fullname" .Subcharts.teirerank }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(te_request_inference_duration_count{service="{{ include "teirerank.fullname" .Subcharts.teirerank }}",<<.LabelMatchers>>}[1m]))'
      name:
        matches: ^te_request_inference_duration_sum
        as: "reranking_request_latency"
      resources:
        overrides:
          namespace:
            resource: namespace
          service:
            resource: service
    - seriesQuery: '{__name__="te_request_inference_duration_sum",service="{{ include "tei.fullname" .Subcharts.tei }}"}'
      # Average request latency from TEI histograms, over 1 min
      metricsQuery: 'rate(te_request_inference_duration_sum{service="{{ include "tei.fullname" .Subcharts.tei }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(te_request_inference_duration_count{service="{{ include "tei.fullname" .Subcharts.tei }}",<<.LabelMatchers>>}[1m]))'
      name:
        matches: ^te_request_inference_duration_sum
        as: "embedding_request_latency"
      resources:
        overrides:
          namespace:
            resource: namespace
          service:
            resource: service
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: monitoring
{{- end }}
8 changes: 8 additions & 0 deletions helm-charts/chatqna/values.yaml
@@ -48,3 +48,11 @@ global:
  modelUseHostPath: ""
  # modelUseHostPath: /mnt/opea-models
  # modelUsePVC: model-volume

  # Enabling HorizontalPodAutoscaler (HPA) will:
  # - Overwrite existing PrometheusAdapter "adapter-config" configMap with ChatQnA specific custom metric queries
  #   for embedding, reranking, tgi services
  # Upstream default configMap:
  # - https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/deploy/manifests/config-map.yaml
  horizontalPodAutoscaler:
    enabled: false
74 changes: 68 additions & 6 deletions helm-charts/common/tei/README.md
@@ -21,6 +21,38 @@ MODELDIR=/mnt/opea-models

MODELNAME="/data/BAAI/bge-base-en-v1.5"

## HorizontalPodAutoscaler (HPA) support

The `horizontalPodAutoscaler` option enables HPA scaling for the deployment:
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

Autoscaling is based on custom application metrics provided through [Prometheus](https://prometheus.io/).

### Pre-conditions

If the cluster does not yet run the [Prometheus operator](https://github.com/prometheus-operator/kube-prometheus),
it should be installed before enabling HPA, e.g. by using:
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack

Enable `horizontalPodAutoscaler` in the top-level Helm chart that depends on this component (e.g. `chatqna`),
so that the relevant custom metric queries are configured for PrometheusAdapter.

### Gotchas

Why HPA is opt-in:

- Enabling the chart's `horizontalPodAutoscaler` option will _overwrite_ the cluster's current
  `PrometheusAdapter` configuration with its own custom metrics configuration.
  Take a copy of the existing one before install if that matters:
  `kubectl -n monitoring get cm/adapter-config -o yaml > adapter-config.yaml`
- `PrometheusAdapter` needs to be restarted after install for it to read the new configuration:
  `ns=monitoring; kubectl -n $ns delete $(kubectl -n $ns get pod --selector app.kubernetes.io/name=prometheus-adapter -o name)`
- By default, Prometheus adds [k8s RBAC rules](https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/prometheus-roleBindingSpecificNamespaces.yaml)
  for accessing metrics from the `default`, `kube-system` and `monitoring` namespaces. If Helm is
  asked to install OPEA services into some other namespace, those rules need to be updated accordingly.
- The provided HPA rules are examples for Xeon; for efficient scaling they need to be fine-tuned for
  the given setup (underlying HW, used models, OPEA version, etc.). One such knob is sketched below.
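
For example, when this chart is used as a `chatqna` subchart, the scaling ceiling could be adjusted
from the top level together with enabling HPA (a sketch; the value `4` is only illustrative):

```console
helm install chatqna chatqna --set global.horizontalPodAutoscaler.enabled=true \
  --set tei.horizontalPodAutoscaler.maxReplicas=4
```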

## Verify

To verify the installation, run the command `kubectl get pod` to make sure all pods are running.
@@ -33,11 +65,41 @@ Open another terminal and run the following command to verify the service is working:
curl http://localhost:2081/embed -X POST -d '{"inputs":"What is Deep Learning?"}' -H 'Content-Type: application/json'
```

### Verify HPA metrics

To verify that the metrics required by the horizontalPodAutoscaler option work, check the following.

Prometheus has found the metric endpoints, i.e. the last number on the line is non-zero:

```console
prom_url=http://$(kubectl -n monitoring get -o jsonpath="{.spec.clusterIP}:{.spec.ports[0].port}" svc/prometheus-k8s)
curl --no-progress-meter $prom_url/metrics | grep scrape_pool_targets.*tei
```
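
If the target count stays at zero, it is worth checking that the chart's `ServiceMonitor` object
was created and picked up by the Prometheus operator (a possible debugging step):

```console
# ServiceMonitor objects created by the OPEA charts for HPA metrics scraping
kubectl get servicemonitors -A | grep -e tei -e tgi
```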

The Prometheus adapter provides custom metrics based on their data:

```console
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .resources[].name
```

And those custom metrics have valid values for HPA rules:

```console
ns=default; # OPEA namespace
url=/apis/custom.metrics.k8s.io/v1beta1;
for m in $(kubectl get --raw $url | jq .resources[].name | cut -d/ -f2 | sort -u | tr -d '"'); do
kubectl get --raw $url/namespaces/$ns/metrics/$m | jq;
done | grep -e metricName -e value
```

NOTE: HuggingFace TGI and TEI services provide a metrics endpoint only after they have processed their first request!
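
Once the first requests have been served and the PrometheusAdapter restarted, the HPA objects
themselves should report a current metric value instead of `<unknown>` (a quick sanity check):

```console
# HPA objects created by the charts, with their current and target metric values
kubectl get hpa
```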

## Values

| Key | Type | Default | Description |
| ----------------------- | ------ | ------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| EMBEDDING_MODEL_ID | string | `"BAAI/bge-base-en-v1.5"` | Models id from https://huggingface.co/, or predownloaded model directory |
| global.modelUseHostPath | string | `"/mnt/opea-models"` | Cached models directory, tei will not download if the model is cached here. The host path "modelUseHostPath" will be mounted to container as /data directory. Set this to null/empty will force it to download model. |
| image.repository | string | `"ghcr.io/huggingface/text-embeddings-inference"` | |
| image.tag | string | `"cpu-1.5"` | |
| Key | Type | Default | Description |
| ------------------------------- | ------ | ------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| EMBEDDING_MODEL_ID              | string | `"BAAI/bge-base-en-v1.5"`                         | Model ID from https://huggingface.co/, or a pre-downloaded model directory                                                                                                                                                |
| global.modelUseHostPath         | string | `"/mnt/opea-models"`                              | Cached models directory; tei will not download the model if it is cached here. The host path "modelUseHostPath" will be mounted into the container as the /data directory. Setting this to null/empty forces a download. |
| image.repository | string | `"ghcr.io/huggingface/text-embeddings-inference"` | |
| image.tag | string | `"cpu-1.5"` | |
| horizontalPodAutoscaler.enabled | bool | false | Enable HPA autoscaling for the service deployments based on metrics it provides. See #pre-conditions and #gotchas before enabling! |
7 changes: 7 additions & 0 deletions helm-charts/common/tei/templates/deployment.yaml
@@ -8,7 +8,10 @@ metadata:
  labels:
    {{- include "tei.labels" . | nindent 4 }}
spec:
  # use explicit replica count only if HorizontalPodAutoscaler is disabled
  {{- if not .Values.global.horizontalPodAutoscaler.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "tei.selectorLabels" . | nindent 6 }}
@@ -102,3 +105,7 @@ spec:
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- if .Values.global.horizontalPodAutoscaler.enabled }}
      # extra time to finish processing buffered requests before HPA forcibly terminates pod
      terminationGracePeriodSeconds: 60
      {{- end }}
51 changes: 51 additions & 0 deletions helm-charts/common/tei/templates/horizontalPodAutoscaler.yaml
@@ -0,0 +1,51 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

{{- if .Values.global.horizontalPodAutoscaler.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "tei.fullname" . }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "tei.fullname" . }}
  minReplicas: 1
  maxReplicas: {{ .Values.horizontalPodAutoscaler.maxReplicas }}
  metrics:
  - type: Object
    object:
      metric:
        # tei-embedding time metrics are in seconds
        name: embedding_request_latency
      describedObject:
        apiVersion: v1
        # get metric for named object of given type (in same namespace)
        kind: Service
        name: {{ include "tei.fullname" . }}
      target:
        # embedding_request_latency is average for all TEI pods. To avoid replica fluctuations when
        # TEI startup + request processing takes longer than HPA evaluation period, this uses
        # "Value" (replicas = metric.value / target.value), instead of "averageValue" type:
        # https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details
        type: Value
        value: 4
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 180
      policies:
      - type: Percent
        value: 25
        periodSeconds: 15
    scaleUp:
      selectPolicy: Max
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 50
        periodSeconds: 15
      - type: Pods
        value: 2
        periodSeconds: 15
{{- end }}
17 changes: 17 additions & 0 deletions helm-charts/common/tei/templates/servicemonitor.yaml
@@ -0,0 +1,17 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

{{- if .Values.global.horizontalPodAutoscaler.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "tei.fullname" . }}
spec:
  selector:
    matchLabels:
      {{- include "tei.selectorLabels" . | nindent 6 }}
  endpoints:
  - interval: 4s
    port: tei
    scheme: http
{{- end }}
9 changes: 9 additions & 0 deletions helm-charts/common/tei/values.yaml
@@ -7,6 +7,9 @@

replicaCount: 1

horizontalPodAutoscaler:
  maxReplicas: 2

port: 2081
shmSize: 1Gi
EMBEDDING_MODEL_ID: "BAAI/bge-base-en-v1.5"
@@ -92,3 +95,9 @@ global:
  # By default, both var are set to empty, the model will be downloaded and saved to a tmp volume.
  modelUseHostPath: ""
  modelUsePVC: ""
  # Enabling HPA will:
  # - Ignore above replica count, as it will be controlled by HPA
  # - Add example HPA scaling rules with thresholds suitable for Xeon deployments
  # - Require custom metrics ConfigMap available in the main application chart
  horizontalPodAutoscaler:
    enabled: false