Support alternative metrics on accelerated TGI / TEI instances #454

Merged
merged 2 commits into opea-project:main from hpa-gaudi
Sep 27, 2024

Conversation

eero-t
Contributor

@eero-t eero-t commented Sep 23, 2024

Description

Changes:

  • Provide accelDevice Helm variables also in other charts using accelerated TGI / TEI
  • Use different custom / HPA metrics depending on whether TGI / TEI is accelerated

The queue size metric is a bit better suited for scaling TGI Gaudi instances than the metric used for Xeon scaling, and its queries are simpler. The HPA rule updates also act as examples of using different HPA target value types (Value vs. AverageValue).
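
For reference, a minimal sketch of those two target types as they appear in a Kubernetes autoscaling/v2 HPA spec; the metric and object names below are illustrative placeholders, not the charts' actual rules:

metrics:
- type: Object                  # one metric value for a whole object (e.g. a Service)
  object:
    metric:
      name: tgi_queue_size      # placeholder custom metric name
    describedObject:
      apiVersion: v1
      kind: Service
      name: tgi
    target:
      type: Value               # compare the raw metric value against the target
      value: "10"
- type: Pods                    # metric collected separately for each pod
  pods:
    metric:
      name: tgi_request_latency # placeholder custom metric name
    target:
      type: AverageValue        # average of the per-pod values; the only target type allowed for Pods metrics
      averageValue: "5"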

Issues

n/a.

Type of change

  • New feature (non-breaking change which adds new functionality)

Dependencies

n/a.

Tests

Manually tested HPA scaling of 1-4 TGI Gaudi instances (4 = maxReplicas value), using different TGI-provided metrics and batch sizes.

PS. None of the TGI / TEI metrics seem that good for scaling; their values mostly indicate just whether a given component is stressed, not how slow the resulting (e.g. ChatQnA) queries are. This means the HPA scaling rule causes a scale-up when the components get loaded, and drops the extra instances when the load goes almost completely away, but that scaling can be quite far from a truly optimal one.

Something like opea-project/GenAIExamples#391 (and fine-tuning the HPA values for a given model, data type, OPEA version etc.) would be needed for more optimal scaling.

@lianhao
Collaborator

lianhao commented Sep 24, 2024

@eero-t could you double-check the CI failure on Xeon with cpu-values.yaml? You could try to reproduce and debug it locally with helm install chatqna . -f cpu-values.yaml to install ChatQnA on a Xeon platform, and helm test chatqna to figure out what is failing.

@eero-t
Contributor Author

eero-t commented Sep 24, 2024

The failure in the CI CPU values file test is in the TGI service response check: https://github.com/opea-project/GenAIInfra/actions/runs/10991943008/job/30556825567?pr=451

This PR changes only HPA files, and CI skips HPA testing, so the reason for the failure is some change done before this PR.

What the CPU values file does

The CPU values file overrides the deployment probe timings and resources from the defaults: https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/tgi/values.yaml#L58

This is done so that the ChatQnA service can be scaled better without failing, as the scheduler knows how many resources the pods need, and kubelet uses a higher QoS class for them: https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/

The increased probe timings cannot be the cause of the failure, and the memory limits should be large enough not to be an issue either. The CPU limits are quite tight though (8 CPUs), to allow running multiple TGI instances on the same node, e.g. with NRI policies.
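
(For illustration, the kind of overrides meant here could look like the sketch below; only the 8-CPU limit is from the text above, the other keys and numbers are assumptions, the real ones are in the chart's values files.)

resources:
  requests:
    cpu: 8           # lets the scheduler reserve the CPUs the instance needs
    memory: 64Gi     # assumed value
  limits:
    cpu: 8           # tight on purpose, so that several TGI instances fit on one node
    memory: 64Gi
startupProbe:
  periodSeconds: 5
  failureThreshold: 180   # assumed; model load on CPU takes much longer than on Gaudi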

Reason for the failure

The CI test checks the TGI log output, but appears to be doing it too soon.

The CPU limits can explain a slowdown compared to the default (no resources specified), i.e. a situation where only a single TGI instance is running and the node has plenty of resources free.

But the CPU values file passed CI earlier, so something else has changed since it was merged; either in the CI setup, or in the TGI setup (other than the values specified in the CPU file).

Potential workarounds

Things that can be done (in some other PR):

  • CI:
    • increase the time allowed for checking the TGI log, or
    • return the CI setup to an earlier state where TGI startup was faster
  • CPU values file:
    • increase (TGI) CPU resources

@yongfengdu
Collaborator

Another possibility is the latest image update from Docker Hub, which might cause different behavior.
The returned output seems correct; however, the helm test for llm-uservice failed.
This needs further investigation.

@eero-t
Contributor Author

eero-t commented Sep 24, 2024

I reviewed the tests that were run for all PRs merged since cpu-values.yaml was added in #386.

=> This and #451 are the first PRs for which CI tested cpu-values.yaml, despite there being several earlier merged PRs which could already have broken things for it, as they updated CPU / Xeon related files in ChatQnA and its dependencies (e.g. #444).

Additionally, when looking at the passing Helm E2E tests in that merged #386 PR, e.g. for "xeon, values, common": https://github.com/opea-project/GenAIInfra/actions/runs/10800184120/job/29957611425?pr=386

There are a lot of "Couldn't connect to server" errors (e.g. for data-prep), but the test passed despite failures in functionality not touched by that PR.

Then, when looking at the same passing test in later PRs, the tests no longer log what they are doing: https://github.com/opea-project/GenAIInfra/actions/runs/10940026240/job/30371535827?pr=444

I.e. it seems that the CI tests cannot be trusted to guarantee that any of the functionality works, or to catch when that functionality actually broke.

=> Somebody (not me) needs to bisect repo content in the CI environment to see when things actually broke.

@eero-t
Contributor Author

eero-t commented Sep 24, 2024

Additionally, the E2E tests pass even if the log file content is wrong, when the log file has a wrong name, see:
https://github.com/opea-project/GenAIInfra/blob/main/.github/workflows/_helm-e2e.yaml#L142
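
(For illustration, a hypothetical stricter workflow step that would fail on a missing log file instead of silently skipping the content check; the names here are placeholders, not the actual workflow content:)

- name: Verify expected test log exists
  shell: bash
  run: |
    # EXPECTED_LOG stands for whatever log file name the workflow expects
    if [ ! -f "${EXPECTED_LOG}" ]; then
      echo "ERROR: expected log file ${EXPECTED_LOG} not found"
      exit 1
    fi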

@lianhao
Collaborator

lianhao commented Sep 25, 2024

Additionally, when looking at the passing Helm E2E tests in that merged #386 PR, e.g. for "xeon, values, common": https://github.com/opea-project/GenAIInfra/actions/runs/10800184120/job/29957611425?pr=386

There are a lot of "Couldn't connect to server" errors (e.g. for data-prep), but the test passed despite failures in functionality not touched by that PR.

This is a workaround in the test pod for cases where some OPEA services' probe health check status does not get reflected in the K8s service availability quickly enough. Sleeping for a short period of time between 'helm install --wait' and 'helm test' could be another option.

Then, when looking at the same passing test in later PRs, the tests no longer log what they are doing: https://github.com/opea-project/GenAIInfra/actions/runs/10940026240/job/30371535827?pr=444

What the test does is defined in each Helm chart's templates/tests/test-pod.yaml. The log only shows the test output.

I.e. it seems that the CI tests cannot be trusted to guarantee that any of the functionality works, or to catch when that functionality actually broke.

@lianhao
Collaborator

lianhao commented Sep 25, 2024

OK, I root-caused this issue. GenAIComps PR 608 plus cpu-values significantly increases the test run time, which triggers the default 5m timeout of helm test. Please wait for PR #456 to land first.

DocSum defaults to the same model as ChatQnA, and the default model used by
CodeGen + CodeTrans is also a 7b one, so the tgi.accelDevice impact is
assumed to be close enough.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
@eero-t
Contributor Author

eero-t commented Sep 25, 2024

OK, I root-caused this issue. GenAIComps PR 608 plus cpu-values significantly increases the test run time, which triggers the default 5m timeout of helm test. Please wait for PR #456 to land first.

How much did cpu-values.yaml increase the test time with PR 608? If it's really significant, it would be better to (also) raise the CPU limits.

(The values I've used may be too small. At that time I had only a couple of Xeon nodes for scaling testing, and smaller values allowed stuffing more instances onto a single Xeon node. However, I also used NRI to isolate pods better, which is not the default k8s setup...)

@eero-t
Contributor Author

eero-t commented Sep 25, 2024

This is a workaround in the test pod for cases where some OPEA services' probe health check status does not get reflected in the K8s service availability quickly enough.

Tests should be predictable and fail on errors, not paper over them.

Sleeping for a short period of time between 'helm install --wait' and 'helm test' could be another option.

The script should wait until the pod is ready to accept input before trying to use it, or fail the test if the wait times out.

For example, this fails unless there are matching TGI pods and all of them reach the Ready state within the given timeout:

if ! kubectl wait --for=condition=Ready --timeout=60s --selector=app.kubernetes.io/name=tgi pod; then
  echo "ERROR: TGI pods not ready within 60s"
  exit 1
fi

What the test does is defined in each Helm chart's templates/tests/test-pod.yaml. The log only shows the test output.

That's not really good enough. One would need to dig out all the relevant test template files from the given PR branch, go through all the relevant values files, and deduce the results for all variables & conditionals in the templates. That's way too much work and error-prone.

Tests should log what they're actually testing, otherwise they're not worth much. Without logging, it's a nightmare to verify that they correctly test everything that's relevant (and preferably skip completely irrelevant things), and to debug failures.

@lianhao
Collaborator

lianhao commented Sep 25, 2024

The script should wait until the pod is ready to accept input before trying to use it, or fail the test if the wait times out.

For example, this fails unless there are matching TGI pods and all of them reach the Ready state within the given timeout:

if ! kubectl wait --for=condition=Ready --timeout=60s --selector=app.kubernetes.io/name=tgi pod; then
  echo "ERROR: TGI pods not ready within 60s"
  exit 1
fi

the current helm install --wait actually already does exactly the same as the above, but for all pods, I think.

Tests should log what they're actually testing, otherwise they're not worth much. Without logging, it's a nightmare to verify that they correctly test everything that's relevant (and preferably skip completely irrelevant things), and to debug failures.

That can be done in another PR.

@eero-t
Contributor Author

eero-t commented Sep 25, 2024

the current helm install --wait actually already does exactly the same as the above, but for all pods, I think.

Oh, that also waits for the Ready state. Yes, that would work:

$ helm install -h
...
--wait if set, will wait until all Pods, PVCs, Services, and minimum number of Pods of a Deployment, StatefulSet, or ReplicaSet are in a ready state before marking the release as successful. It will wait for as long as --timeout

EDIT: compared to Helm --wait waiting until everything is ready, kubectl waiting only for the to-be-used service to become ready before using it might speed CI up a bit. It's more error-prone though [1], so I think helm install --wait is the better idea.

[1] As a service might fail due to another service it uses not being ready yet, and those dependencies, or their names, (potentially) change based on which Helm chart is being tested.

Collaborator

@yongfengdu yongfengdu left a comment

It looks to me like we shouldn't use "helm install -f cpu-values.yaml" for the test, since cpu-values.yaml is for HPA usage.
Maybe we can consider renaming it to cpu-hpa-values.yaml or something.

@yongfengdu yongfengdu merged commit cdd3585 into opea-project:main Sep 27, 2024
24 checks passed
@eero-t
Contributor Author

eero-t commented Sep 30, 2024

It looks to me like we shouldn't use "helm install -f cpu-values.yaml" for the test, since cpu-values.yaml is for HPA usage. Maybe we can consider renaming it to cpu-hpa-values.yaml or something.

No.

@yongfengdu cpu-values.yaml is for running OPEA services on CPUs; it's not (specifically) for HPA. The K8s scheduler can reliably assign pods to suitable nodes only if the pod spec states how many resources a given pod's containers need.

Using cpu-values.yaml is recommended for HPA because the lack of suitable resource requests (and realistic probe timings) becomes more visible the more badly specified pods with huge resource usage there are. However, HPA scale-up is just one of the reasons for there being more such pods; those extra pods could e.g. belong to some completely different service, or to other (people's / departments') ChatQnA instances.

In addition to pods without appropriate resource requests not getting scheduled properly, they also go to the lowest QoS class, i.e. they get throttled and evicted from the nodes first.
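
(As a reminder of how that mapping works, a minimal container resource-spec sketch; the numbers are arbitrary:)

resources:           # requests == limits for every container -> Guaranteed QoS
  requests:
    cpu: 8
    memory: 64Gi
  limits:
    cpu: 8
    memory: 64Gi
# requests set lower than limits -> Burstable QoS
# no requests or limits at all   -> BestEffort QoS, throttled and evicted first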

PS. cpu-values.yaml is just a quick workaround for ChatQnA; the real solution for the problem is discussed in #431.

@eero-t eero-t deleted the hpa-gaudi branch October 18, 2024 15:52