Support alternative metrics on accelerated TGI / TEI instances #454

Merged
merged 2 commits into opea-project:main from hpa-gaudi
Sep 27, 2024

Conversation

eero-t
Contributor

@eero-t eero-t commented Sep 23, 2024

Description

Changes:

  • Provide accelDevice Helm variables also in other charts using accelerated TGI / TEI
  • Use different custom / HPA metrics depending on whether TGI / TEI is accelerated

The queue size metric is a bit better suited for scaling TGI Gaudi instances than the metric used for Xeon scaling, and its queries are simpler. The HPA rule updates also act as examples of using different HPA target value types (Value vs. AverageValue).
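
For reference, a minimal sketch of those two target types as they appear in a Kubernetes autoscaling/v2 HPA spec; the metric and object names below are illustrative placeholders, not the charts' actual rules:

metrics:
- type: Object                  # one metric value for a whole object (e.g. a Service)
  object:
    metric:
      name: tgi_queue_size      # placeholder custom metric name
    describedObject:
      apiVersion: v1
      kind: Service
      name: tgi
    target:
      type: Value               # compare the raw metric value against the target
      value: "10"
- type: Pods                    # metric collected separately for each pod
  pods:
    metric:
      name: tgi_request_latency # placeholder custom metric name
    target:
      type: AverageValue        # average of the per-pod values; the only target type allowed for Pods metrics
      averageValue: "5"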

Issues

n/a.

Type of change

  • New feature (non-breaking change which adds new functionality)

Dependencies

n/a.

Tests

Manually tested HPA scaling of 1-4 TGI Gaudi instances (4 = maxReplicas value), using different TGI-provided metrics and batch sizes.

PS. None of the TGI / TEI metrics seem that good for scaling; their values mostly indicate just whether a given component is stressed, not how slow the resulting (e.g. ChatQnA) queries are. This means the HPA scaling rule causes a scale-up when the components get loaded, and drops the extra instances when the load goes almost completely away, but that scaling can be quite far from a truly optimal one.

Something like opea-project/GenAIExamples#391 (and fine-tuning the HPA values for a given model, data type, OPEA version etc.) would be needed for more optimal scaling.

@lianhao
Collaborator

lianhao commented Sep 24, 2024

@eero-t could you double-check the CI failure on Xeon with cpu-values.yaml? You could try to reproduce and debug it locally with helm install chatqna . -f cpu-values.yaml to install ChatQnA on a Xeon platform, and helm test chatqna to figure out what is failing.

@eero-t
Contributor Author

eero-t commented Sep 24, 2024

The failure in the CI CPU values file test is in the TGI service response check: https://github.com/opea-project/GenAIInfra/actions/runs/10991943008/job/30556825567?pr=451

This PR changes only HPA files, and CI skips HPA testing, so the reason for the failure is some change done before this PR.

What the CPU values file does

The CPU values file overrides the deployment probe timings and resources from the defaults: https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/tgi/values.yaml#L58

This is done so that the ChatQnA service can be scaled better without failing, as the scheduler knows how many resources the pods need, and kubelet uses a higher QoS class for them: https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/

The increased probe timings cannot be the cause of the failure, and the memory limits should be large enough not to be an issue either. The CPU limits are quite tight though (8 CPUs), to allow running multiple TGI instances on the same node, e.g. with NRI policies.
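
(For illustration, the kind of overrides meant here could look like the sketch below; only the 8-CPU limit is from the text above, the other keys and numbers are assumptions, the real ones are in the chart's values files.)

resources:
  requests:
    cpu: 8           # lets the scheduler reserve the CPUs the instance needs
    memory: 64Gi     # assumed value
  limits:
    cpu: 8           # tight on purpose, so that several TGI instances fit on one node
    memory: 64Gi
startupProbe:
  periodSeconds: 5
  failureThreshold: 180   # assumed; model load on CPU takes much longer than on Gaudi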

Reason for the failure

The CI test checks the TGI log output, but appears to be doing it too soon.

The CPU limits can explain a slowdown compared to the default (no resources specified), i.e. a situation where only a single TGI instance is running and the node has plenty of resources free.

But the CPU values file passed CI earlier, so something else has changed since it was merged; either in the CI setup, or in the TGI setup (other than the values specified in the CPU file).

Potential workarounds

Things that can be done (in some other PR):

  • CI:
    • increase the time allowed for checking the TGI log, or
    • return the CI setup to an earlier state where TGI startup was faster
  • CPU values file:
    • increase (TGI) CPU resources

@yongfengdu
Collaborator

Another possibility is the latest image update from Docker Hub, which might cause different behavior.
The returned output seems correct; however, the helm test for llm-uservice failed.
This needs further investigation.

@eero-t
Contributor Author

eero-t commented Sep 24, 2024

I reviewed the tests that were run for all PRs merged since cpu-values.yaml was added in #386.

=> This and #451 are the first PRs for which CI tested cpu-values.yaml, despite there being several earlier merged PRs which could already have broken things for it, as they updated CPU / Xeon related files in ChatQnA and its dependencies (e.g. #444).

Additionally, when looking at the passing Helm E2E tests in that merged #386 PR, e.g. for "xeon, values, common": https://github.com/opea-project/GenAIInfra/actions/runs/10800184120/job/29957611425?pr=386

There are a lot of "Couldn't connect to server" errors (e.g. for data-prep), but the test passed despite failures in functionality not touched by that PR.

Then, when looking at the same passing test in later PRs, the tests no longer log what they are doing: https://github.com/opea-project/GenAIInfra/actions/runs/10940026240/job/30371535827?pr=444

I.e. it seems that the CI tests cannot be trusted to guarantee that any of the functionality works, or to catch when that functionality actually broke.

=> Somebody (not me) needs to bisect repo content in the CI environment to see when things actually broke.

@eero-t
Contributor Author

eero-t commented Sep 24, 2024

Additionally, the E2E tests pass even if the log file content is wrong, when the log file has a wrong name, see:
https://github.com/opea-project/GenAIInfra/blob/main/.github/workflows/_helm-e2e.yaml#L142
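
(For illustration, a hypothetical stricter workflow step that would fail on a missing log file instead of silently skipping the content check; the names here are placeholders, not the actual workflow content:)

- name: Verify expected test log exists
  shell: bash
  run: |
    # EXPECTED_LOG stands for whatever log file name the workflow expects
    if [ ! -f "${EXPECTED_LOG}" ]; then
      echo "ERROR: expected log file ${EXPECTED_LOG} not found"
      exit 1
    fi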

@lianhao
Collaborator

lianhao commented Sep 25, 2024

Additionally, when looking at the passing Helm E2E tests in that merged #386 PR, e.g. for "xeon, values, common": https://github.com/opea-project/GenAIInfra/actions/runs/10800184120/job/29957611425?pr=386

There are a lot of "Couldn't connect to server" errors (e.g. for data-prep), but the test passed despite failures in functionality not touched by that PR.

This is a workaround in the test pod for cases where some OPEA services' probe health check status does not get reflected in the K8s service availability quickly enough. Sleeping for a short period of time between 'helm install --wait' and 'helm test' could be another option.

Then, when looking at the same passing test in later PRs, the tests no longer log what they are doing: https://github.com/opea-project/GenAIInfra/actions/runs/10940026240/job/30371535827?pr=444

What the test does is defined in each Helm chart's templates/tests/test-pod.yaml. The log only shows the test output.

I.e. it seems that the CI tests cannot be trusted to guarantee that any of the functionality works, or to catch when that functionality actually broke.

@lianhao
Collaborator

lianhao commented Sep 25, 2024

OK, I root-caused this issue. GenAIComps PR 608 plus cpu-values significantly increases the test run time, which triggers the default 5m timeout of helm test. Please wait for PR #456 to land first.

DocSum defaults to the same model as ChatQnA, and the default model used by
CodeGen + CodeTrans is also a 7b one, so the tgi.accelDevice impact is
assumed to be close enough.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
@eero-t
Contributor Author

eero-t commented Sep 25, 2024

OK, I root-caused this issue. GenAIComps PR 608 plus cpu-values significantly increases the test run time, which triggers the default 5m timeout of helm test. Please wait for PR #456 to land first.

How much did cpu-values.yaml increase the test time with PR 608? If it's really significant, it would be better to (also) raise the CPU limits.

(The values I've used may be too small. At that time I had only a couple of Xeon nodes for scaling testing, and smaller values allowed stuffing more instances onto a single Xeon node. However, I also used NRI to isolate pods better, which is not the default k8s setup...)

@eero-t
Contributor Author

eero-t commented Sep 25, 2024

This is a workaround in the test pod for cases where some OPEA services' probe health check status does not get reflected in the K8s service availability quickly enough.

Tests should be predictable and fail on errors, not paper over them.

Sleeping for a short period of time between 'helm install --wait' and 'helm test' could be another option.

The script should wait until the pod is ready to accept input before trying to use it, or fail the test if the wait times out.

For example, this fails unless there are matching TGI pods and all of them reach the Ready state within the given timeout:

if ! kubectl wait --for=condition=Ready --timeout=60s --selector=app.kubernetes.io/name=tgi pod; then
  echo "ERROR: TGI pods not ready within 60s"
  exit 1
fi

What the test does is defined in each Helm chart's templates/tests/test-pod.yaml. The log only shows the test output.

That's not really good enough. One would need to dig out all the relevant test template files from the given PR branch, go through all the relevant values files, and deduce the results for all variables & conditionals in the templates. That's way too much work and error-prone.

Tests should log what they're actually testing, otherwise they're not worth much. Without logging, it's a nightmare to verify that they correctly test everything that's relevant (and preferably skip completely irrelevant things), and to debug failures.

@lianhao
Collaborator

lianhao commented Sep 25, 2024

The script should wait until the pod is ready to accept input before trying to use it, or fail the test if the wait times out.

For example, this fails unless there are matching TGI pods and all of them reach the Ready state within the given timeout:

if ! kubectl wait --for=condition=Ready --timeout=60s --selector=app.kubernetes.io/name=tgi pod; then
  echo "ERROR: TGI pods not ready within 60s"
  exit 1
fi

the current helm install --wait actually already does exactly the same as the above, but for all pods, I think.

Tests should log what they're actually testing, otherwise they're not worth much. Without logging, it's a nightmare to verify that they correctly test everything that's relevant (and preferably skip completely irrelevant things), and to debug failures.

That can be done in another PR.

@eero-t
Contributor Author

eero-t commented Sep 25, 2024

the current helm install --wait actually already does exactly the same as the above, but for all pods, I think.

Oh, that also waits for the Ready state. Yes, that would work:

$ helm install -h
...
--wait if set, will wait until all Pods, PVCs, Services, and minimum number of Pods of a Deployment, StatefulSet, or ReplicaSet are in a ready state before marking the release as successful. It will wait for as long as --timeout

EDIT: compared to Helm --wait waiting until everything is ready, kubectl waiting only for the to-be-used service to become ready before using it might speed CI up a bit. It's more error-prone though [1], so I think helm install --wait is the better idea.

[1] As a service might fail due to another service it uses not being ready yet, and those dependencies, or their names, (potentially) change based on which Helm chart is being tested.

Collaborator

@yongfengdu yongfengdu left a comment

It looks to me like we shouldn't use "helm install -f cpu-values.yaml" for the test, since cpu-values.yaml is for HPA usage.
Maybe we can consider renaming it to cpu-hpa-values.yaml or something.

@yongfengdu yongfengdu merged commit cdd3585 into opea-project:main Sep 27, 2024
24 checks passed
@eero-t
Contributor Author

eero-t commented Sep 30, 2024

It looks to me like we shouldn't use "helm install -f cpu-values.yaml" for the test, since cpu-values.yaml is for HPA usage. Maybe we can consider renaming it to cpu-hpa-values.yaml or something.

No.

@yongfengdu cpu-values.yaml is for running OPEA services on CPUs; it's not (specifically) for HPA. The K8s scheduler can reliably assign pods to suitable nodes only if the pod spec states how many resources a given pod's containers need.

Using cpu-values.yaml is recommended for HPA because the lack of suitable resource requests (and realistic probe timings) becomes more visible the more badly specified pods with huge resource usage there are. However, HPA scale-up is just one of the reasons for there being more such pods; those extra pods could e.g. belong to some completely different service, or to other (people's / departments') ChatQnA instances.

In addition to pods without appropriate resource requests not getting scheduled properly, they also go to the lowest QoS class, i.e. they get throttled and evicted from the nodes first.
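
(As a reminder of how that mapping works, a minimal container resource-spec sketch; the numbers are arbitrary:)

resources:           # requests == limits for every container -> Guaranteed QoS
  requests:
    cpu: 8
    memory: 64Gi
  limits:
    cpu: 8
    memory: 64Gi
# requests set lower than limits -> Burstable QoS
# no requests or limits at all   -> BestEffort QoS, throttled and evicted first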

PS. cpu-values.yaml is just a quick workaround for ChatQnA; the real solution for the problem is discussed in #431.

@eero-t eero-t deleted the hpa-gaudi branch October 18, 2024 15:52