feat: added performance metric graphs config for nvidia nim #320

Merged

Conversation

TomerFi
Contributor

@TomerFi TomerFi commented Nov 28, 2024

Description

Added metrics graphs configuration for NVIDIA NIM runtimes, including logic for identifying said runtimes:

| Graph | Query |
|---|---|
| Requests per 5 minutes | Number of successful incoming requests |
| | Number of failed incoming requests |
| Average response time (ms) | Average inference latency (not included) |
| | Average e2e latency |
| CPU utilization % | CPU usage |
| Memory utilization % | Memory usage |

Currently, NIM runtimes do not report inference latency (see here). Hence, the Average inference latency query is NOT included in this PR.

This PR includes:

  • Modifying the Template for NIM's ServingRuntimes:
    • Adding runtime metadata annotation for identification.
    • Adding runtime spec annotations for ISTIO Prometheus metrics merge.
  • Adding metrics JSON object encapsulating NVIDIA NIM queries.
  • Modifying the metrics JSON selection process to return the new NVIDIA NIM object for runtimes annotated accordingly.
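
The selection step in the last bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the annotation key `example.opendatahub.io/nvidia-nim`, the function `selectMetricsConfig`, and both JSON payloads are hypothetical placeholders.

```go
package main

import "fmt"

// nimRuntimeAnnotation is a hypothetical annotation key used only for this
// sketch; the key actually used by the PR may differ.
const nimRuntimeAnnotation = "example.opendatahub.io/nvidia-nim"

// Placeholder metrics payloads (assumptions). The real objects hold graph
// titles and Prometheus queries, which are elided here.
const (
	defaultMetricsJSON = `{"config": []}`
	nimMetricsJSON     = `{"config": [{"title": "Requests per 5 minutes"}]}`
)

// selectMetricsConfig returns the NIM metrics object when the runtime's
// metadata carries the identifying annotation set to "true", and the
// default object otherwise.
func selectMetricsConfig(annotations map[string]string) string {
	if annotations[nimRuntimeAnnotation] == "true" {
		return nimMetricsJSON
	}
	return defaultMetricsJSON
}

func main() {
	nim := map[string]string{nimRuntimeAnnotation: "true"}
	fmt.Println(selectMetricsConfig(nim))
	fmt.Println(selectMetricsConfig(nil)) // reading a nil map is safe in Go
}
```

Keying the choice on a metadata annotation keeps the lookup independent of the runtime's name, so renamed or templated runtimes would still be recognized in this sketch.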

Jira: NVPE-30.

How Has This Been Tested?

This work was tested against an OpenShift cluster (dev04):

  • I deployed NIM runtime.
  • I executed a couple of requests against the runtime.
  • I connected to the ISTIO sidecar and verified the metrics merge.
  • I opened the related graphs page and verified them (see attached snapshot).
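
The sidecar check in the third step could be approximated with a small helper like the one below. This is a sketch under assumptions: with Istio metrics merging, the sidecar agent typically serves the merged output at `http://localhost:15020/stats/prometheus`, and the metric names in the sample are made up for illustration rather than taken from a real NIM runtime. The helper names `metricNames` and `hasMetricWithPrefix` are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// metricNames extracts the distinct metric names from a Prometheus text
// exposition body, skipping blank lines and comment lines (# HELP, # TYPE).
func metricNames(body string) map[string]bool {
	names := make(map[string]bool)
	scanner := bufio.NewScanner(strings.NewReader(body))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The name ends at the first '{' (labels) or space (value).
		if i := strings.IndexAny(line, "{ "); i > 0 {
			line = line[:i]
		}
		names[line] = true
	}
	return names
}

// hasMetricWithPrefix reports whether any metric name starts with prefix.
func hasMetricWithPrefix(body, prefix string) bool {
	for name := range metricNames(body) {
		if strings.HasPrefix(name, prefix) {
			return true
		}
	}
	return false
}

func main() {
	// Made-up sample; real output would come from the sidecar endpoint.
	sample := `# TYPE istio_requests_total counter
istio_requests_total{response_code="200"} 3
# TYPE example_runtime_requests counter
example_runtime_requests 3
`
	fmt.Println(hasMetricWithPrefix(sample, "istio_"))
	fmt.Println(hasMetricWithPrefix(sample, "example_runtime_"))
	fmt.Println(hasMetricWithPrefix(sample, "missing_"))
}
```

Seeing both the proxy's own metrics and the runtime's metrics in the same response is what indicates that the merge is working.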

image

Note

Graphs are currently turned off for NIM runtimes, so I enabled them locally; the snapshot was taken from a frontend running on my local machine against the remote cluster. Jira for enabling: NVPE-18.

Note

Building and testing the queries required enabling monitoring for user-defined projects (see here) so that the runtime metrics are available from the OpenShift Metrics dashboard.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>
@openshift-ci openshift-ci bot requested review from Jooho and rnetser November 28, 2024 15:16
Contributor

openshift-ci bot commented Nov 28, 2024

Hi @TomerFi. Thanks for your PR.

I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Jooho
Contributor

Jooho commented Nov 28, 2024

/ok-to-test

@spolti
Member

spolti commented Nov 28, 2024

The changes look good to me. Just one question: should we align this with the dashboard team?

@TomerFi
Contributor Author

TomerFi commented Nov 28, 2024

The changes look good to me. Just one question: should we align this with the dashboard team?

@spolti, we're working with them. Currently, NIM metrics are disabled in the dashboard. We have a Jira in place to eventually re-enable them.

@spolti
Member

spolti commented Nov 28, 2024

Okay, thanks.

@Jooho
Contributor

Jooho commented Nov 29, 2024

/test

Contributor

openshift-ci bot commented Nov 29, 2024

@Jooho: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

/test images
/test pr-image-mirror
/test unit

Use /test all to run all jobs.

In response to this:

/test


@Jooho
Contributor

Jooho commented Nov 29, 2024

/test all

Contributor

@israel-hdez israel-hdez left a comment

I have a small suggestion. But otherwise it is OK.

If you think that the current code is fine, let me know, and I will approve.

controllers/utils/nim.go (outdated, resolved)
@TomerFi
Contributor Author

TomerFi commented Dec 1, 2024

I have a small suggestion. But otherwise it is OK.

If you think that the current code is fine, let me know, and I will approve.

Good idea. I accepted the change suggestion.

@TomerFi TomerFi force-pushed the nvidia-nim-metrics branch from d8a3544 to d761fd2 Compare December 1, 2024 14:12
Co-authored-by: Edgar Hernández <ehernand@redhat.com>
Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>
@TomerFi TomerFi force-pushed the nvidia-nim-metrics branch from d761fd2 to f0cc223 Compare December 1, 2024 14:13
Contributor

@israel-hdez israel-hdez left a comment

/lgtm

Contributor

openshift-ci bot commented Dec 2, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: israel-hdez, spolti, TomerFi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit aee3e05 into opendatahub-io:incubating Dec 2, 2024
5 checks passed
@TomerFi TomerFi deleted the nvidia-nim-metrics branch December 11, 2024 23:25
openshift-merge-bot bot pushed a commit that referenced this pull request Jan 16, 2025
* update global ca bundle logic and storage-config logic to follow up odh operator pr(1339) (#308)

Signed-off-by: jooho lee <jlee@redhat.com>

* disable dashboard and fix servingruntime display name

Signed-off-by: jooho lee <jlee@redhat.com>

* Use the main branch to build stable image tags, incubating for latest image tags (#316)

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

* [RHOAIENG-13638] - Do not allow isvc creation in protected isvc (#311)

* [RHOAIENG-13638] - Do not allow isvc creation in protected namespace

chore: Fixes [RHOAIENG-13638] - Kserve model is not Ready after a kserve model is created and deleted from istio-system namespace

Signed-off-by: Spolti <fspolti@redhat.com>

* review suggestions

Signed-off-by: Spolti <fspolti@redhat.com>

* Update controllers/webhook/isvc_validator.go

Co-authored-by: Edgar Hernández <ehernand@redhat.com>
Signed-off-by: Spolti <fspolti@redhat.com>

---------

Signed-off-by: Spolti <fspolti@redhat.com>
Co-authored-by: Edgar Hernández <ehernand@redhat.com>

* update gitaction based on branch strategy change (#322)

Signed-off-by: jooho lee <jlee@redhat.com>

* feat: added performance metric grpahs config for nvidia nim (#320)

* feat: added performance metric grpahs config for nvidia nim

Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>

* chore: modifyed the runtime id annotation

Co-authored-by: Edgar Hernández <ehernand@redhat.com>
Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>

---------

Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>
Co-authored-by: Edgar Hernández <ehernand@redhat.com>

* Add NIM flag logic (#312)

Signed-off-by: mtrujillo <trujillo169@hotmail.com>

* Grab the old release tag based on creation date

Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>

* Updated the checkout code command

Signed-off-by: Mariah Holder <marholde@marholde-thinkpadp16vgen1.rht.csb>

* Updated the checkout code command (#329)

Signed-off-by: Mariah Holder <marholde@marholde-thinkpadp16vgen1.rht.csb>
Co-authored-by: Mariah Holder <marholde@marholde-thinkpadp16vgen1.rht.csb>

* Add reconciliation for Kserve Raw (#274)

Signed-off-by: Vedant Mahabaleshwarkar <vmahabal@redhat.com>

* chore: added pagination support for nim catalog response (#332)

Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>

* feat(mr): enable model registry inference reconcile (#326)

Signed-off-by: Alessio Pragliola <seth.pro@gmail.com>

* add upstream release metadata (#333)

Signed-off-by: heyselbi <selbi@redhat.com>

* Migration to kubebuilder v4 (#324)

* Migration to kubebuilder v4

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Restore MR E2Es

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Restore top-level files

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Cleaning

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Fixing Makefile and Containerfile

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Linter fixes

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Initial rework of manifests

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Fix manifests

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Fix lint issues

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Deactivate E2Es

Because setup is not automated, yet.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Feedback: Filippe

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Feedback: Filippe

Test differences after `go mod tidy`

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Apply suggestions from code review: Filippe

Co-authored-by: Filippe Spolti <filippespolti@gmail.com>
Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Feedback: Filippe

* Pin go-toolset base image in Containerfile.
* Add `gosec` linter

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Update config/prometheus/monitor.yaml

Co-authored-by: Filippe Spolti <filippespolti@gmail.com>
Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Feedback: Filippe

* Small change to comments in Makefile, to make the text clearer.
* Remove (again) `gosec` linter

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Fix panic on controller startup

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

---------

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
Co-authored-by: Filippe Spolti <filippespolti@gmail.com>

* chore: use naming convention for resources created by nim (#340)

* chore: use naming convention for resources created by nim

Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>

* test: added assertions for dyamic nim resources name

Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>

---------

Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>

* chore: set nim runtime api call page size to 1000 (#344)

Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>

* Nim enablement change default to managed and add clean up job (#342)

* initial commit for clean up of nim and managed set as default

Signed-off-by: mtrujillo <trujillo169@hotmail.com>

* remove space

Signed-off-by: mtrujillo <trujillo169@hotmail.com>

* fix code length for linting

Signed-off-by: mtrujillo <trujillo169@hotmail.com>

* fixed comments / adjusted import

Signed-off-by: mtrujillo <trujillo169@hotmail.com>

---------

Signed-off-by: mtrujillo <trujillo169@hotmail.com>

* chore: added new graph object for nim runtimes (#334)

* chore: added new graph object for nim runtimes

Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>

* chore: added REQUEST_OUTCOMES nim graph

Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>

* chore: added fixed typo in nim query object

Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>

* chore: fixed typo in nim query object

Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>

* chore: added initial query for nim gpu cache usage

Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>

* chore: rewrite queries for nim new graphs

Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>

---------

Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>

* Update ovms to current build (#343)

Signed-off-by: Steve Grubb <ausearch.1@gmail.com>
Co-authored-by: Steve Grubb <ausearch.1@gmail.com>

* Automatically inject expected ODH annotations to InferenceGraph and InferenceServices (#339)

* Implementation of ODH defaulters for InferenceGraph and InferenceService

On creation of InferenceGraph or InferenceService resources, the following default annotations will be added:
* `serving.knative.openshift.io/enablePassthrough: true`
* `sidecar.istio.io/inject: true`
* `sidecar.istio.io/rewriteAppHTTPProbers: true`

The annotations are added only for Serverless mode, and only if they are missing.
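
The "only if they are missing" rule described above might look roughly like this. It is a minimal sketch, not the defaulter from the referenced commit: the function `applyDefaults` and the `serverless` flag are hypothetical stand-ins for however the real webhook detects the deployment mode, while the annotation keys are the ones listed in the commit message.

```go
package main

import "fmt"

// defaultAnnotations lists the annotations named in the commit message.
var defaultAnnotations = map[string]string{
	"serving.knative.openshift.io/enablePassthrough": "true",
	"sidecar.istio.io/inject":                        "true",
	"sidecar.istio.io/rewriteAppHTTPProbers":         "true",
}

// applyDefaults returns a copy of annotations with any missing default
// added. Existing values are never overwritten, and nothing is added
// when serverless is false.
func applyDefaults(annotations map[string]string, serverless bool) map[string]string {
	out := make(map[string]string, len(annotations)+len(defaultAnnotations))
	for k, v := range annotations {
		out[k] = v
	}
	if !serverless {
		return out
	}
	for k, v := range defaultAnnotations {
		if _, ok := out[k]; !ok {
			out[k] = v
		}
	}
	return out
}

func main() {
	existing := map[string]string{"sidecar.istio.io/inject": "false"}
	fmt.Println(applyDefaults(existing, true))  // keeps the explicit "false"
	fmt.Println(applyDefaults(existing, false)) // unchanged copy
}
```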

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Feedback: Filippe

Extract "ENABLE_WEBHOOKS" string to constant

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

---------

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Authorization for InferenceGraph (Serverless) (#345)

* Authorization for InferenceGraph (Serverless)

This adds a new controller for KServe InferenceGraph resources. This new controller will have the responsibility of creating Authorino AuthConfig resources (similarly to InferenceServices case), when authorization is available in ODH platform.

InferenceGraphs can now be annotated with `security.opendatahub.io/enable-auth: "true"` to secure InferenceGraphs and only serve requests that are authorized.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Feedback: Filippe - Event when auth is not available

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

---------

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* [RHOAIENG-10293] add metrics resources for rawdeployment (#347)

* [RHOAIENG-10293] add metrics resources for rawdeployment

Signed-off-by: Vedant Mahabaleshwarkar <vmahabal@redhat.com>

* [RHOAIENG-10293] address feedback

Signed-off-by: Vedant Mahabaleshwarkar <vmahabal@redhat.com>

---------

Signed-off-by: Vedant Mahabaleshwarkar <vmahabal@redhat.com>

* [RHOAIENG-16851] rawdeployment route bug fixes (#341)

Signed-off-by: Vedant Mahabaleshwarkar <vmahabal@redhat.com>

* fix null pointer error (RHOAIENG-18228) (#349)

Signed-off-by: jooho lee <jlee@redhat.com>

* remove old file

Signed-off-by: jooho lee <jlee@redhat.com>

update go.mod

Signed-off-by: jooho lee <jlee@redhat.com>

---------

Signed-off-by: jooho lee <jlee@redhat.com>
Signed-off-by: Hannah DeFazio <h2defazio@gmail.com>
Signed-off-by: Spolti <fspolti@redhat.com>
Signed-off-by: Tomer Figenblat <tfigenbl@redhat.com>
Signed-off-by: mtrujillo <trujillo169@hotmail.com>
Signed-off-by: Mariah Holder <marholde@marholde-thinkpadp16vgen1.rht.csb>
Signed-off-by: Vedant Mahabaleshwarkar <vmahabal@redhat.com>
Signed-off-by: Alessio Pragliola <seth.pro@gmail.com>
Signed-off-by: heyselbi <selbi@redhat.com>
Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
Signed-off-by: Steve Grubb <ausearch.1@gmail.com>
Co-authored-by: Hannah DeFazio <h2defazio@gmail.com>
Co-authored-by: Filippe Spolti <filippespolti@gmail.com>
Co-authored-by: Edgar Hernández <ehernand@redhat.com>
Co-authored-by: Tomer Figenblat <tomer.figenblat@gmail.com>
Co-authored-by: Marcus Trujillo <42344046+trujillm@users.noreply.github.com>
Co-authored-by: Mariah Holder <marholde@marholde-thinkpadp16vgen1.rht.csb>
Co-authored-by: Mariah Holder <94134625+mholder6@users.noreply.github.com>
Co-authored-by: Vedant Mahabaleshwarkar <vmahabal@redhat.com>
Co-authored-by: Tomer Figenblat <tfigenbl@redhat.com>
Co-authored-by: Alessio Pragliola <83355398+Al-Pragliola@users.noreply.github.com>
Co-authored-by: Selbi Nuryyeva <selbi@redhat.com>
Co-authored-by: Steven Grubb <sgrubb@redhat.com>
Co-authored-by: Steve Grubb <ausearch.1@gmail.com>
4 participants