9179 Refactoring accelerator content (#395)
* Initial title reorg.

* Refactoring accelerator content. Adding new notes for model serving RE supported GPUs.

* Renaming NVIDIA GPU procedure module file.

* Further refactoring, removing dupe profile content for Intel.

* Moving end note in Intel enablement module.

* Minor wording and formatting changes.

* In-flight work on new NVIDIA integration module.

* In-flight work.

* Updating book to comment out planned module.

* Reversing capitalisation change.
grainnejenningsRH authored Aug 8, 2024
1 parent a20fdde commit 481a317
Showing 10 changed files with 54 additions and 45 deletions.
2 changes: 1 addition & 1 deletion assemblies/managing-cluster-resources.adoc
@@ -18,7 +18,7 @@ include::modules/restoring-the-default-pvc-size-for-your-cluster.adoc[leveloffse

include::modules/overview-of-accelerators.adoc[leveloffset=+1]

-include::modules/enabling-gpu-support-in-data-science.adoc[leveloffset=+2]
+include::modules/enabling-nvidia-gpus.adoc[leveloffset=+2]

include::modules/enabling-intel-gaudi-ai-accelerators.adoc[leveloffset=+2]

@@ -6,6 +6,11 @@
[role='_abstract']
When you have enabled the multi-model serving platform, you must configure a model server to deploy models. If you require extra computing power for use with large datasets, you can assign accelerators to your model server.

+[NOTE]
+====
+In {productname-short} {vernum}, {org-name} supports only NVIDIA GPU accelerators for model serving.
+====
+
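Assigning an accelerator to a model server ultimately surfaces as an extended-resource request on the serving pod. A hedged sketch of the container fragment such a selection might render (`nvidia.com/gpu` is the standard NVIDIA device-plugin resource name; the container name is illustrative):

```yaml
# Illustrative only: the kind of resource request an accelerator
# selection renders onto the model-serving container.
containers:
  - name: kserve-container   # hypothetical container name
    resources:
      requests:
        nvidia.com/gpu: "1"  # one NVIDIA GPU (extended resource)
      limits:
        nvidia.com/gpu: "1"  # GPU requests and limits must match
```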
.Prerequisites
* You have logged in to {productname-long}.
ifndef::upstream[]
@@ -18,7 +23,7 @@ endif::[]
* You have enabled the multi-model serving platform.
ifndef::upstream[]
* If you want to use a custom model-serving runtime for your model server, you have added and enabled the runtime. See link:{rhoaidocshome}{default-format-url}/serving_models/serving-small-and-medium-sized-models_model-serving#adding-a-custom-model-serving-runtime-for-the-multi-model-serving-platform_model-serving[Adding a custom model-serving runtime].
-* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-gpu-support_cluster-mgmt[Enabling GPU support in {productname-short}].
+* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-nvidia-gpus_cluster-mgmt[Enabling NVIDIA GPUs].
endif::[]
ifdef::upstream[]
* If you want to use a custom model-serving runtime for your model server, you have added and enabled the runtime. See link:{odhdocshome}/serving-models/#adding-a-custom-model-serving-runtime-for-the-multi-model-serving-platform_model-serving[Adding a custom model-serving runtime].
@@ -57,7 +57,7 @@ endif::[]

ifndef::upstream[]
* If you want to use graphics processing units (GPUs), you have enabled GPU support in {productname-short}.
-See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-gpu-support_cluster-mgmt[Enabling GPU support in {productname-short}].
+See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-nvidia-gpus_cluster-mgmt[Enabling NVIDIA GPUs].
+
[NOTE]
====
@@ -47,7 +47,7 @@ endif::[]

ifndef::upstream[]
* If you want to use graphics processing units (GPUs), you have enabled GPU support in {productname-short}.
-See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-gpu-support_cluster-mgmt[Enabling GPU support in {productname-short}].
+See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-nvidia-gpus_cluster-mgmt[Enabling NVIDIA GPUs].
+
[NOTE]
====
@@ -29,13 +29,18 @@ endif::[]
* For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket.
* To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see link:https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/built-tip.md#bootstrap-process[Converting Hugging Face Hub models to Caikit format^] in the link:https://github.com/opendatahub-io/caikit-tgis-serving/tree/main[caikit-tgis-serving^] repository.
ifndef::upstream[]
-* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-gpu-support_cluster-mgmt[Enabling GPU support in {productname-short}^].
-* To use the vLLM runtime, you have enabled GPU support in {productname-short} and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see link:https://docs.openshift.com/container-platform/{ocp-latest-version}/hardware_enablement/psap-node-feature-discovery-operator.html#installing-the-node-feature-discovery-operator_node-feature-discovery-operator[Installing the Node Feature Discovery operator] and link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-gpu-support_cluster-mgmt[Enabling GPU support in {productname-short}^]
+* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-nvidia-gpus_cluster-mgmt[Enabling NVIDIA GPUs^].
+* To use the vLLM runtime, you have enabled GPU support in {productname-short} and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see link:https://docs.openshift.com/container-platform/{ocp-latest-version}/hardware_enablement/psap-node-feature-discovery-operator.html#installing-the-node-feature-discovery-operator_node-feature-discovery-operator[Installing the Node Feature Discovery operator] and link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-nvidia-gpus_cluster-mgmt[Enabling NVIDIA GPUs^]
endif::[]
ifdef::upstream[]
* To use the vLLM runtime or use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery and GPU Operators. For more information, see https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator on {org-name} OpenShift Container Platform^] in the NVIDIA documentation.
endif::[]
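The vLLM prerequisite above relies on the Node Feature Discovery (NFD) Operator labeling GPU-bearing nodes. After installing that Operator, discovery is typically activated by creating a `NodeFeatureDiscovery` instance; a minimal sketch (API version, name, and namespace follow common OpenShift conventions but are assumptions — verify against the CRD installed on your cluster):

```yaml
# Minimal NodeFeatureDiscovery instance sketch (assumed schema).
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance         # conventional name in OpenShift docs
  namespace: openshift-nfd   # assumed Operator namespace
spec: {}                     # defaults discover PCI devices, including GPUs
```

Once the instance reconciles, GPU nodes receive labels such as `feature.node.kubernetes.io/pci-10de.present=true` (`10de` is NVIDIA's PCI vendor ID), which downstream operators use for scheduling.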

+[NOTE]
+====
+In {productname-short} {vernum}, {org-name} supports only NVIDIA GPU accelerators for model serving.
+====
+
.Procedure
. In the left menu, click *Data Science Projects*.
+
42 changes: 10 additions & 32 deletions modules/enabling-intel-gaudi-ai-accelerators.adoc
@@ -18,44 +18,22 @@ endif::[]

.Procedure
. To enable Intel Gaudi AI accelerators in {productname-short}, follow the instructions at link:https://docs.habana.ai/en/latest/Orchestration/HabanaAI_Operator/index.html[HabanaAI Operator for OpenShift].
-. From the {productname-short} dashboard, click *Settings* -> *Accelerator profiles*.
-+
-The *Accelerator profiles* page appears, displaying existing accelerator profiles. To enable or disable an existing accelerator profile, on the row containing the relevant accelerator profile, click the toggle in the *Enable* column.
-. Click *Create accelerator profile*.
-+
-The *Create accelerator profile* dialog opens.
-. In the *Name* field, enter a name for the Intel Gaudi AI Accelerator.
-. In the *Identifier* field, enter a unique string that identifies the Intel Gaudi AI Accelerator, for example, `habana.ai/gaudi`.
-. Optional: In the *Description* field, enter a description for the Intel Gaudi AI Accelerator.
-. To enable or disable the accelerator profile for the Intel Gaudi AI Accelerator immediately after creation, click the toggle in the *Enable* column.
-. Optional: Add a toleration to schedule pods with matching taints.
-.. Click *Add toleration*.
-+
-The *Add toleration* dialog opens.
-.. From the *Operator* list, select one of the following options:
-* *Equal* - The *key/value/effect* parameters must match. This is the default.
-* *Exists* - The *key/effect* parameters must match. You must leave a blank value parameter, which matches any.
-.. From the *Effect* list, select one of the following options:
-* *None*
-* *NoSchedule* - New pods that do not match the taint are not scheduled onto that node. Existing pods on the node remain.
-* *PreferNoSchedule* - New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to. Existing pods on the node remain.
-* *NoExecute* - New pods that do not match the taint cannot be scheduled onto that node. Existing pods on the node that do not have a matching toleration are removed.
-.. In the *Key* field, enter the toleration key `habana.ai/gaudi`. The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.
-.. In the *Value* field, enter a toleration value. The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.
-.. In the *Toleration Seconds* section, select one of the following options to specify how long a pod stays bound to a node that has a node condition.
-** *Forever* - Pods stays permanently bound to a node.
-** *Custom value* - Enter a value, in seconds, to define how long pods stay bound to a node that has a node condition.
-.. Click *Add*.
-. Click *Create accelerator profile*.

.Verification
* From the *Administrator* perspective, the following Operators appear on the *Operators* -> *Installed Operators* page.
** HabanaAI
** Node Feature Discovery (NFD)
** Kernel Module Management (KMM)
-* The *Accelerator* list displays the Intel Gaudi AI Accelerator on the *Start a notebook server* page. After you select an accelerator, the *Number of accelerators* field appears, which you can use to choose the number of accelerators for your notebook server.
-* The accelerator profile appears on the *Accelerator profiles* page
-* The accelerator profile appears on the *Instances* tab on the details page for the `AcceleratorProfile` custom resource definition (CRD).

+//downstream - all
+ifndef::upstream[]
+After installing the HabanaAI Operator, create an accelerator profile as described in link:{rhoaidocshome}{default-format-url}/working_with_accelerators/#working-with-accelerator-profiles_accelerators[Working with accelerator profiles].
+endif::[]
+//upstream only
+ifdef::upstream[]
+After installing the HabanaAI Operator, create an accelerator profile as described in link:{odhdocshome}/working-with-accelerators/#working-with-accelerator-profiles_accelerators[Working with accelerator profiles].
+endif::[]
+
+
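The dashboard steps in this module (name, identifier, enable toggle, tolerations) correspond to fields of the `AcceleratorProfile` custom resource that the form creates behind the scenes. A hedged sketch, assuming the OpenDataHub dashboard API group — verify the group, version, and field names against the CRD on your cluster:

```yaml
# Sketch of the AcceleratorProfile resource the dashboard form produces.
apiVersion: dashboard.opendatahub.io/v1   # assumed API group/version
kind: AcceleratorProfile
metadata:
  name: intel-gaudi             # hypothetical resource name
spec:
  displayName: Intel Gaudi AI Accelerator
  identifier: habana.ai/gaudi   # extended-resource name workloads request
  enabled: true
  tolerations:
    - key: habana.ai/gaudi      # matches the taint on Gaudi nodes
      operator: Exists          # match on key/effect; value left empty
      effect: NoSchedule        # only tolerating pods schedule onto the node
```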
[role='_additional-resources']
.Additional resources
@@ -3,18 +3,18 @@
//:upstream:
//:self-managed:

-[id='enabling-gpu-support_{context}']
-= Enabling GPU support in {productname-short}
+[id='enabling-nvidia-gpus_{context}']
+= Enabling NVIDIA GPUs

[role='_abstract']
-Optionally, to ensure that your data scientists can use compute-heavy workloads in their models, you can enable graphics processing units (GPUs) in {productname-short}.
+Before you can use NVIDIA GPUs in {productname-short}, you must install the NVIDIA GPU Operator.

//the following note applies to self-managed connected only
ifdef::self-managed[]
ifndef::disconnected[]
[IMPORTANT]
====
-If you are using {productname-short} in a disconnected self-managed environment, see link:{rhoaidocshome}{default-format-url}/installing_and_uninstalling_{url-productname-short}_in_a_disconnected_environment/enabling-gpu-support_install[Enabling GPU support in {productname-short}] instead.
+If you are using {productname-short} in a disconnected self-managed environment, see link:{rhoaidocshome}{default-format-url}/installing_and_uninstalling_{url-productname-short}_in_a_disconnected_environment/enabling-nvidia-gpus_install[Enabling NVIDIA GPUs] instead.
====
endif::[]
endif::[]
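Installing the NVIDIA GPU Operator, which this module now points to, typically ends with creating a `ClusterPolicy` custom resource that tells the Operator to roll out its driver and device-plugin stack. A minimal sketch with defaults — the resource name and empty spec follow NVIDIA's commonly documented example, so treat them as assumptions:

```yaml
# Minimal ClusterPolicy sketch: Operator defaults handle the driver,
# device plugin, and monitoring components.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy   # conventional name from NVIDIA's docs
spec: {}                     # accept Operator defaults
```

When reconciliation completes, GPU nodes report an allocatable `nvidia.com/gpu` extended resource that workloads can request.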
10 changes: 10 additions & 0 deletions modules/nvidia-gpu-integration.adoc
@@ -0,0 +1,10 @@
+:_module-type: CONCEPT
+
+[id='nvidia-gpu-integration_{context}']
+= NVIDIA GPU integration
+
+[role='_abstract']
+//Module to be populated later.
+
+
+
2 changes: 1 addition & 1 deletion modules/starting-a-jupyter-notebook-server.adoc
@@ -54,7 +54,7 @@ ifdef::upstream[]
Using accelerators is only supported with specific notebook images. For GPUs, only the PyTorch, TensorFlow, and CUDA notebook images are supported. For Intel Gaudi AI accelerators, only the HabanaAI notebook image is supported. In addition, you can only specify the number of accelerators required for your notebook server if accelerators are enabled on your cluster.
endif::[]
ifndef::upstream[]
-Using accelerators is only supported with specific notebook images. For GPUs, only the PyTorch, TensorFlow, and CUDA notebook images are supported. For Intel Gaudi AI accelerators, only the HabanaAI notebook image is supported. In addition, you can only specify the number of accelerators required for your notebook server if accelerators are enabled on your cluster. To learn how to enable GPU support, see link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-gpu-support_cluster-mgmt[Enabling GPU support in {productname-short}].
+Using accelerators is only supported with specific notebook images. For GPUs, only the PyTorch, TensorFlow, and CUDA notebook images are supported. For Intel Gaudi AI accelerators, only the HabanaAI notebook image is supported. In addition, you can only specify the number of accelerators required for your notebook server if accelerators are enabled on your cluster. To learn how to enable accelerator support, see link:{rhoaidocshome}{default-format-url}/working_with_accelerators/overview-of-accelerators_accelerators[Working with accelerators].
endif::[]
--
.. Optional: Select and specify values for any new *Environment variables*.
15 changes: 13 additions & 2 deletions working-with-accelerators.adoc
@@ -18,8 +18,21 @@ include::_artifacts/document-attributes-global.adoc[]

Use accelerators, such as NVIDIA GPUs and Intel Gaudi AI accelerators, to optimize the performance of your end-to-end data science workflows.

+//Overview of accelerators
include::modules/overview-of-accelerators.adoc[leveloffset=+1]

+//Specific partner content
+//NVIDIA GPUs
+//include::modules/nvidia-gpu-integration.adoc[leveloffset=+1]
+
+include::modules/enabling-nvidia-gpus.adoc[leveloffset=+1]
+
+//Intel Gaudi AI accelerators
+include::modules/intel-gaudi-ai-accelerator-integration.adoc[leveloffset=+1]
+
+include::modules/enabling-intel-gaudi-ai-accelerators.adoc[leveloffset=+2]
+
+//Using accelerator profiles
include::modules/working-with-accelerator-profiles.adoc[leveloffset=+1]

include::modules/creating-an-accelerator-profile.adoc[leveloffset=+2]
@@ -34,7 +47,5 @@ include::modules/configuring-a-recommended-accelerator-for-notebook-images.adoc[

include::modules/configuring-a-recommended-accelerator-for-serving-runtimes.adoc[leveloffset=+2]

-include::modules/intel-gaudi-ai-accelerator-integration.adoc[leveloffset=+1]

-include::modules/enabling-intel-gaudi-ai-accelerators.adoc[leveloffset=+2]
