diff --git a/assemblies/managing-cluster-resources.adoc b/assemblies/managing-cluster-resources.adoc index 5f724396..4db43bb6 100644 --- a/assemblies/managing-cluster-resources.adoc +++ b/assemblies/managing-cluster-resources.adoc @@ -18,7 +18,7 @@ include::modules/restoring-the-default-pvc-size-for-your-cluster.adoc[leveloffse include::modules/overview-of-accelerators.adoc[leveloffset=+1] -include::modules/enabling-gpu-support-in-data-science.adoc[leveloffset=+2] +include::modules/enabling-nvidia-gpus.adoc[leveloffset=+2] include::modules/enabling-intel-gaudi-ai-accelerators.adoc[leveloffset=+2] diff --git a/modules/adding-a-model-server-for-the-multi-model-serving-platform.adoc b/modules/adding-a-model-server-for-the-multi-model-serving-platform.adoc index e2a5b48d..2b261770 100644 --- a/modules/adding-a-model-server-for-the-multi-model-serving-platform.adoc +++ b/modules/adding-a-model-server-for-the-multi-model-serving-platform.adoc @@ -6,6 +6,11 @@ [role='_abstract'] When you have enabled the multi-model serving platform, you must configure a model server to deploy models. If you require extra computing power for use with large datasets, you can assign accelerators to your model server. +[NOTE] +==== +In {productname-short} {vernum}, {org-name} supports only NVIDIA GPU accelerators for model serving. +==== + .Prerequisites * You have logged in to {productname-long}. ifndef::upstream[] @@ -18,7 +23,7 @@ endif::[] * You have enabled the multi-model serving platform. ifndef::upstream[] * If you want to use a custom model-serving runtime for your model server, you have added and enabled the runtime. See link:{rhoaidocshome}{default-format-url}/serving_models/serving-small-and-medium-sized-models_model-serving#adding-a-custom-model-serving-runtime-for-the-multi-model-serving-platform_model-serving[Adding a custom model-serving runtime]. 
-* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-gpu-support_cluster-mgmt[Enabling GPU support in {productname-short}]. +* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-nvidia-gpus_cluster-mgmt[Enabling NVIDIA GPUs]. endif::[] ifdef::upstream[] * If you want to use a custom model-serving runtime for your model server, you have added and enabled the runtime. See link:{odhdocshome}/serving-models/#adding-a-custom-model-serving-runtime-for-the-multi-model-serving-platform_model-serving[Adding a custom model-serving runtime]. diff --git a/modules/configuring-quota-management-for-distributed-workloads.adoc b/modules/configuring-quota-management-for-distributed-workloads.adoc index 15694731..e40db01e 100644 --- a/modules/configuring-quota-management-for-distributed-workloads.adoc +++ b/modules/configuring-quota-management-for-distributed-workloads.adoc @@ -57,7 +57,7 @@ endif::[] ifndef::upstream[] * If you want to use graphics processing units (GPUs), you have enabled GPU support in {productname-short}. -See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-gpu-support_cluster-mgmt[Enabling GPU support in {productname-short}]. +See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-nvidia-gpus_cluster-mgmt[Enabling NVIDIA GPUs]. 
+ [NOTE] ==== diff --git a/modules/configuring-the-distributed-workloads-components.adoc b/modules/configuring-the-distributed-workloads-components.adoc index 85ed8bc0..84882ed5 100644 --- a/modules/configuring-the-distributed-workloads-components.adoc +++ b/modules/configuring-the-distributed-workloads-components.adoc @@ -47,7 +47,7 @@ endif::[] ifndef::upstream[] * If you want to use graphics processing units (GPUs), you have enabled GPU support in {productname-short}. -See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-gpu-support_cluster-mgmt[Enabling GPU support in {productname-short}]. +See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-nvidia-gpus_cluster-mgmt[Enabling NVIDIA GPUs]. + [NOTE] ==== diff --git a/modules/deploying-models-on-the-single-model-serving-platform.adoc b/modules/deploying-models-on-the-single-model-serving-platform.adoc index fde5e0c8..06345246 100644 --- a/modules/deploying-models-on-the-single-model-serving-platform.adoc +++ b/modules/deploying-models-on-the-single-model-serving-platform.adoc @@ -29,13 +29,18 @@ endif::[] * For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket. * To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see link:https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/built-tip.md#bootstrap-process[Converting Hugging Face Hub models to Caikit format^] in the link:https://github.com/opendatahub-io/caikit-tgis-serving/tree/main[caikit-tgis-serving^] repository. ifndef::upstream[] -* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. 
See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-gpu-support_cluster-mgmt[Enabling GPU support in {productname-short}^]. -* To use the vLLM runtime, you have enabled GPU support in {productname-short} and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see link:https://docs.openshift.com/container-platform/{ocp-latest-version}/hardware_enablement/psap-node-feature-discovery-operator.html#installing-the-node-feature-discovery-operator_node-feature-discovery-operator[Installing the Node Feature Discovery operator] and link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-gpu-support_cluster-mgmt[Enabling GPU support in {productname-short}^] +* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. See link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-nvidia-gpus_cluster-mgmt[Enabling NVIDIA GPUs^]. +* To use the vLLM runtime, you have enabled GPU support in {productname-short} and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see link:https://docs.openshift.com/container-platform/{ocp-latest-version}/hardware_enablement/psap-node-feature-discovery-operator.html#installing-the-node-feature-discovery-operator_node-feature-discovery-operator[Installing the Node Feature Discovery operator] and link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-nvidia-gpus_cluster-mgmt[Enabling NVIDIA GPUs^]. endif::[] ifdef::upstream[] * To use the vLLM runtime or use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery and GPU Operators.
For more information, see https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator on {org-name} OpenShift Container Platform^] in the NVIDIA documentation. endif::[] +[NOTE] +==== +In {productname-short} {vernum}, {org-name} supports only NVIDIA GPU accelerators for model serving. +==== + .Procedure . In the left menu, click *Data Science Projects*. + diff --git a/modules/enabling-intel-gaudi-ai-accelerators.adoc b/modules/enabling-intel-gaudi-ai-accelerators.adoc index 7ac447cc..88aa85b7 100644 --- a/modules/enabling-intel-gaudi-ai-accelerators.adoc +++ b/modules/enabling-intel-gaudi-ai-accelerators.adoc @@ -18,44 +18,22 @@ endif::[] .Procedure . To enable Intel Gaudi AI accelerators in {productname-short}, follow the instructions at link:https://docs.habana.ai/en/latest/Orchestration/HabanaAI_Operator/index.html[HabanaAI Operator for OpenShift]. -. From the {productname-short} dashboard, click *Settings* -> *Accelerator profiles*. -+ -The *Accelerator profiles* page appears, displaying existing accelerator profiles. To enable or disable an existing accelerator profile, on the row containing the relevant accelerator profile, click the toggle in the *Enable* column. -. Click *Create accelerator profile*. -+ -The *Create accelerator profile* dialog opens. -. In the *Name* field, enter a name for the Intel Gaudi AI Accelerator. -. In the *Identifier* field, enter a unique string that identifies the Intel Gaudi AI Accelerator, for example, `habana.ai/gaudi`. -. Optional: In the *Description* field, enter a description for the Intel Gaudi AI Accelerator. -. To enable or disable the accelerator profile for the Intel Gaudi AI Accelerator immediately after creation, click the toggle in the *Enable* column. -. Optional: Add a toleration to schedule pods with matching taints. -.. Click *Add toleration*. -+ -The *Add toleration* dialog opens. -.. 
From the *Operator* list, select one of the following options: -* *Equal* - The *key/value/effect* parameters must match. This is the default. -* *Exists* - The *key/effect* parameters must match. You must leave a blank value parameter, which matches any. -.. From the *Effect* list, select one of the following options: -* *None* -* *NoSchedule* - New pods that do not match the taint are not scheduled onto that node. Existing pods on the node remain. -* *PreferNoSchedule* - New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to. Existing pods on the node remain. -* *NoExecute* - New pods that do not match the taint cannot be scheduled onto that node. Existing pods on the node that do not have a matching toleration are removed. -.. In the *Key* field, enter the toleration key `habana.ai/gaudi`. The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores. -.. In the *Value* field, enter a toleration value. The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores. -.. In the *Toleration Seconds* section, select one of the following options to specify how long a pod stays bound to a node that has a node condition. -** *Forever* - Pods stays permanently bound to a node. -** *Custom value* - Enter a value, in seconds, to define how long pods stay bound to a node that has a node condition. -.. Click *Add*. -. Click *Create accelerator profile*. .Verification * From the *Administrator* perspective, the following Operators appear on the *Operators* -> *Installed Operators* page. ** HabanaAI ** Node Feature Discovery (NFD) ** Kernel Module Management (KMM) -* The *Accelerator* list displays the Intel Gaudi AI Accelerator on the *Start a notebook server* page. 
After you select an accelerator, the *Number of accelerators* field appears, which you can use to choose the number of accelerators for your notebook server. -* The accelerator profile appears on the *Accelerator profiles* page -* The accelerator profile appears on the *Instances* tab on the details page for the `AcceleratorProfile` custom resource definition (CRD). + +//downstream - all +ifndef::upstream[] +After installing the HabanaAI Operator, create an accelerator profile as described in link:{rhoaidocshome}{default-format-url}/working_with_accelerators/#working-with-accelerator-profiles_accelerators[Working with accelerator profiles]. +endif::[] +//upstream only +ifdef::upstream[] +After installing the HabanaAI Operator, create an accelerator profile as described in link:{odhdocshome}/working-with-accelerators/#working-with-accelerator-profiles_accelerators[Working with accelerator profiles]. +endif::[] + [role='_additional-resources'] .Additional resources diff --git a/modules/enabling-gpu-support-in-data-science.adoc b/modules/enabling-nvidia-gpus.adoc similarity index 92% rename from modules/enabling-gpu-support-in-data-science.adoc rename to modules/enabling-nvidia-gpus.adoc index 1433e689..349f89a1 100644 --- a/modules/enabling-gpu-support-in-data-science.adoc +++ b/modules/enabling-nvidia-gpus.adoc @@ -3,18 +3,18 @@ //:upstream: //:self-managed: -[id='enabling-gpu-support_{context}'] -= Enabling GPU support in {productname-short} +[id='enabling-nvidia-gpus_{context}'] += Enabling NVIDIA GPUs [role='_abstract'] -Optionally, to ensure that your data scientists can use compute-heavy workloads in their models, you can enable graphics processing units (GPUs) in {productname-short}. +Before you can use NVIDIA GPUs in {productname-short}, you must install the NVIDIA GPU Operator. 
//the following note applies to self-managed connected only ifdef::self-managed[] ifndef::disconnected[] [IMPORTANT] ==== -If you are using {productname-short} in a disconnected self-managed environment, see link:{rhoaidocshome}{default-format-url}/installing_and_uninstalling_{url-productname-short}_in_a_disconnected_environment/enabling-gpu-support_install[Enabling GPU support in {productname-short}] instead. +If you are using {productname-short} in a disconnected self-managed environment, see link:{rhoaidocshome}{default-format-url}/installing_and_uninstalling_{url-productname-short}_in_a_disconnected_environment/enabling-nvidia-gpus_install[Enabling NVIDIA GPUs] instead. ==== endif::[] endif::[] diff --git a/modules/nvidia-gpu-integration.adoc b/modules/nvidia-gpu-integration.adoc new file mode 100644 index 00000000..ab361cea --- /dev/null +++ b/modules/nvidia-gpu-integration.adoc @@ -0,0 +1,10 @@ +:_module-type: CONCEPT + +[id='nvidia-gpu-integration_{context}'] += NVIDIA GPU integration + +[role='_abstract'] +//Module to be populated later. + + + diff --git a/modules/starting-a-jupyter-notebook-server.adoc b/modules/starting-a-jupyter-notebook-server.adoc index 793559f0..1b5f234a 100644 --- a/modules/starting-a-jupyter-notebook-server.adoc +++ b/modules/starting-a-jupyter-notebook-server.adoc @@ -54,7 +54,7 @@ ifdef::upstream[] Using accelerators is only supported with specific notebook images. For GPUs, only the PyTorch, TensorFlow, and CUDA notebook images are supported. For Intel Gaudi AI accelerators, only the HabanaAI notebook image is supported. In addition, you can only specify the number of accelerators required for your notebook server if accelerators are enabled on your cluster. endif::[] ifndef::upstream[] -Using accelerators is only supported with specific notebook images. For GPUs, only the PyTorch, TensorFlow, and CUDA notebook images are supported. For Intel Gaudi AI accelerators, only the HabanaAI notebook image is supported. 
In addition, you can only specify the number of accelerators required for your notebook server if accelerators are enabled on your cluster. To learn how to enable GPU support, see link:{rhoaidocshome}{default-format-url}/managing_resources/managing-cluster-resources_cluster-mgmt#enabling-gpu-support_cluster-mgmt[Enabling GPU support in {productname-short}]. +Using accelerators is only supported with specific notebook images. For GPUs, only the PyTorch, TensorFlow, and CUDA notebook images are supported. For Intel Gaudi AI accelerators, only the HabanaAI notebook image is supported. In addition, you can only specify the number of accelerators required for your notebook server if accelerators are enabled on your cluster. To learn how to enable accelerator support, see link:{rhoaidocshome}{default-format-url}/working_with_accelerators/#overview-of-accelerators_accelerators[Working with accelerators]. endif::[] -- .. Optional: Select and specify values for any new *Environment variables*. diff --git a/working-with-accelerators.adoc b/working-with-accelerators.adoc index b6909ce1..703a016d 100644 --- a/working-with-accelerators.adoc +++ b/working-with-accelerators.adoc @@ -18,8 +18,21 @@ include::_artifacts/document-attributes-global.adoc[] Use accelerators, such as NVIDIA GPUs and Intel Gaudi AI accelerators, to optimize the performance of your end-to-end data science workflows.
+//Overview of accelerators include::modules/overview-of-accelerators.adoc[leveloffset=+1] +//Specific partner content +//NVIDIA GPUs +//include::modules/nvidia-gpu-integration.adoc[leveloffset=+1] + +include::modules/enabling-nvidia-gpus.adoc[leveloffset=+1] + +//Intel Gaudi AI accelerators +include::modules/intel-gaudi-ai-accelerator-integration.adoc[leveloffset=+1] + +include::modules/enabling-intel-gaudi-ai-accelerators.adoc[leveloffset=+2] + +//Using accelerator profiles include::modules/working-with-accelerator-profiles.adoc[leveloffset=+1] include::modules/creating-an-accelerator-profile.adoc[leveloffset=+2] @@ -34,7 +47,5 @@ include::modules/configuring-a-recommended-accelerator-for-notebook-images.adoc[ include::modules/configuring-a-recommended-accelerator-for-serving-runtimes.adoc[leveloffset=+2] -include::modules/intel-gaudi-ai-accelerator-integration.adoc[leveloffset=+1] -include::modules/enabling-intel-gaudi-ai-accelerators.adoc[leveloffset=+2]
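Reviewer note: the dashboard procedure removed from `enabling-intel-gaudi-ai-accelerators.adoc` (name, identifier, toleration operator/effect, and so on) corresponds to fields on the dashboard's `AcceleratorProfile` custom resource, which the linked "Working with accelerator profiles" content now covers. As a minimal sketch only, not part of this patch: field names follow the `dashboard.opendatahub.io/v1` CRD, and the `metadata` name and namespace here are placeholder assumptions. A Gaudi profile with the `habana.ai/gaudi` identifier and an `Exists`/`NoSchedule` toleration described in the removed steps might look like:

```yaml
apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: gaudi            # placeholder name
  namespace: redhat-ods-applications   # assumed downstream namespace
spec:
  displayName: Intel Gaudi AI Accelerator
  enabled: true
  identifier: habana.ai/gaudi   # resource name advertised by the HabanaAI Operator
  tolerations:
    - key: habana.ai/gaudi
      operator: Exists          # matches any value for the key
      effect: NoSchedule        # pods without this toleration are not scheduled on tainted nodes
```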