Fast GPU Provisioning Technology

Introduction

Fast GPU Provisioning technology enables GPU provisioning in less than 1 second, with no reboots, by using pre-built driver containers. The feature eliminates any dependency on machine configuration changes that trigger a reboot, which is an expensive operation. Instead, the required operations are performed at runtime, which simplifies and accelerates the deployment process.

Details

To achieve this, the 1.2.1 release leverages two Kernel Module Management (KMM) features. First, KMM sets the firmware search path on the fly, which is required to load the out-of-tree firmware binaries on RHCOS: it writes the alternative firmware search path to sysfs right before loading the out-of-tree drivers. Second, KMM removes the in-tree intel_vsec driver on the fly, which is required before the out-of-tree equivalent can be loaded. Previously, both operations required a machine configuration change, which triggered unnecessary reboots.
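
The two runtime operations above can be illustrated with a minimal Python sketch. It is not KMM's implementation, only a model of the steps the text describes. The sysfs file /sys/module/firmware_class/parameters/path is the standard kernel parameter for an extra firmware search path; the firmware directory /var/lib/firmware, the out-of-tree module name example_oot_module, and the relative order of the first two steps are assumptions made for illustration.

```python
from pathlib import Path

# Kernel parameter file that holds the extra firmware search path,
# relative to the filesystem root.
FIRMWARE_PATH_PARAM = "sys/module/firmware_class/parameters/path"


def plan_runtime_steps(firmware_dir, in_tree_module, out_of_tree_module):
    """Return the ordered runtime operations without executing them.

    The out-of-tree module is loaded last, after the firmware path is set
    and the in-tree module is removed. The order of the first two steps
    is an assumption of this sketch.
    """
    return [
        ("write", "/" + FIRMWARE_PATH_PARAM, firmware_dir),
        ("run", ["modprobe", "-r", in_tree_module]),
        ("run", ["modprobe", out_of_tree_module]),
    ]


def set_firmware_search_path(firmware_dir, root="/"):
    """Write firmware_dir to the firmware_class path parameter under root.

    The root argument exists only so the function can be exercised against
    a scratch directory; on a real node it would be "/" and the write
    would require root privileges.
    """
    param = Path(root) / FIRMWARE_PATH_PARAM
    param.write_text(firmware_dir)
    return param


if __name__ == "__main__":
    # Hypothetical values: the directory and the out-of-tree module name
    # are placeholders, not taken from the project.
    for step in plan_runtime_steps(
        "/var/lib/firmware", "intel_vsec", "example_oot_module"
    ):
        print(step)
```

Because every step is a sysfs write or a module load or unload, nothing in the sequence needs to persist across a boot, which is why no machine configuration change or reboot is involved.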

Impact

A reboot is a costly operation that adds several minutes to the provisioning process. In many cases, especially in production after Day 2, rebooting is not an option. By performing configuration changes at runtime, this feature reduces GPU provisioning time from minutes to seconds and requires no reboots. It is especially useful for single-node OpenShift (SNO) clusters, where it avoids SNO downtime.

Intel Technology Enabling for OpenShift Architecture and Working Scope

The Intel Technology Enabling for OpenShift project provides Intel Data Center hardware feature-provisioning technologies with the Red Hat OpenShift Container Platform (RHOCP). The project also includes the technology to deploy and manage Intel Enterprise AI End-to-End (E2E) solutions, along with the related reference workloads for these features.