
Fast GPU Provisioning Technology

MartinXu edited this page Mar 21, 2024 · 1 revision

Introduction

Fast GPU Provisioning technology provisions GPUs in less than 1 second, with no reboots, using pre-built driver containers. The feature eliminates any dependency on machine configuration changes, which trigger a reboot, an expensive operation. Instead, the required operations are performed at runtime. This simplifies and accelerates the deployment process.

Details

To achieve this in the 1.2.1 release, two KMM features are leveraged. The first sets the firmware search path on the fly, which is required to load the out-of-tree firmware binaries on RHCOS: KMM writes the alternative firmware search path to sysfs right before loading the out-of-tree drivers. The second removes the in-tree intel_vsec driver on the fly, which is required before loading its out-of-tree equivalent. Previously, both operations required a machine configuration change, which triggered otherwise unnecessary reboots.
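The two runtime operations described above can be illustrated with a minimal Python sketch. This is not KMM's implementation; it only shows the kind of node-level actions involved. The sysfs parameter `/sys/module/firmware_class/parameters/path` is the standard Linux knob for an extra firmware search path, while the function names and the example directory `/var/lib/firmware` are assumptions made for illustration.

```python
from pathlib import Path

# Standard Linux sysfs parameter holding an additional firmware search path.
FIRMWARE_PATH_PARAM = Path("/sys/module/firmware_class/parameters/path")


def set_firmware_search_path(firmware_dir: str,
                             param_file: Path = FIRMWARE_PATH_PARAM) -> None:
    """Point the kernel firmware loader at an extra directory, at runtime.

    Writing this parameter takes effect immediately, so no reboot is needed.
    `firmware_dir` is a hypothetical location for the out-of-tree firmware.
    """
    param_file.write_text(firmware_dir)


def unload_in_tree_module_cmd(module: str = "intel_vsec") -> list[str]:
    """Build the command that removes an in-tree module before the
    out-of-tree equivalent is loaded. The command is returned rather than
    executed so the sketch is safe to run anywhere.
    """
    return ["modprobe", "-r", module]
```

On a real node, both steps need root privileges: the sketch would call `set_firmware_search_path("/var/lib/firmware")` and pass the result of `unload_in_tree_module_cmd()` to `subprocess.run` before loading the out-of-tree drivers. Because neither step touches machine configuration, neither triggers a reboot.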

Impact

A reboot is a costly operation that adds several minutes to the provisioning process. In many cases, especially in production after Day 2, rebooting is not an option. By performing configuration changes at runtime, this feature cuts GPU provisioning time from minutes to seconds and requires no reboots. It is especially useful for single-node OpenShift (SNO) clusters, where rebooting the only node means cluster downtime.