- Intel GPU driver (GPU Driver Installation Guides). XPU-SMI is included in the GPU driver repository. If you want to use XPU Manager, please uninstall xpu-smi and install XPU Manager.
- Intel(R) Graphics Compute Runtime for oneAPI Level Zero (intel-level-zero-gpu and level-zero in package repositories)
- Intel(R) Graphics System Controller Firmware Update Library (intel-gsc in package repositories)
- Intel(R) Media Driver (intel-media-va-driver-non-free or intel-media in package repositories)
- Intel(R) Media SDK Utilities (libmfx-tools or intel-mediasdk-utils in package repositories)
- Intel(R) oneVPL GPU Runtime (libmfxgen1 in package repositories)
- Intel(R) Metrics Library for MDAPI (intel-metrics-library or libigdml1 in package repositories)
- Intel(R) Metrics Discovery Application Programming Interface (intel-metrics-discovery or libmd1 in package repositories)
intel-metrics-library (libigdml1) and intel-metrics-discovery (libmd1) are optional. You may pass a parameter such as "--force-all" to skip them when installing Intel(R) XPU Manager.
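Before installing, it can help to see which of the packages listed above are still missing. The sketch below is one possible check and is not part of XPU Manager: `missing_packages` is a hypothetical helper that takes a query command (for example `dpkg -s` on Debian-based systems or `rpm -q` on RPM-based ones) followed by package names, and prints each name the query command does not report as installed. Package names differ between distributions, so adjust the list to match yours.

```shell
#!/bin/sh
# Hypothetical helper: print every package that the given query command
# does not report as installed.
#   $1      query command, e.g. "dpkg -s" or "rpm -q"
#   $2...   package names to check
missing_packages() {
    checker=$1
    shift
    for pkg in "$@"; do
        # $checker is intentionally unquoted so "dpkg -s" splits into two words.
        $checker "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}

# Example (Debian/Ubuntu package names taken from the list above):
#   missing_packages "dpkg -s" intel-level-zero-gpu level-zero intel-gsc
```

An empty output means every listed package was found by the query command.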
After adding the repository and installing the required kernel and run-time packages as described in the GPU Driver Installation Guides, download the latest installer package from the GitHub release page and run the apt command below to install XPU Manager and its required dependencies.
sudo apt install ./xpumanager.xxxxxxxx.xxxxxx.xxxxxxxx.deb
sudo dpkg -r xpumanager
After importing the repository and installing the required kernel and run-time packages as described in the GPU Driver Installation Guides, download the latest installer package from the GitHub release page and run the dnf command below to install XPU Manager and its required dependencies.
sudo dnf install xpumanager.xxxxxxxx.xxxxxx.xxxxxxxx.rpm
After importing the repository and installing the required kernel and run-time packages as described in the GPU Driver Installation Guides, download the latest installer package from the GitHub release page and run the zypper command below to install XPU Manager and its required dependencies.
sudo zypper install xpumanager.xxxxxxxx.xxxxxx.xxxxxxxx.rpm
By default, XPU Manager is installed in the folders /usr/bin, /usr/lib and /usr/lib64. The command line tool is /usr/bin/xpumcli. Please refer to "CLI_user_guide.md" for how to use the command line tool.
rpm -i --prefix=/usr/local xpumanager.xxxxxxxx.xxxxxx.xxxxxxxx.rpm
You need to set the environment variable LD_LIBRARY_PATH if you change the installation folder.
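As a minimal sketch of the LD_LIBRARY_PATH setting for a relocated install: the lines below assume the prefix /usr/local from the rpm example and assume the libraries end up in the lib and lib64 subfolders of that prefix, mirroring the default /usr/lib and /usr/lib64 layout. `XPUM_PREFIX` is a hypothetical variable name used only for illustration; verify the actual library location on your system before relying on it.

```shell
#!/bin/sh
# Hypothetical prefix, matching the --prefix=/usr/local example.
XPUM_PREFIX=/usr/local

# Prepend the relocated library folders (assumed layout: <prefix>/lib64 and
# <prefix>/lib) and keep any value LD_LIBRARY_PATH already had.
export LD_LIBRARY_PATH="$XPUM_PREFIX/lib64:$XPUM_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

The `${LD_LIBRARY_PATH:+:...}` expansion avoids a trailing colon when the variable was previously unset.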
sudo rpm -e xpumanager
By default, XPU Manager has provided as many GPU metrics as possible without changing the system settings. You may follow the steps below to collect more metrics or disable some metrics.
- Edit the file "/lib/systemd/system/xpum.service" (or "/etc/systemd/system/xpum.service" on some systems).
Add "-m metric-indexes" to ExecStart.
Use "/usr/bin/xpumd -h" to get detailed info.
Sample: ExecStart=/usr/bin/xpumd -p /var/xpum_daemon.pid -d /usr/lib/xpum/dump -m 0,4-38
- Run command "sudo systemctl daemon-reload"
- Run command "sudo systemctl restart xpum"
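The ExecStart edit in the steps above can also be scripted. The following is a sketch only, not a tool shipped with XPU Manager: `set_metric_indexes` is a hypothetical helper that prints a copy of a unit file with "-m <indexes>" appended to its ExecStart line, dropping any "-m" argument that was already there. It writes to standard output so the result can be reviewed before it replaces the real file.

```shell
#!/bin/sh
# Hypothetical helper: print a copy of a unit file with "-m <indexes>" set
# on its ExecStart line. An existing "-m ..." argument is removed first so
# the flag is not duplicated.
#   $1  path to the unit file
#   $2  metric index specification, e.g. 0,4-38
set_metric_indexes() {
    unit_file=$1
    indexes=$2
    sed -E "/^ExecStart=/ {
        s/[[:space:]]+-m[[:space:]]+[^[:space:]]+//
        s/[[:space:]]*\$/ -m ${indexes}/
    }" "$unit_file"
}

# Example: write the edited unit to a scratch file for review.
#   set_metric_indexes /lib/systemd/system/xpum.service 0,4-38 > /tmp/xpum.service
```

After copying the reviewed file back into place, the daemon-reload and restart steps above still apply.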
Metric types:
- GPU Utilization (%), GPU active time of the elapsed time, per tile
- GPU EU Array Active (%), the normalized sum of all cycles on all EUs that were spent actively executing instructions, per tile (Disabled by default)
- GPU EU Array Stall (%), the normalized sum of all cycles on all EUs during which the EUs were stalled (at least one thread is loaded, but the EU is stalled), per tile (Disabled by default)
- GPU EU Array Idle (%), the normalized sum of all cycles on all cores when no threads were scheduled on a core, per tile (Disabled by default)
- GPU Power (W), per tile
- GPU Energy Consumed (J), per tile
- GPU Frequency (MHz), per tile
- GPU Core Temperature (degrees Celsius), per tile
- GPU Memory Used (MiB)
- GPU Memory Utilization (%), per tile
- GPU Memory Bandwidth Utilization (%), per tile
- GPU Memory Read (kB), per tile
- GPU Memory Write (kB), per tile
- GPU Memory Read Throughput(kB/s), per tile
- GPU Memory Write Throughput(kB/s), per tile
- GPU Compute Engine Group Utilization (%), per tile
- GPU Media Engine Group Utilization (%), per tile
- GPU Copy Engine Group Utilization (%), per tile
- GPU Render Engine Group Utilization (%), per tile
- GPU 3D Engine Group Utilization (%), per tile
- Reset Counter, per GPU
- Programming Errors, per tile
- Driver Errors, per tile
- Cache Errors Correctable, per tile
- Cache Errors Uncorrectable, per tile
- Display Errors Correctable, per tile (Not supported so far)
- Display Errors Uncorrectable, per tile (Not supported so far)
- Memory Errors Correctable, per tile
- Memory Errors Uncorrectable, per tile
- GPU Requested Frequency, per tile
- GPU Memory Temperature, per tile
- GPU Frequency Throttle Ratio, per tile (Not supported so far)
- GPU PCIe Read Throughput (kB/s), per GPU (Disabled by default)
- GPU PCIe Write Throughput (kB/s), per GPU (Disabled by default)
- GPU PCIe Read (bytes), per GPU (Disabled by default)
- GPU PCIe Write (bytes), per GPU (Disabled by default)
- GPU Engine Utilization, per GPU engine
- Fabric Throughput (kB/s), per tile
- Throttle reason, per tile
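An index specification such as "0,4-38" is easier to check when it is written out in full. The sketch below assumes the specification is a comma-separated list of single indexes and inclusive ranges, which is an interpretation of the sample and not something confirmed here; `expand_indexes` is a hypothetical helper, and "/usr/bin/xpumd -h" remains the reference for which index selects which metric.

```shell
#!/bin/sh
# Hypothetical helper: expand a specification like "0,4-6" into "0 4 5 6".
# Assumes comma-separated entries, each a single index or an inclusive
# "start-end" range of non-negative integers.
expand_indexes() {
    spec=$1
    old_ifs=$IFS
    IFS=,
    set -- $spec          # split the specification on commas
    IFS=$old_ifs
    out=""
    for part in "$@"; do
        case $part in
            *-*) start=${part%-*}; end=${part#*-} ;;
            *)   start=$part; end=$part ;;
        esac
        i=$start
        while [ "$i" -le "$end" ]; do
            out="$out${out:+ }$i"
            i=$((i + 1))
        done
    done
    echo "$out"
}

# Example:
#   expand_indexes 0,4-38
```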
- GPU PCIe Read/Write Throughput: if these metrics are enabled, XPU Manager automatically loads the MSR module with the command 'modprobe msr', but it does not unload the module afterwards. To unload it, run the command 'modprobe -r msr'.
- We have tried our best to reduce the CPU usage of the XPU Manager daemon. If you have many GPUs (10+) and are still concerned about the daemon's CPU usage, you may disable some of the RAS-related metrics below to reduce it further.
  - Reset Counter, per GPU
  - Programming Errors, per tile
  - Driver Errors, per tile
  - Cache Errors Correctable, per tile
  - Cache Errors Uncorrectable, per tile
  - Display Errors Correctable, per tile (Not supported so far)
  - Display Errors Uncorrectable, per tile (Not supported so far)
  - Memory Errors Correctable, per tile
  - Memory Errors Uncorrectable, per tile
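Regarding the MSR module note above: before running 'modprobe -r msr' it may be useful to confirm that the module is actually loaded. The sketch below is one way to do that; `module_listed` is a hypothetical helper that reads input in the format of /proc/modules, where the module name is the first field of each line. A module built into the kernel does not appear in /proc/modules, so a negative result does not always mean MSR access is unavailable.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named module appears in
# /proc/modules-style input on stdin (module name is the first field).
#   $1  module name, e.g. msr
module_listed() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Example: unload msr only if it is currently listed as loaded.
#   if module_listed msr < /proc/modules; then
#       sudo modprobe -r msr
#   fi
```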
XPU Manager provides the GPU memory ECC on/off feature based on IGSC. The feature works with IGSC 0.8.4 and newer. If you want to use it, make sure that IGSC 0.8.4 or a newer version is installed.
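The IGSC version requirement can be checked with a small version comparison. This is a sketch under stated assumptions: `version_ge` is a hypothetical helper built on `sort -V` from GNU coreutils, and the commented example assumes a Debian-based system where IGSC is packaged as intel-gsc and the package version string begins with the IGSC version.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is greater than or equal to
# version $2. Relies on "sort -V" (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example on a Debian-based system (package name and version format are
# assumptions; adjust for your distribution):
#   installed=$(dpkg-query -W -f='${Version}' intel-gsc)
#   version_ge "$installed" 0.8.4 && echo "IGSC is new enough for ECC on/off"
```

A plain string comparison would rank 0.8.10 below 0.8.4, which is why version sort is used here.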
CentOS 7 still ships an old version of libcurl. If you need to update the AMC firmware through the Redfish host interface, follow the steps below to build and install libcurl.
yum update -y
yum install wget gcc openssl-devel make -y
wget https://curl.se/download/curl-7.56.1.tar.gz
tar xzf curl-7.56.1.tar.gz
cd curl-7.56.1
./configure --with-openssl --prefix=/usr
make
sudo make install
curl --version
The Windows GPU driver version should be 31.0.101.3902 or newer.