publish from a016ccad (gopub)
sysxpum committed Dec 25, 2023
1 parent e97df1b commit dcc3345
Showing 46 changed files with 846 additions and 552 deletions.
15 changes: 8 additions & 7 deletions README.md
Expand Up @@ -3,7 +3,7 @@ Intel(R) XPU Manager is a free and open-source tool for monitoring and managing

It is designed to simplify administration, maximize reliability and uptime, and improve utilization.

XPU Manager can be used standalone through its command line interface (CLI) to manage GPUs locally, or through its RESTful APIs to manage GPUs remotely. Intel(R) XPU System Management Interface (XPU-SMI) is the daemon-less version of XPU Manager and it only provides the local interface. XPU-SMI feature scope is the subset of XPU Manager. Their features are listed in the table below. Please note that XPU-SMI has been included in the GPU driver repository. If you want to use XPU Manager, please uninstall XPU-SMI and install XPU Manager.
XPU Manager can be used standalone through its command line interface (CLI) to manage GPUs locally, or through its RESTful APIs to manage GPUs remotely. Intel(R) XPU System Management Interface (XPU-SMI) is the daemon-less version of XPU Manager and it only provides the local interface. XPU-SMI feature scope is the subset of XPU Manager. Their features are listed in the table below. Please note that XPU-SMI and XPU Manager cannot be installed or run on the same system because of resource conflicts. XPU-SMI has been included in the GPU driver repository. If you want to use XPU Manager, please uninstall XPU-SMI and install XPU Manager.

amcmcli is a portable CLI tool to manage GPU AMC firmware on Linux OS. It is independent of GPU driver.

Expand Down Expand Up @@ -74,8 +74,8 @@ Update firmware successfully.
```


## Feature set of XPU Manager, XPU-SMI and Windows CLI tool
| | XPU Manager | XPU-SMI | Windows CLI tool | amcmcli |
## Feature set of XPU Manager, XPU-SMI and XPU-SMI Windows CLI tool
| | XPU Manager | XPU-SMI | XPU-SMI Windows CLI | amcmcli |
| :------------------------ | :--------------------: | :------------------: | :--------------------------: | :-------------: |
| Device Info and Topology | Yes | Yes | Yes | No |
| GPU Telemetries | Yes (aggregated data) | Yes (real-time data) | Yes (real-time data) | No |
Expand All @@ -98,16 +98,16 @@ You may get the latest installers or binaries in [Releases](https://github.com/i
## Supported OSes
* XPU Manager
* Ubuntu 20.04.3/22.04
* RHEL 8.5/8.6/8.8/9.2
* RHEL 8.8/9.2
* CentOS 8/9 Stream
* CentOS 7.4/7.9
* SLES 15 SP3/SP4
* SLES 15 SP4/SP5
* XPU-SMI
* Ubuntu 20.04.3/22.04
* RHEL 8.5/8.6/8.8/9.2
* RHEL 8.8/9.2
* CentOS 8/9 Stream
* CentOS 7.4/7.9
* SLES 15 SP3/SP4
* SLES 15 SP4/SP5
* Debian 10.13
* Windows Server 2019/2022 (limited features including: GPU device info, GPU telemetry, GPU firmware update and GPU configuration)

Expand All @@ -121,6 +121,7 @@ You may get the latest installers or binaries in [Releases](https://github.com/i
* Refer to [DockerHub](https://hub.docker.com/r/intel/xpumanager) for a Docker container image that can be used as a Prometheus exporter in a Kubernetes environment.
* Refer to [Building XPU Manager Installer](BUILDING.md) to build XPU Manager installer packages.
* Refer to [XPU Manager/XPU-SMI API documents](https://intel.github.io/xpumanager/smi_index.html) to integrate the library or RESTful interface.
* Watch a short introduction video on [YouTube](https://www.youtube.com/watch?v=1bKeqlriDX0).

## Architecture
![XPU Manager Architecture](doc/img/architecture.PNG)
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
1.2.25
1.2.26
11 changes: 8 additions & 3 deletions cli/src/comlet_diagnostic.cpp
Expand Up @@ -187,8 +187,7 @@ Alternatively the strings \"yesterday\", \"today\" are also understood.\n\
Relative times also may be specified, prefixed with \"-\" referring to times before the current time.\n\
Scanning would start from the latest boot if it is not specified.");

auto singleTestIdList = addOption("--singletest", this->opts->singleTestIdList,
"Selectively run some particular tests. Separated by the comma.\n\
std::string singleTestIdListDesc = "Selectively run some particular tests. Separated by the comma.\n\
1. Computation\n\
2. Memory Error\n\
3. Memory Bandwidth\n\
Expand All @@ -197,7 +196,13 @@ Scanning would start from the latest boot if it is not specified.");
6. Power\n\
7. Computation functional test\n\
8. Media Codec functional test\n\
9. Xe Link Throughput");
9. Xe Link Throughput";
#ifdef DAEMONLESS
singleTestIdListDesc += "\nNote that in a multi NUMA node server, it may need to use numactl to specify which node the PCIe bandwidth test runs on.\n\
Usage: numactl [ --membind nodes ] [ --cpunodebind nodes ] xpu-smi diag -d [deviceId] --singletest 5\n\
It also applies to diag level tests.";
#endif
auto singleTestIdList = addOption("--singletest", this->opts->singleTestIdList, singleTestIdListDesc);
singleTestIdList->delimiter(',');
singleTestIdList->check(CLI::Range(1, (int)testIdToType.size()));

4 changes: 2 additions & 2 deletions cli/src/comlet_dump.cpp
Expand Up @@ -173,10 +173,10 @@ bool ComletDump::dumpIdlePowerOnly() {
auto bdf_pos = uevent.find(bdf_key);
if (bdf_pos != std::string::npos) {
auto device_id = uevent.substr(pos + key.length(), 4);
if (device_id.compare(0, 3, "0BD") == 0 || device_id.compare(0, 3, "0BE") == 0) {
if (device_id.compare(0, 3, "0BD") == 0 || device_id.compare(0, 3, "0BE") == 0 || device_id.compare(0, 3, "0B6") == 0) {
auto bdf = uevent.substr(bdf_pos + bdf_key.length(), 12);
gpu_bdfs.insert(bdf);
if (device_id.compare("0BD9") == 0 || device_id.compare("0BDA") == 0 || device_id.compare("0BDB") == 0) {
if (device_id.compare("0BD9") == 0 || device_id.compare("0BDA") == 0 || device_id.compare("0BDB") == 0 || device_id.compare("0B6E") == 0) {
gpu_bdf_to_tile_num[bdf] = 1;
} else {
gpu_bdf_to_tile_num[bdf] = 2;
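
The device ID checks in this hunk can be sketched in isolation as follows. This is a minimal illustration, not the project's code: the helper names `isIdlePowerGpu` and `tileCountFor` are hypothetical, and only the prefixes and IDs visible in the new lines of the hunk are used.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of the commit): true when the 4-character
// PCI device ID starts with one of the prefixes accepted by the new check.
bool isIdlePowerGpu(const std::string& deviceId) {
    // compare(0, 3, s) compares the first three characters of deviceId with s.
    return deviceId.compare(0, 3, "0BD") == 0 ||
           deviceId.compare(0, 3, "0BE") == 0 ||
           deviceId.compare(0, 3, "0B6") == 0;
}

// Hypothetical helper: the IDs listed in the hunk map to one tile,
// every other accepted ID maps to two tiles.
int tileCountFor(const std::string& deviceId) {
    if (deviceId == "0BD9" || deviceId == "0BDA" ||
        deviceId == "0BDB" || deviceId == "0B6E") {
        return 1;
    }
    return 2;
}
```

With these helpers, `0B6E` is accepted and treated as a single-tile device, while an ID such as `0BD5` is accepted and treated as a two-tile device.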
5 changes: 4 additions & 1 deletion cli/src/grpc_stub/grpc_core_stub.cpp
Expand Up @@ -7,6 +7,7 @@
#include "core_stub.h"

#include <grpc++/grpc++.h>
#include <grpc/impl/codegen/grpc_types.h>

#include <cassert>
#include <chrono>
Expand Down Expand Up @@ -35,7 +36,9 @@ GrpcCoreStub::GrpcCoreStub(bool priv) {
}
std::string unixSockName{unixSockDir + (priv ? "xpum_p.sock" : "xpum_up.sock")};
std::string serverAddr{"unix://" + unixSockName};
this->channel = grpc::CreateChannel(serverAddr, grpc::InsecureChannelCredentials());
grpc::ChannelArguments args;
args.SetInt(GRPC_ARG_ENABLE_HTTP_PROXY, 0);
this->channel = grpc::CreateCustomChannel(serverAddr, grpc::InsecureChannelCredentials(), args);
this->stub = XpumCoreService::NewStub(this->channel);
}

4 changes: 3 additions & 1 deletion cli/src/local_functions.cpp
Expand Up @@ -523,7 +523,9 @@ std::unique_ptr<nlohmann::json> addKernelParam() {
* Refer: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/ch-working_with_the_grub_2_boot_loader
*/
cmdStr = "grub2-mkconfig -o /boot/efi/EFI/rhel/grub.cfg";
}
} else if (osRelease == LINUX_OS_RELEASE_OPEN_EULER && isFileExists("/boot/efi/EFI/openEuler/grub.cfg")) {
cmdStr = "grub2-mkconfig -o /boot/efi/EFI/openEuler/grub.cfg";
}
if (execCommand(cmdStr, cmdRes) != 0) {
(*json)["error"] = "Fail to update grub.";
(*json)["errno"] = XPUM_CLI_ERROR_VGPU_ADD_KERNEL_PARAM_FAILED;
2 changes: 2 additions & 0 deletions cli/src/utility.cpp
Expand Up @@ -139,6 +139,8 @@ linux_os_release_t getOsRelease() {
return LINUX_OS_RELEASE_RHEL;
} else if (value.find("debian") != std::string::npos) {
return LINUX_OS_RELEASE_DEBIAN;
} else if (value.find("openEuler") != std::string::npos) {
return LINUX_OS_RELEASE_OPEN_EULER;
} else {
return LINUX_OS_RELEASE_UNKNOWN;
}
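
The substring-based detection that this hunk extends can be sketched as below. It is an illustration under assumptions: the enum `OsRelease` and the function `classifyOsRelease` are hypothetical stand-ins for `linux_os_release_t` and `getOsRelease()`, and only the two conditions visible in the hunk are reproduced.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for linux_os_release_t, reduced to the cases
// whose match conditions are visible in the hunk.
enum class OsRelease { Debian, OpenEuler, Unknown };

// Classify a line of OS identification text by substring search,
// mirroring the value.find(...) != std::string::npos pattern.
OsRelease classifyOsRelease(const std::string& value) {
    if (value.find("debian") != std::string::npos) {
        return OsRelease::Debian;
    } else if (value.find("openEuler") != std::string::npos) {
        return OsRelease::OpenEuler;
    }
    return OsRelease::Unknown;
}
```

Note that `std::string::find` is case-sensitive, so in this sketch a lowercase `openeuler` would fall through to the unknown case.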
1 change: 1 addition & 0 deletions cli/src/utility.h
Expand Up @@ -17,6 +17,7 @@ typedef enum linux_os_release_t {
LINUX_OS_RELEASE_SLES,
LINUX_OS_RELEASE_RHEL,
LINUX_OS_RELEASE_DEBIAN,
LINUX_OS_RELEASE_OPEN_EULER,
LINUX_OS_RELEASE_UNKNOWN,
} linux_os_release_t;

34 changes: 31 additions & 3 deletions core/resources/config/diagnostics.conf
Expand Up @@ -57,20 +57,41 @@ SINGLE_PRECISION_MIN_GFLOPS = 8000 # GFLOPS
POWER_MIN_STRESS_WATT = 80 # W
MEMORY_BANDWIDTH_MIN_GBPS = 320 # GBPS

# PVC 2T 2048EU 128GB
# PVC 2T 1024EU 128GB
NAME = Intel(R) Graphics [0x0bd4]
PCIE_BANDWIDTH_MIN_GBPS = 22 # GBPS
SINGLE_PRECISION_MIN_GFLOPS = 27000 # GFLOPS
POWER_MIN_STRESS_WATT = 240 # W
MEMORY_BANDWIDTH_MIN_GBPS = 1000 # GBPS

# PVC 2T 1024EU 128GB
NAME = Intel(R) Graphics [0x0bd5]
PCIE_BANDWIDTH_MIN_GBPS = 22 # GBPS
SINGLE_PRECISION_MIN_GFLOPS = 27000 # GFLOPS
POWER_MIN_STRESS_WATT = 240 # W
MEMORY_BANDWIDTH_MIN_GBPS = 1000 # GBPS

# PVC 2T 2048EU 128GB
# PVC 2T 1024EU 128GB
NAME = Intel(R) Graphics [0x0bd6]
PCIE_BANDWIDTH_MIN_GBPS = 32 # GBPS
SINGLE_PRECISION_MIN_GFLOPS = 35000 # GFLOPS
POWER_MIN_STRESS_WATT = 320 # W
MEMORY_BANDWIDTH_MIN_GBPS = 1600 # GBPS

# PVC 2T 896EU 96GB
NAME = Intel(R) Graphics [0x0bd7]
PCIE_BANDWIDTH_MIN_GBPS = 30 # GBPS
SINGLE_PRECISION_MIN_GFLOPS = 35000 # GFLOPS
POWER_MIN_STRESS_WATT = 360 # W
MEMORY_BANDWIDTH_MIN_GBPS = 1100 # GBPS

# PVC 2T 768EU 96GB
NAME = Intel(R) Graphics [0x0bd8]
PCIE_BANDWIDTH_MIN_GBPS = 30 # GBPS
SINGLE_PRECISION_MIN_GFLOPS = 30000 # GFLOPS
POWER_MIN_STRESS_WATT = 360 # W
MEMORY_BANDWIDTH_MIN_GBPS = 1100 # GBPS

# PVC 1T 448EU 48GB
NAME = Intel(R) Graphics [0x0bdb]
PCIE_BANDWIDTH_MIN_GBPS = 32 # GBPS
Expand All @@ -85,7 +106,14 @@ SINGLE_PRECISION_MIN_GFLOPS = 16000 # GFLOPS
POWER_MIN_STRESS_WATT = 240 # W
MEMORY_BANDWIDTH_MIN_GBPS = 560 # GBPS

# PVC 2T 2048EU 128GB
# PVC 1T 448EU 48GB
NAME = Intel(R) Graphics [0x0b6e]
PCIE_BANDWIDTH_MIN_GBPS = 28 # GBPS
SINGLE_PRECISION_MIN_GFLOPS = 15000 # GFLOPS
POWER_MIN_STRESS_WATT = 220 # W
MEMORY_BANDWIDTH_MIN_GBPS = 520 # GBPS

# PVC 2T 1024EU 128GB
NAME = Intel(R) Graphics [0x0b69]
PCIE_BANDWIDTH_MIN_GBPS = 36 # GBPS
SINGLE_PRECISION_MIN_GFLOPS = 42000 # GFLOPS
36 changes: 18 additions & 18 deletions core/resources/config/vgpu.conf
Expand Up @@ -161,7 +161,7 @@ SCHED_IF_IDLE=0 #Same as default configuration
DRIVERS_AUTOPROBE=0 #Same as default configuration


NAME=0bd5N1,0bd6N1
NAME=0bd5N1,0bd6N1,0bd4N1
VF_LMEM=128849018880
VF_LMEM_ECC=128849018880
VF_CONTEXTS=2048
Expand All @@ -174,7 +174,7 @@ PF_PREEMPT_TIMEOUT=128000
SCHED_IF_IDLE=0
DRIVERS_AUTOPROBE=0

NAME=0bd5N2,0bd6N2
NAME=0bd5N2,0bd6N2,0bd4N2
VF_LMEM=64424509440
VF_LMEM_ECC=64424509440
VF_CONTEXTS=1024
Expand All @@ -187,7 +187,7 @@ PF_PREEMPT_TIMEOUT=128000
SCHED_IF_IDLE=0
DRIVERS_AUTOPROBE=0

NAME=0bd5N4,0bd6N4
NAME=0bd5N4,0bd6N4,0bd4N4
VF_LMEM=32212254720
VF_LMEM_ECC=32212254720
VF_CONTEXTS=1024
Expand All @@ -200,7 +200,7 @@ PF_PREEMPT_TIMEOUT=128000
SCHED_IF_IDLE=0
DRIVERS_AUTOPROBE=0

NAME=0bd5N6,0bd6N6
NAME=0bd5N6,0bd6N6,0bd4N6
VF_LMEM=21474836480
VF_LMEM_ECC=21474836480
VF_CONTEXTS=1024
Expand All @@ -213,7 +213,7 @@ PF_PREEMPT_TIMEOUT=128000
SCHED_IF_IDLE=0
DRIVERS_AUTOPROBE=0

NAME=0bd5N8,0bd6N8
NAME=0bd5N8,0bd6N8,0bd4N8
VF_LMEM=16106127360
VF_LMEM_ECC=16106127360
VF_CONTEXTS=1024
Expand All @@ -226,7 +226,7 @@ PF_PREEMPT_TIMEOUT=128000
SCHED_IF_IDLE=0
DRIVERS_AUTOPROBE=0

NAME=0bd5N16,0bd6N16
NAME=0bd5N16,0bd6N16,0bd4N16
VF_LMEM=8053063680
VF_LMEM_ECC=8053063680
VF_CONTEXTS=1024
Expand All @@ -239,7 +239,7 @@ PF_PREEMPT_TIMEOUT=16000
SCHED_IF_IDLE=1
DRIVERS_AUTOPROBE=0

NAME=0bd5N32,0bd6N32
NAME=0bd5N32,0bd6N32,0bd4N32
VF_LMEM=4026531840
VF_LMEM_ECC=4026531840
VF_CONTEXTS=1024
Expand All @@ -252,7 +252,7 @@ PF_PREEMPT_TIMEOUT=8000
SCHED_IF_IDLE=1
DRIVERS_AUTOPROBE=0

NAME=0bd5N62,0bd6N62
NAME=0bd5N62,0bd6N62,0bd4N62
VF_LMEM=2013265920
VF_LMEM_ECC=2013265920
VF_CONTEXTS=1024
Expand All @@ -266,7 +266,7 @@ SCHED_IF_IDLE=1
DRIVERS_AUTOPROBE=0

# If a specified number of VGPU is not in this file, the configuration below would be used.
NAME=0bd5DEF,0bd6DEF
NAME=0bd5DEF,0bd6DEF,0bd4DEF
VF_LMEM=128849018880 #Equally divided to each vgpu
VF_LMEM_ECC=128849018880 #Equally divided to each vgpu
VF_CONTEXTS=1024 #Same as default configuration
Expand All @@ -280,7 +280,7 @@ SCHED_IF_IDLE=1 #Same as default configuration
DRIVERS_AUTOPROBE=0 #Same as default configuration


NAME=0bdaN1,0bdbN1
NAME=0bdaN1,0bdbN1,0b6eN1
VF_LMEM=47244640256
VF_LMEM_ECC=47244640256
VF_CONTEXTS=1024
Expand All @@ -293,7 +293,7 @@ PF_PREEMPT_TIMEOUT=128000
SCHED_IF_IDLE=0
DRIVERS_AUTOPROBE=0

NAME=0bdaN2,0bdbN2
NAME=0bdaN2,0bdbN2,0b6eN2
VF_LMEM=23622320128
VF_LMEM_ECC=23622320128
VF_CONTEXTS=1024
Expand All @@ -306,7 +306,7 @@ PF_PREEMPT_TIMEOUT=128000
SCHED_IF_IDLE=0
DRIVERS_AUTOPROBE=0

NAME=0bdaN3,0bdbN3
NAME=0bdaN3,0bdbN3,0b6eN3
VF_LMEM=15748213418
VF_LMEM_ECC=15748213418
VF_CONTEXTS=1024
Expand All @@ -319,7 +319,7 @@ PF_PREEMPT_TIMEOUT=128000
SCHED_IF_IDLE=0
DRIVERS_AUTOPROBE=0

NAME=0bdaN4,0bdbN4
NAME=0bdaN4,0bdbN4,0b6eN4
VF_LMEM=11811160064
VF_LMEM_ECC=11811160064
VF_CONTEXTS=1024
Expand All @@ -332,7 +332,7 @@ PF_PREEMPT_TIMEOUT=128000
SCHED_IF_IDLE=0
DRIVERS_AUTOPROBE=0

NAME=0bdaN8,0bdbN8
NAME=0bdaN8,0bdbN8,0b6eN8
VF_LMEM=5905580032
VF_LMEM_ECC=5905580032
VF_CONTEXTS=1024
Expand All @@ -345,7 +345,7 @@ PF_PREEMPT_TIMEOUT=16000
SCHED_IF_IDLE=1
DRIVERS_AUTOPROBE=0

NAME=0bdaN16,0bdbN16
NAME=0bdaN16,0bdbN16,0b6eN16
VF_LMEM=2952790016
VF_LMEM_ECC=2952790016
VF_CONTEXTS=1024
Expand All @@ -358,7 +358,7 @@ PF_PREEMPT_TIMEOUT=8000
SCHED_IF_IDLE=1
DRIVERS_AUTOPROBE=0

NAME=0bdaN32,0bdbN32
NAME=0bdaN32,0bdbN32,0b6eN32
VF_LMEM=1476395008
VF_LMEM_ECC=1476395008
VF_CONTEXTS=1024
Expand All @@ -371,7 +371,7 @@ PF_PREEMPT_TIMEOUT=4000
SCHED_IF_IDLE=1
DRIVERS_AUTOPROBE=0

NAME=0bdaN63,0bdbN63
NAME=0bdaN63,0bdbN63,0b6eN63
VF_LMEM=738197504
VF_LMEM_ECC=738197504
VF_CONTEXTS=1024
Expand All @@ -385,7 +385,7 @@ SCHED_IF_IDLE=1
DRIVERS_AUTOPROBE=0

# If a specified number of VGPU is not in this file, the configuration below would be used.
NAME=0bdaDEF,0bdbDEF
NAME=0bdaDEF,0bdbDEF,0b6eDEF
VF_LMEM=47244640256 #Equally divided to each vgpu
VF_LMEM_ECC=47244640256 #Equally divided to each vgpu
VF_CONTEXTS=1024 #Same as default configuration
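
The `DEF` entries describe `VF_LMEM` as "Equally divided to each vgpu". The arithmetic can be sketched as below, assuming plain truncating integer division; how the tool actually performs the split is not shown in this diff, and the helper name `lmemPerVgpu` is hypothetical. Several listed values are consistent with this reading: 128849018880 / 2 = 64424509440, and 47244640256 / 3 truncates to 15748213418. The `N62` and `N63` entries instead match a split by 64 (2013265920 and 738197504 respectively).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes of local memory per virtual GPU when the
// total is split evenly. Integer division truncates any remainder.
// Returns 0 for a count of 0 to avoid dividing by zero.
constexpr std::uint64_t lmemPerVgpu(std::uint64_t totalBytes,
                                    std::uint32_t vgpuCount) {
    return vgpuCount == 0 ? 0 : totalBytes / vgpuCount;
}
```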
2 changes: 2 additions & 0 deletions core/src/api/device_model.cpp
Expand Up @@ -17,6 +17,7 @@ int getDeviceModelByPciDeviceId(int deviceId) {
return XPUM_DEVICE_MODEL_ATS_M_3;
case 0x0b69:
case 0x0bd0:
case 0x0bd4:
case 0x0bd5:
case 0x0bd6:
case 0x0bd7:
Expand All @@ -25,6 +26,7 @@ int getDeviceModelByPciDeviceId(int deviceId) {
case 0x0bda:
case 0x0bdb:
case 0x0be5:
case 0x0b6e:
return XPUM_DEVICE_MODEL_PVC;
case 0x4907:
return XPUM_DEVICE_MODEL_SG1;
Expand Down
5 changes: 5 additions & 0 deletions core/src/data_logic/engine_group_utilization_data_handler.cpp
Expand Up @@ -50,6 +50,11 @@ void EngineGroupUtilizationDataHandler::calculateData(std::shared_ptr<SharedData
auto pre_extended = pre_data->second->getExtendedDatas()->find(extended_data->first);
if (pre_extended != pre_data->second->getExtendedDatas()->end()) {
if (extended_data->second.type == ZES_ENGINE_GROUP_COMPUTE_ALL || extended_data->second.type == ZES_ENGINE_GROUP_RENDER_ALL || extended_data->second.type == ZES_ENGINE_GROUP_MEDIA_ALL || extended_data->second.type == ZES_ENGINE_GROUP_COPY_ALL || extended_data->second.type == ZES_ENGINE_GROUP_3D_ALL) {
if (extended_data->second.timestamp ==
pre_extended->second.timestamp) {
++extended_data;
continue;
}
uint64_t val = Configuration::DEFAULT_MEASUREMENT_DATA_SCALE * 100 * (extended_data->second.active_time - pre_extended->second.active_time) / (extended_data->second.timestamp - pre_extended->second.timestamp);
if (val > Configuration::DEFAULT_MEASUREMENT_DATA_SCALE * 100) {
val = Configuration::DEFAULT_MEASUREMENT_DATA_SCALE * 100;
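
The new timestamp check skips a sample whose timestamp equals the previous one, which would otherwise make the divisor zero. A standalone sketch of that guard plus the clamp follows. The struct `EngineSample`, the function `engineUtilization`, and the constant `kScale` (a stand-in for `Configuration::DEFAULT_MEASUREMENT_DATA_SCALE`, whose real value is not shown in this diff) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sample type: cumulative active time and the time it was read.
struct EngineSample {
    std::uint64_t activeTime;
    std::uint64_t timestamp;
};

// Assumed scale factor; the project's actual constant may differ.
constexpr std::uint64_t kScale = 100;

// Computes scaled utilization in percent between two samples.
// Returns false (leaving out untouched) when the timestamps are equal,
// because the time delta would be zero and the division undefined.
bool engineUtilization(const EngineSample& prev, const EngineSample& cur,
                       std::uint64_t& out) {
    if (cur.timestamp == prev.timestamp) {
        return false;
    }
    std::uint64_t val = kScale * 100 * (cur.activeTime - prev.activeTime) /
                        (cur.timestamp - prev.timestamp);
    // Clamp to 100% at the chosen scale, as the original code does.
    if (val > kScale * 100) {
        val = kScale * 100;
    }
    out = val;
    return true;
}
```

Like the original expression, this sketch assumes counters that do not decrease between samples; a decreasing counter would wrap around in unsigned arithmetic.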