
[Clang] Add timeout for GPU detection utilities #94751

Merged 1 commit into llvm:main on Jun 7, 2024

Conversation

@jhuber6 (Contributor) commented on Jun 7, 2024

Summary:
The utilities nvptx-arch and amdgpu-arch are used to support
--offload-arch=native, among other features in clang. However, they
rely on the GPU drivers to query the available devices, and in certain
cases those drivers can lock up, leading to indefinite hangs in any
compiler jobs running at the time.

This patch adds a ten second timeout for these utilities, after which
the job is killed and an error is reported.

@llvmbot added the clang, backend:AMDGPU, and clang:driver labels on Jun 7, 2024
@llvmbot (Collaborator) commented on Jun 7, 2024

@llvm/pr-subscribers-clang
@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-clang-driver

Author: Joseph Huber (jhuber6)

Changes

Summary:
The utilities nvptx-arch and amdgpu-arch are used to support
--offload-arch=native, among other features in clang. However, they
rely on the GPU drivers to query the available devices, and in certain
cases those drivers can lock up, leading to indefinite hangs in any
compiler jobs running at the time.

This patch adds a ten second timeout for these utilities, after which
the job is killed and an error is reported.


Full diff: https://github.com/llvm/llvm-project/pull/94751.diff

4 Files Affected:

  • (modified) clang/include/clang/Driver/ToolChain.h (+1-1)
  • (modified) clang/lib/Driver/ToolChain.cpp (+4-4)
  • (modified) clang/lib/Driver/ToolChains/AMDGPU.cpp (+1-1)
  • (modified) clang/lib/Driver/ToolChains/Cuda.cpp (+1-1)
diff --git a/clang/include/clang/Driver/ToolChain.h b/clang/include/clang/Driver/ToolChain.h
index a4f9cad98aa8b..87a5034dfd78b 100644
--- a/clang/include/clang/Driver/ToolChain.h
+++ b/clang/include/clang/Driver/ToolChain.h
@@ -205,7 +205,7 @@ class ToolChain {
 
   /// Executes the given \p Executable and returns the stdout.
   llvm::Expected<std::unique_ptr<llvm::MemoryBuffer>>
-  executeToolChainProgram(StringRef Executable) const;
+  executeToolChainProgram(StringRef Executable, unsigned Timeout = 0) const;
 
   void setTripleEnvironment(llvm::Triple::EnvironmentType Env);
 
diff --git a/clang/lib/Driver/ToolChain.cpp b/clang/lib/Driver/ToolChain.cpp
index 0e86bc07e0ea2..8c746ac8066cb 100644
--- a/clang/lib/Driver/ToolChain.cpp
+++ b/clang/lib/Driver/ToolChain.cpp
@@ -104,7 +104,8 @@ ToolChain::ToolChain(const Driver &D, const llvm::Triple &T,
 }
 
 llvm::Expected<std::unique_ptr<llvm::MemoryBuffer>>
-ToolChain::executeToolChainProgram(StringRef Executable) const {
+ToolChain::executeToolChainProgram(StringRef Executable,
+                                   unsigned Timeout) const {
   llvm::SmallString<64> OutputFile;
   llvm::sys::fs::createTemporaryFile("toolchain-program", "txt", OutputFile);
   llvm::FileRemover OutputRemover(OutputFile.c_str());
@@ -115,9 +116,8 @@ ToolChain::executeToolChainProgram(StringRef Executable) const {
   };
 
   std::string ErrorMessage;
-  if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects,
-                                /* SecondsToWait */ 0,
-                                /*MemoryLimit*/ 0, &ErrorMessage))
+  if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects, Timeout,
+                                /*MemoryLimit=*/0, &ErrorMessage))
     return llvm::createStringError(std::error_code(),
                                    Executable + ": " + ErrorMessage);
 
diff --git a/clang/lib/Driver/ToolChains/AMDGPU.cpp b/clang/lib/Driver/ToolChains/AMDGPU.cpp
index 9ffea57b005de..92895d8186e83 100644
--- a/clang/lib/Driver/ToolChains/AMDGPU.cpp
+++ b/clang/lib/Driver/ToolChains/AMDGPU.cpp
@@ -877,7 +877,7 @@ AMDGPUToolChain::getSystemGPUArchs(const ArgList &Args) const {
   else
     Program = GetProgramPath("amdgpu-arch");
 
-  auto StdoutOrErr = executeToolChainProgram(Program);
+  auto StdoutOrErr = executeToolChainProgram(Program, /*Timeout=*/10);
   if (!StdoutOrErr)
     return StdoutOrErr.takeError();
 
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index bbc8be91fd70b..47dac0e439f10 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -826,7 +826,7 @@ NVPTXToolChain::getSystemGPUArchs(const ArgList &Args) const {
   else
     Program = GetProgramPath("nvptx-arch");
 
-  auto StdoutOrErr = executeToolChainProgram(Program);
+  auto StdoutOrErr = executeToolChainProgram(Program, /*Timeout=*/10);
   if (!StdoutOrErr)
     return StdoutOrErr.takeError();
 

@jhuber6 (Contributor, Author) commented on Jun 7, 2024

No active test because I have no idea how you would write one, but I intentionally made it time out and it returns a `Child timed out` error as expected.
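
For reference, a minimal sketch of how one might reproduce the timeout behavior by hand, assuming a host with /bin/sleep available (this is not part of the patch or its tests, just an illustration of llvm::sys::ExecuteAndWait with a non-zero SecondsToWait):

  // Hypothetical manual check: run a child that outlives SecondsToWait and
  // confirm ExecuteAndWait kills it and reports an error (the author observed
  // "Child timed out").
  #include "llvm/Support/Program.h"
  #include "llvm/Support/raw_ostream.h"
  #include <optional>
  #include <string>

  int main() {
    llvm::StringRef Sleep = "/bin/sleep"; // assumed path, illustration only
    std::string ErrMsg;
    int RC = llvm::sys::ExecuteAndWait(Sleep, {Sleep, "30"},
                                       /*Env=*/std::nullopt, /*Redirects=*/{},
                                       /*SecondsToWait=*/1, /*MemoryLimit=*/0,
                                       &ErrMsg);
    llvm::errs() << "rc=" << RC << " err=" << ErrMsg << "\n";
    return 0;
  }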

Summary:
The utilities `nvptx-arch` and `amdgpu-arch` are used to support
`--offload-arch=native`, among other features in clang. However, they
rely on the GPU drivers to query the available devices, and in certain
cases those drivers can lock up, leading to indefinite hangs in any
compiler jobs running at the time.

This patch adds a ten second timeout for these utilities, after which
the job is killed and an error is reported.
@jhuber6 merged commit 2981f3a into llvm:main on Jun 7, 2024
4 of 6 checks passed
@Artem-B (Member) commented on Jun 7, 2024

Ooh... I think I know exactly what may be causing this.

On machines where NVIDIA GPUs are used for compute only (e.g. a headless server machine), NVIDIA drivers are not always loaded by default and may not have driver persistence enabled. The drivers get loaded when the GPU is accessed, and then released and unloaded when there are no GPU users remaining. A parallel compilation with --offload-arch=native will be the worst-case stress test for the driver init/deinit machinery, as GPU probing is both short-lived and done repeatedly.

Adding a timeout here would help, sort of, but it would be much better if we could figure out a way to either detect that GPU probing takes too long (and likely causes the driver to load/unload), or cache probing results somehow, so we do not have to run the same detection over and over again. This is a point towards pushing the detection out of clang into the build system, which would be the better place to do it.
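
As a purely hypothetical illustration of the caching idea (nothing like this exists in the patch; the cache-file helpers and path handling are invented for the sketch), the detection result could be memoized in a small file so repeated jobs skip the driver entirely:

  // Sketch only: file-backed memoization of the detected GPU architecture.
  #include "llvm/Support/MemoryBuffer.h"
  #include "llvm/Support/raw_ostream.h"
  #include <optional>
  #include <string>
  #include <system_error>

  // Return the cached architecture string if a cache file already exists.
  static std::optional<std::string> readCachedArch(llvm::StringRef CachePath) {
    auto BufOrErr = llvm::MemoryBuffer::getFile(CachePath);
    if (!BufOrErr)
      return std::nullopt; // No cache yet; the caller falls back to probing.
    return (*BufOrErr)->getBuffer().trim().str();
  }

  // Best-effort write; a failure simply means the next job probes again.
  static void writeCachedArch(llvm::StringRef CachePath, llvm::StringRef Arch) {
    std::error_code EC;
    llvm::raw_fd_ostream OS(CachePath, EC);
    if (!EC)
      OS << Arch << "\n";
  }

A cache like this would still need an invalidation story (GPUs changing between runs), which is part of why doing the probe once in the build system is attractive.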

For the GPU detection, we may be able to work around the issue by leaving the detection app running for the duration of the compilation, and prevent driver unloading, but it's a rather gross hack.

@jhuber6 (Contributor, Author) commented on Jun 7, 2024

Ooh... I think I know exactly what may be causing this.

I've observed this a few times. In my case it's usually when some application hangs on the GPU and no one notices; then these tools hang forever and it takes a while to catch it. I figured an error is friendlier, since I highly doubt these tools will take over ten seconds to run even in the worst case.

On machines where NVIDIA GPUs are used for compute only (e.g. a headless server machine), NVIDIA drivers are not always loaded by default and may not have driver persistence enabled.

What's the config to set this by default without any graphics? It would be nice not to need to worry about it on my dev machine.

For the GPU detection, we may be able to work around the issue by leaving the detection app running for the duration of the compilation, and prevent driver unloading, but it's a rather gross hack.

I know for AMD stuff we used to just probe the PCI connections, but that leaked a lot of information so this is the easier way to do it. I wonder what __nvcc_device_query does internally.
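
For context, a rough sketch (assuming a Linux host; this is not what amdgpu-arch does) of what probing the PCI bus through sysfs could look like; it only identifies AMD devices by vendor ID, so mapping to a specific gfx architecture would still need a device-ID table:

  // Sketch only: look for AMD PCI devices (vendor ID 0x1002) via Linux sysfs.
  #include <filesystem>
  #include <fstream>
  #include <iostream>
  #include <string>

  int main() {
    namespace fs = std::filesystem;
    for (const auto &Entry : fs::directory_iterator("/sys/bus/pci/devices")) {
      std::ifstream Vendor(Entry.path() / "vendor");
      std::string Id;
      if (Vendor >> Id && Id == "0x1002") // AMD/ATI vendor ID.
        std::cout << "AMD PCI device at " << Entry.path().filename().string()
                  << "\n";
    }
    return 0;
  }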

@Artem-B (Member) commented on Jun 7, 2024

What's the config to set this by default without any graphics?

https://docs.nvidia.com/deploy/driver-persistence/index.html

I usually use "nvidia-smi -i -pm ENABLED" to force the driver to be loaded permanently.

As for __nvcc_device_query, my guess is that it just uses a handful of driver API calls. Nothing unusual; it's probably the standard sequence of cuInit/cuDeviceGetCount/cuDeviceGetAttribute, as in this example:
https://github.com/NVIDIA/cuda-samples/blob/5f97d7d0dff880bc6567faa4c5e62e389a6d6999/Samples/1_Utilities/deviceQueryDrv/deviceQueryDrv.cpp#L56

  // Enumerate devices with the CUDA driver API (deviceCount, dev, and
  // deviceName are declared earlier in the sample).
  checkCudaErrors(cuInit(0));
  checkCudaErrors(cuDeviceGetCount(&deviceCount));
  for (dev = 0; dev < deviceCount; ++dev) {
    checkCudaErrors(cuDeviceGetName(deviceName, 256, dev));
    ...
  }
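
For --offload-arch=native specifically, the useful output is an sm_XX string per device; a minimal sketch of that step, using only documented CUDA driver API calls (an illustration, not necessarily nvptx-arch's exact implementation), might be:

  // Sketch only: print an sm_XX architecture string for each visible device.
  #include <cuda.h>
  #include <cstdio>

  int main() {
    int Count = 0;
    if (cuInit(0) != CUDA_SUCCESS || cuDeviceGetCount(&Count) != CUDA_SUCCESS)
      return 1;
    for (int Dev = 0; Dev < Count; ++Dev) {
      int Major = 0, Minor = 0;
      cuDeviceGetAttribute(&Major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR,
                           Dev);
      cuDeviceGetAttribute(&Minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR,
                           Dev);
      std::printf("sm_%d%d\n", Major, Minor);
    }
    return 0;
  }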

@HerrCai0907 mentioned this pull request on Jun 13, 2024