
[Clang] Add timeout for GPU detection utilities #94751

Merged 1 commit into llvm:main on Jun 7, 2024

Conversation

@jhuber6 (Contributor) commented on Jun 7, 2024

Summary:
The utilities nvptx-arch and amdgpu-arch are used to support
--offload-arch=native, among other features in clang. However, they
rely on the GPU drivers to query the available devices, and in certain
cases those drivers can lock up, leading to indefinite hangs in any
compiler jobs running at the time.

This patch adds a ten second timeout for these utilities, after which
the job is killed and an error is reported.

@llvmbot added the clang, backend:AMDGPU, and clang:driver labels on Jun 7, 2024
@llvmbot (Collaborator) commented on Jun 7, 2024

@llvm/pr-subscribers-clang
@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-clang-driver

Author: Joseph Huber (jhuber6)

Changes

Summary:
The utilities nvptx-arch and amdgpu-arch are used to support
--offload-arch=native, among other features in clang. However, they
rely on the GPU drivers to query the available devices, and in certain
cases those drivers can lock up, leading to indefinite hangs in any
compiler jobs running at the time.

This patch adds a ten second timeout for these utilities, after which
the job is killed and an error is reported.


Full diff: https://github.com/llvm/llvm-project/pull/94751.diff

4 Files Affected:

  • (modified) clang/include/clang/Driver/ToolChain.h (+1-1)
  • (modified) clang/lib/Driver/ToolChain.cpp (+4-4)
  • (modified) clang/lib/Driver/ToolChains/AMDGPU.cpp (+1-1)
  • (modified) clang/lib/Driver/ToolChains/Cuda.cpp (+1-1)
diff --git a/clang/include/clang/Driver/ToolChain.h b/clang/include/clang/Driver/ToolChain.h
index a4f9cad98aa8b..87a5034dfd78b 100644
--- a/clang/include/clang/Driver/ToolChain.h
+++ b/clang/include/clang/Driver/ToolChain.h
@@ -205,7 +205,7 @@ class ToolChain {
 
   /// Executes the given \p Executable and returns the stdout.
   llvm::Expected<std::unique_ptr<llvm::MemoryBuffer>>
-  executeToolChainProgram(StringRef Executable) const;
+  executeToolChainProgram(StringRef Executable, unsigned Timeout = 0) const;
 
   void setTripleEnvironment(llvm::Triple::EnvironmentType Env);
 
diff --git a/clang/lib/Driver/ToolChain.cpp b/clang/lib/Driver/ToolChain.cpp
index 0e86bc07e0ea2..8c746ac8066cb 100644
--- a/clang/lib/Driver/ToolChain.cpp
+++ b/clang/lib/Driver/ToolChain.cpp
@@ -104,7 +104,8 @@ ToolChain::ToolChain(const Driver &D, const llvm::Triple &T,
 }
 
 llvm::Expected<std::unique_ptr<llvm::MemoryBuffer>>
-ToolChain::executeToolChainProgram(StringRef Executable) const {
+ToolChain::executeToolChainProgram(StringRef Executable,
+                                   unsigned Timeout) const {
   llvm::SmallString<64> OutputFile;
   llvm::sys::fs::createTemporaryFile("toolchain-program", "txt", OutputFile);
   llvm::FileRemover OutputRemover(OutputFile.c_str());
@@ -115,9 +116,8 @@ ToolChain::executeToolChainProgram(StringRef Executable) const {
   };
 
   std::string ErrorMessage;
-  if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects,
-                                /* SecondsToWait */ 0,
-                                /*MemoryLimit*/ 0, &ErrorMessage))
+  if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects, Timeout,
+                                /*MemoryLimit=*/0, &ErrorMessage))
     return llvm::createStringError(std::error_code(),
                                    Executable + ": " + ErrorMessage);
 
diff --git a/clang/lib/Driver/ToolChains/AMDGPU.cpp b/clang/lib/Driver/ToolChains/AMDGPU.cpp
index 9ffea57b005de..92895d8186e83 100644
--- a/clang/lib/Driver/ToolChains/AMDGPU.cpp
+++ b/clang/lib/Driver/ToolChains/AMDGPU.cpp
@@ -877,7 +877,7 @@ AMDGPUToolChain::getSystemGPUArchs(const ArgList &Args) const {
   else
     Program = GetProgramPath("amdgpu-arch");
 
-  auto StdoutOrErr = executeToolChainProgram(Program);
+  auto StdoutOrErr = executeToolChainProgram(Program, /*Timeout=*/10);
   if (!StdoutOrErr)
     return StdoutOrErr.takeError();
 
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index bbc8be91fd70b..47dac0e439f10 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -826,7 +826,7 @@ NVPTXToolChain::getSystemGPUArchs(const ArgList &Args) const {
   else
     Program = GetProgramPath("nvptx-arch");
 
-  auto StdoutOrErr = executeToolChainProgram(Program);
+  auto StdoutOrErr = executeToolChainProgram(Program, /*Timeout=*/10);
   if (!StdoutOrErr)
     return StdoutOrErr.takeError();
 

@jhuber6 (Contributor, Author) commented on Jun 7, 2024

No active test because I have no idea how you would write one, but I intentionally made it time out and it returns a `Child timed out` error as expected.
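
For reference, a minimal sketch of how one might reproduce the timeout behavior by hand, assuming a host with /bin/sleep available (this is not part of the patch or its tests, just an illustration of llvm::sys::ExecuteAndWait with a non-zero SecondsToWait):

  // Hypothetical manual check: run a child that outlives SecondsToWait and
  // confirm ExecuteAndWait kills it and reports an error (the author observed
  // "Child timed out").
  #include "llvm/Support/Program.h"
  #include "llvm/Support/raw_ostream.h"
  #include <optional>
  #include <string>

  int main() {
    llvm::StringRef Sleep = "/bin/sleep"; // assumed path, illustration only
    std::string ErrMsg;
    int RC = llvm::sys::ExecuteAndWait(Sleep, {Sleep, "30"},
                                       /*Env=*/std::nullopt, /*Redirects=*/{},
                                       /*SecondsToWait=*/1, /*MemoryLimit=*/0,
                                       &ErrMsg);
    llvm::errs() << "rc=" << RC << " err=" << ErrMsg << "\n";
    return 0;
  }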

Summary:
The utilities `nvptx-arch` and `amdgpu-arch` are used to support
`--offload-arch=native`, among other features in clang. However, they
rely on the GPU drivers to query the available devices, and in certain
cases those drivers can lock up, leading to indefinite hangs in any
compiler jobs running at the time.

This patch adds a ten second timeout for these utilities, after which
the job is killed and an error is reported.
@jhuber6 merged commit 2981f3a into llvm:main on Jun 7, 2024
4 of 6 checks passed
@Artem-B (Member) commented on Jun 7, 2024

Ooh... I think I know exactly what may be causing this.

On machines where NVIDIA GPUs are used for compute only (e.g. a headless server machine), NVIDIA drivers are not always loaded by default and may not have driver persistence enabled. The drivers get loaded when the GPU is accessed, and then released and unloaded when there are no GPU users remaining. A parallel compilation with --offload-arch=native will be the worst-case stress test for the driver init/deinit machinery, as GPU probing is both short-lived and done repeatedly.

Adding a timeout here would help, sort of, but it would be much better if we could figure out a way to either detect that GPU probing takes too long (and likely causes the driver to load/unload), or cache probing results somehow, so we do not have to run the same detection over and over again. This is a point towards pushing the detection out of clang into the build system, which would be the better place to do it.
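
As a purely hypothetical illustration of the caching idea (nothing like this exists in the patch; the cache-file helpers and path handling are invented for the sketch), the detection result could be memoized in a small file so repeated jobs skip the driver entirely:

  // Sketch only: file-backed memoization of the detected GPU architecture.
  #include "llvm/Support/MemoryBuffer.h"
  #include "llvm/Support/raw_ostream.h"
  #include <optional>
  #include <string>
  #include <system_error>

  // Return the cached architecture string if a cache file already exists.
  static std::optional<std::string> readCachedArch(llvm::StringRef CachePath) {
    auto BufOrErr = llvm::MemoryBuffer::getFile(CachePath);
    if (!BufOrErr)
      return std::nullopt; // No cache yet; the caller falls back to probing.
    return (*BufOrErr)->getBuffer().trim().str();
  }

  // Best-effort write; a failure simply means the next job probes again.
  static void writeCachedArch(llvm::StringRef CachePath, llvm::StringRef Arch) {
    std::error_code EC;
    llvm::raw_fd_ostream OS(CachePath, EC);
    if (!EC)
      OS << Arch << "\n";
  }

A cache like this would still need an invalidation story (GPUs changing between runs), which is part of why doing the probe once in the build system is attractive.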

For the GPU detection, we may be able to work around the issue by leaving the detection app running for the duration of the compilation, and prevent driver unloading, but it's a rather gross hack.

@jhuber6 (Contributor, Author) commented on Jun 7, 2024

Ooh... I think I know exactly what may be causing this.

I've observed this a few times. In my case it's usually when some application hangs on the GPU and no one notices; then these tools hang forever and it takes a while to catch it. I figured an error is friendlier, since I highly doubt these tools will take over ten seconds to run even in the worst case.

On machines where NVIDIA GPUs are used for compute only (e.g. a headless server machine), NVIDIA drivers are not always loaded by default and may not have driver persistence enabled.

What's the config to set this by default without any graphics? It would be nice not to need to worry about it on my dev machine.

For the GPU detection, we may be able to work around the issue by leaving the detection app running for the duration of the compilation, and prevent driver unloading, but it's a rather gross hack.

I know for AMD stuff we used to just probe the PCI connections, but that leaked a lot of information so this is the easier way to do it. I wonder what __nvcc_device_query does internally.
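
For context, a rough sketch (assuming a Linux host; this is not what amdgpu-arch does) of what probing the PCI bus through sysfs could look like; it only identifies AMD devices by vendor ID, so mapping to a specific gfx architecture would still need a device-ID table:

  // Sketch only: look for AMD PCI devices (vendor ID 0x1002) via Linux sysfs.
  #include <filesystem>
  #include <fstream>
  #include <iostream>
  #include <string>

  int main() {
    namespace fs = std::filesystem;
    for (const auto &Entry : fs::directory_iterator("/sys/bus/pci/devices")) {
      std::ifstream Vendor(Entry.path() / "vendor");
      std::string Id;
      if (Vendor >> Id && Id == "0x1002") // AMD/ATI vendor ID.
        std::cout << "AMD PCI device at " << Entry.path().filename().string()
                  << "\n";
    }
    return 0;
  }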

@Artem-B (Member) commented on Jun 7, 2024

What's the config to set this by default without any graphics?

https://docs.nvidia.com/deploy/driver-persistence/index.html

I usually use "nvidia-smi -i -pm ENABLED" to force the driver to be loaded permanently.

As for __nvcc_device_query, my guess is that it just uses a handful of driver API calls. Nothing unusual; it's probably the standard sequence of cuInit/cuDeviceGetCount/cuDeviceGetAttribute, as in this example:
https://github.com/NVIDIA/cuda-samples/blob/5f97d7d0dff880bc6567faa4c5e62e389a6d6999/Samples/1_Utilities/deviceQueryDrv/deviceQueryDrv.cpp#L56

  // Enumerate devices with the CUDA driver API (deviceCount, dev, and
  // deviceName are declared earlier in the sample).
  checkCudaErrors(cuInit(0));
  checkCudaErrors(cuDeviceGetCount(&deviceCount));
  for (dev = 0; dev < deviceCount; ++dev) {
    checkCudaErrors(cuDeviceGetName(deviceName, 256, dev));
    ...
  }
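
For --offload-arch=native specifically, the useful output is an sm_XX string per device; a minimal sketch of that step, using only documented CUDA driver API calls (an illustration, not necessarily nvptx-arch's exact implementation), might be:

  // Sketch only: print an sm_XX architecture string for each visible device.
  #include <cuda.h>
  #include <cstdio>

  int main() {
    int Count = 0;
    if (cuInit(0) != CUDA_SUCCESS || cuDeviceGetCount(&Count) != CUDA_SUCCESS)
      return 1;
    for (int Dev = 0; Dev < Count; ++Dev) {
      int Major = 0, Minor = 0;
      cuDeviceGetAttribute(&Major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR,
                           Dev);
      cuDeviceGetAttribute(&Minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR,
                           Dev);
      std::printf("sm_%d%d\n", Major, Minor);
    }
    return 0;
  }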

@HerrCai0907 mentioned this pull request on Jun 13, 2024