Support nvcc in sccache-dist #2247

Open

trxcllnt wants to merge 47 commits into base: main

Conversation

@trxcllnt (Contributor) commented Aug 19, 2024

This PR implements the features described in #2238. The reasons and benefits for this PR are described there, so I'll focus on the details of the implementation here.

First, I'll cover how to use and modify nvcc --dryrun output to reliably yield cache hits and misses. Then I'll cover the sccache changes necessary to implement this feature. Last, I'll show the results of a fully-uncached NVIDIA RAPIDS build with these changes applied.

CUDA compilation

To distribute and cache CUDA compilations, it's necessary to understand the anatomy of an nvcc call, and how the GPU architecture options impact what is compiled into the final object.

Anatomy of an nvcc call

As noted in #2238, nvcc is a wrapper around the host compiler and internal CUDA device compilers. Notably for sccache, nvcc can compile a source file into an object that runs on multiple GPU architectures.

Two kinds of device code can be embedded into the final object: PTX and cubins. A .ptx file is assembly, and a .cubin file is the assembled GPU code. A .cubin is valid for any GPU architecture in the same family, e.g. sm_70.cubin runs on Volta and Turing (but not Ampere), and sm_80.cubin runs on Ampere and Ada (but not Hopper).

Applications that wish to run on heterogeneous GPU architectures embed cubins for their supported architecture families, as well as PTX for the latest architecture. If the application is run on a newer GPU than what's been embedded, the CUDA driver will JIT the embedded PTX into GPU code at runtime.

This is achieved by the -gencode= flags:

$ nvcc -gencode=arch=compute_70,code=[sm_70]
#                    ^                ^
#       compile arch 70 PTX           assemble and embed arch 70 cubin

$ nvcc -gencode=arch=compute_80,code=[compute_80,sm_80]
#                    ^                ^          ^
#       compile arch 80 PTX  embed arch 80 PTX   assemble and embed arch 80 cubin

The nvcc --dryrun flag shows how nvcc achieves this:

$ nvcc -c x.cu -o x.cu.o -gencode=arch=compute_70,code=[sm_70] -gencode=arch=compute_80,code=[compute_80,sm_80] --keep --dryrun
$ nvcc --version | grep -i release
Cuda compilation tools, release 12.6, V12.6.20
$ nvcc -c x.cu -o x.cu.o -gencode=arch=compute_70,code=[sm_70] -gencode=arch=compute_80,code=[compute_80,sm_80] --keep --dryrun
#$ _NVVM_BRANCH_=nvvm
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/usr/local/cuda/bin
#$ _THERE_=/usr/local/cuda/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_DIR_=targets/x86_64-linux
#$ TOP=/usr/local/cuda/bin/..
#$ CICC_PATH=/usr/local/cuda/bin/../nvvm/bin
#$ NVVMIR_LIBRARY_DIR=/usr/local/cuda/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/usr/local/cuda/bin/../lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
#$ PATH=/usr/local/cuda/bin/../nvvm/bin:/usr/local/cuda/bin:/vscode/vscode-server/bin/linux-x64/eaa41d57266683296de7d118f574d0c2652e1fc4/bin/remote-cli:/home/coder/.local/bin:/home/coder/bin:/usr/local/cargo/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
#$ INCLUDES="-I/usr/local/cuda/bin/../targets/x86_64-linux/include"
#$ LIBRARIES=  "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -D__CUDA_ARCH_LIST__=700,800 -D__NV_LEGACY_LAUNCH -E -x c++ -D__CUDACC__ -D__NVCC__  "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=6 -D__CUDACC_VER_BUILD__=20 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=6 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "x.cpp4.ii"
#$ cudafe++ --c++17 --gnu_version=110400 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed  --m64 --parse_templates --gen_c_file_name "x.compute_80.cudafe1.cpp" --stub_file_name "x.compute_80.cudafe1.stub.c" --gen_module_id_file --module_id_file_name "x.module_id" "x.cpp4.ii"
#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=700,800 -D__NV_LEGACY_LAUNCH -E -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__  "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=6 -D__CUDACC_VER_BUILD__=20 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=6 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "x.compute_70.cpp1.ii"
#$ "$CICC_PATH/cicc" --c++17 --gnu_version=110400 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed   -arch compute_70 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "x.fatbin.c" -tused --module_id_file_name "x.module_id" --gen_c_file_name "x.compute_70.cudafe1.c" --stub_file_name "x.compute_70.cudafe1.stub.c" --gen_device_file_name "x.compute_70.cudafe1.gpu"  "x.compute_70.cpp1.ii" -o "x.compute_70.ptx"
#$ ptxas -arch=sm_70 -m64  "x.compute_70.ptx"  -o "x.compute_70.cubin"
#$ gcc -D__CUDA_ARCH__=800 -D__CUDA_ARCH_LIST__=700,800 -D__NV_LEGACY_LAUNCH -E -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__  "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=6 -D__CUDACC_VER_BUILD__=20 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=6 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "x.compute_80.cpp1.ii"
#$ "$CICC_PATH/cicc" --c++17 --gnu_version=110400 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed   -arch compute_80 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "x.fatbin.c" -tused --module_id_file_name "x.module_id" --gen_c_file_name "x.compute_80.cudafe1.c" --stub_file_name "x.compute_80.cudafe1.stub.c" --gen_device_file_name "x.compute_80.cudafe1.gpu"  "x.compute_80.cpp1.ii" -o "x.compute_80.ptx"
#$ ptxas -arch=sm_80 -m64  "x.compute_80.ptx"  -o "x.compute_80.sm_80.cubin"
#$ fatbinary --create="x.fatbin" -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=elf,sm=70,file=x.compute_70.cubin" "--image3=kind=ptx,sm=80,file=x.compute_80.ptx" "--image3=kind=elf,sm=80,file=x.compute_80.sm_80.cubin" --embedded-fatbin="x.fatbin.c"
#$ gcc -D__CUDA_ARCH__=800 -D__CUDA_ARCH_LIST__=700,800 -D__NV_LEGACY_LAUNCH -c -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -Wno-psabi "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"   -m64 "x.compute_80.cudafe1.cpp" -o "x.cu.o" 

This output can be grouped into sections of commands that must run sequentially (whether local or distributed).
Each group may depend on previous groups, but some groups can be executed in parallel.

1. This group is a list of environment variables needed by the later sections:
#$ _NVVM_BRANCH_=nvvm
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/usr/local/cuda/bin
#$ _THERE_=/usr/local/cuda/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_DIR_=targets/x86_64-linux
#$ TOP=/usr/local/cuda/bin/..
#$ CICC_PATH=/usr/local/cuda/bin/../nvvm/bin
#$ NVVMIR_LIBRARY_DIR=/usr/local/cuda/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/usr/local/cuda/bin/../lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
#$ PATH=/usr/local/cuda/bin/../nvvm/bin:/usr/local/cuda/bin:/vscode/vscode-server/bin/linux-x64/eaa41d57266683296de7d118f574d0c2652e1fc4/bin/remote-cli:/home/coder/.local/bin:/home/coder/bin:/usr/local/cargo/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
#$ INCLUDES="-I/usr/local/cuda/bin/../targets/x86_64-linux/include"
#$ LIBRARIES=  "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=
2. This group preprocesses the source file and runs cudafe++ to generate the host code into which the GPU code will later be embedded:
#$ gcc -D__CUDA_ARCH_LIST__=700,800 -D__NV_LEGACY_LAUNCH -E -x c++ -D__CUDACC__ -D__NVCC__  "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=6 -D__CUDACC_VER_BUILD__=20 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=6 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "x.cpp4.ii"
#$ cudafe++ --c++17 --gnu_version=110400 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed  --m64 --parse_templates --gen_c_file_name "x.compute_80.cudafe1.cpp" --stub_file_name "x.compute_80.cudafe1.stub.c" --gen_module_id_file --module_id_file_name "x.module_id" "x.cpp4.ii" 
3. This group preprocesses the source for arch 70, compiles the result to arch 70 PTX, then assembles that PTX into an arch 70 cubin:
#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=700,800 -D__NV_LEGACY_LAUNCH -E -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__  "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=6 -D__CUDACC_VER_BUILD__=20 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=6 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "x.compute_70.cpp1.ii"
#$ "$CICC_PATH/cicc" --c++17 --gnu_version=110400 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed   -arch compute_70 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "x.fatbin.c" -tused --module_id_file_name "x.module_id" --gen_c_file_name "x.compute_70.cudafe1.c" --stub_file_name "x.compute_70.cudafe1.stub.c" --gen_device_file_name "x.compute_70.cudafe1.gpu"  "x.compute_70.cpp1.ii" -o "x.compute_70.ptx"
#$ ptxas -arch=sm_70 -m64  "x.compute_70.ptx"  -o "x.compute_70.cubin" 
4. This group does the same as the third group, but for arch 80:
#$ gcc -D__CUDA_ARCH__=800 -D__CUDA_ARCH_LIST__=700,800 -D__NV_LEGACY_LAUNCH -E -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__  "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=6 -D__CUDACC_VER_BUILD__=20 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=6 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "x.compute_80.cpp1.ii"
#$ "$CICC_PATH/cicc" --c++17 --gnu_version=110400 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed   -arch compute_80 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "x.fatbin.c" -tused --module_id_file_name "x.module_id" --gen_c_file_name "x.compute_80.cudafe1.c" --stub_file_name "x.compute_80.cudafe1.stub.c" --gen_device_file_name "x.compute_80.cudafe1.gpu"  "x.compute_80.cpp1.ii" -o "x.compute_80.ptx"
#$ ptxas -arch=sm_80 -m64  "x.compute_80.ptx"  -o "x.compute_80.sm_80.cubin" 
5. This group assembles the PTX and cubins into a fatbin, then compiles the cudafe++ output from step 2 into an object with the fatbin embedded:
#$ fatbinary --create="x.fatbin" -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=elf,sm=70,file=x.compute_70.cubin" "--image3=kind=ptx,sm=80,file=x.compute_80.ptx" "--image3=kind=elf,sm=80,file=x.compute_80.sm_80.cubin" --embedded-fatbin="x.fatbin.c"
#$ gcc -D__CUDA_ARCH__=800 -D__CUDA_ARCH_LIST__=700,800 -D__NV_LEGACY_LAUNCH -c -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -Wno-psabi "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"   -m64 "x.compute_80.cudafe1.cpp" -o "x.cu.o"

The above commands as a DAG:

flowchart TD
    subgraph Legend
        local["Must run locally"]
        dist("Can be distributed")
    end
    subgraph "nvcc --dryun flowchart"
        nvcc["nvcc"] --> cudafe["Preprocessor and cudafe++"]
        cudafe --> preprocessor-arch-70["Preprocess for arch 70"]
        cudafe --> preprocessor-arch-80["Preprocess for arch 80"]
        preprocessor-arch-70 --> cicc-arch-70("Compile PTX for arch 70")
        cicc-arch-70 --> ptxas-arch-70("Assemble cubin for arch 70")
        ptxas-arch-70 --> fatbin["Assemble arch 70 cubin and arch 80 PTX and cubin into fatbin"]
        preprocessor-arch-80 --> cicc-arch-80("Compile PTX for arch 80")
        cicc-arch-80 --> ptxas-arch-80("Assemble cubin for arch 80")
        ptxas-arch-80 --> fatbin
        fatbin --> compile-final-object("Compile fatbin + host code to final object")
    end

The nodes labeled "Can be distributed" are the most expensive to run, but they are also cacheable.

For example, rebuilding with a subset of a prior build's architectures should be fast:

# Populates the cache with compute_{70,80}.{ptx,cubin}:
$ sccache nvcc -c x.cu -o x.cu.o -gencode=arch=compute_70,code=[sm_70] -gencode=arch=compute_80,code=[compute_80,sm_80]

# Rebuilding for arch 80 should load compute_80.{ptx,cubin} from the cache:
$ sccache nvcc -c x.cu -o x.cu.o -gencode=arch=compute_80,code=[compute_80,sm_80]

Impediments to caching

In theory, sccache should be able to parse the nvcc --dryrun output and execute each command. In practice, directly executing nvcc --dryrun commands yields cache misses when it should yield hits, necessitating careful modifications.

Random strings in file names

By default nvcc's generated files have randomly-generated strings as part of their file names. These strings end up in the preprocessor output, making it impossible to cache when the post-processed file is included in the cache key.

This behavior is disabled when the --keep flag is present, so it is essential to use nvcc --dryrun --keep to generate commands.

Architecture-dependent file names

The filenames generated for the intermediate .ii, .stub.c, and .cudafe1.{c,cpp} files are sensitive to the set of -gencode= flags. Because we're dealing with preprocessor output, these names leak into the post-processed output, again leading to cache misses.

The file names aren't relevant to the final result, so we can rename them deterministically (e.g. by auto-incrementing a counter per file type) to yield cache hits.
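
To illustrate the idea, here is a minimal sketch of such a deterministic renaming scheme. It is a hypothetical example, not the PR's actual code; the function name and the "counter per extension" rule are assumptions:

use std::collections::HashMap;
use std::path::Path;

// Give nvcc's intermediate files stable names keyed by file type, so that
// arch-dependent (or random) names never leak into the preprocessed output
// sccache hashes. The same mapping must be applied consistently to every
// subcommand that reads or writes the file.
fn stable_name(
    original: &str,
    counters: &mut HashMap<String, usize>,
    renames: &mut HashMap<String, String>,
) -> String {
    if let Some(existing) = renames.get(original) {
        return existing.clone();
    }
    let ext = Path::new(original)
        .extension()
        .and_then(|e| e.to_str())
        .unwrap_or("tmp")
        .to_string();
    let n = counters.entry(ext.clone()).or_insert(0);
    let renamed = format!("x_{}.{}", n, ext);
    *n += 1;
    renames.insert(original.to_string(), renamed.clone());
    renamed
}

fn main() {
    let (mut counters, mut renames) = (HashMap::new(), HashMap::new());
    for f in ["x.compute_70.cpp1.ii", "x.compute_80.cpp1.ii", "x.compute_80.cudafe1.cpp"] {
        // Prints e.g. "x.compute_70.cpp1.ii -> x_0.ii", "x.compute_80.cpp1.ii -> x_1.ii"
        println!("{} -> {}", f, stable_name(f, &mut counters, &mut renames));
    }
}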

Choice of build directory

Since we'd like to run the underlying nvcc commands directly, we need a scratch directory in which to work.

nvcc --dryrun --keep --keep-dir $(mktemp -d) would generate commands with paths to files in the temp dir; however, this leads to the same issue as before: the path to the temp dir ends up in the post-processed files, leading to unnecessary cache misses.

Additionally, since the original nvcc invocation can include relative paths, it's essential to either run the preprocessor from the original cwd, or canonicalize all argument paths. To align with existing sccache behavior, I chose the former approach.

My solution is to alternate which directory each command is executed from. For example, in pseudo-code:

cwd=$(pwd)
tmp=$(mktemp -d)

# step 2
(cd $cwd && gcc -E x.cu -o "$tmp/x.cpp4.ii")
(cd $tmp && cudafe++ --gen_c_file_name "x.cudafe1.cpp" "x.cpp4.ii")

# step 3
(cd $cwd && gcc -E "x.cu" -o "$tmp/x.compute_70.cpp1.ii")
(cd $tmp && cicc "x.compute_70.cpp1.ii" -o "x.compute_70.ptx")
(cd $tmp && ptxas "x.compute_70.ptx" -o "x.compute_70.cubin")

# step 4
(cd $cwd && gcc -E "x.cu" -o "$tmp/x.compute_80.cpp1.ii")
(cd $tmp && cicc "x.compute_80.cpp1.ii" -o "x.compute_80.ptx")
(cd $tmp && ptxas "x.compute_80.ptx" -o "x.compute_80.cubin")

# step 5
(cd $tmp && fatbinary ...)
(cd $cwd && gcc "$tmp/x.cudafe1.cpp" -o "x.cu.o")

This approach ensures that only paths relative to the tempdir ever appear in the post-processed output handled by nvcc. Along with renaming the files deterministically by extension, this ensures no false-negative cache misses from random paths leaking into preprocessor output.

sccache implementation

The high-level design of this feature is as follows:

  • Add cicc and ptxas as "top-level" compilers in sccache
  • Provide a way for the sccache server to recursively call itself with additional compile commands
  • Modify the Nvcc compiler to generate a compile command that:
    • parses nvcc --keep --dryrun for additional envvars and constructs the command group DAG (a minimal parsing sketch follows this list)
    • ensures appropriate subcommand groups run in parallel based on the user's nvcc --threads N value
    • invokes each preprocessor command directly, or recursively invokes sccache to produce each build product
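
To make the dryrun-parsing point concrete, here is a rough sketch with assumed names and types (not the PR's code; it assumes the shlex crate this PR adds as a dependency) of how the `#$ `-prefixed --dryrun lines might be split into the step-1 environment variables and the commands that form the DAG:

use std::collections::HashMap;

// Split `nvcc --keep --dryrun` output into the step-1 environment variables
// and the remaining commands. The real implementation then arranges the
// commands into the DAG shown above (preprocess + cudafe++ first, per-arch
// cicc/ptxas chains in parallel, then fatbinary and the final host compile).
fn parse_dryrun(output: &str) -> (HashMap<String, String>, Vec<Vec<String>>) {
    let mut env = HashMap::new();
    let mut commands = Vec::new();
    for line in output.lines() {
        // Every line of interest is prefixed with `#$ `.
        let Some(rest) = line.strip_prefix("#$ ") else { continue };
        match rest.split_once('=') {
            // Env assignments look like `NAME=value` with an UPPER_SNAKE name.
            Some((name, value))
                if !name.is_empty()
                    && name.chars().all(|c| c.is_ascii_uppercase() || c == '_') =>
            {
                env.insert(name.to_string(), value.to_string());
            }
            // Everything else is a command; shlex lets us bail early on lines
            // we cannot tokenize (the real code returns an error, not a panic).
            _ => commands.push(shlex::split(rest).expect("unparseable --dryrun line")),
        }
    }
    (env, commands)
}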

Adding cicc and ptxas as top-level compilers

The cicc and ptxas compiler implementations are straightforward, if a bit odd compared to other CCompilerImpls.

cicc and ptxas arguments are not documented, and the NVIDIA compiler team may change them at any time. So the approach I've taken is to process only the subset of arguments that impact caching (some of which are excluded from the computed hash), and to pass all other arguments through as-is (including them in the computed hash).

In addition to the -o x.ptx argument, cicc has 3 options that cause it to create additional outputs: --gen_c_file_name, --gen_device_file_name, and --stub_file_name. The file names can be different based on the -gencode flags, so they are excluded from the hash computed for the output .ptx file.
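
As a rough illustration of that hashing policy (the function and the exact flag set are assumptions for this sketch, not the PR's code), the arguments can be partitioned into a hashed set and an unhashed set of output-name flags:

// Partition cicc's arguments into those that feed the hash for the output
// .ptx and those that merely name side outputs (whose names vary with the
// -gencode set and must therefore stay out of the cache key). Everything
// not in the list is passed through and hashed as-is.
fn partition_cicc_args(args: &[String]) -> (Vec<String>, Vec<String>) {
    const UNHASHED_OUTPUT_FLAGS: &[&str] = &[
        "--gen_c_file_name",
        "--gen_device_file_name",
        "--stub_file_name",
    ];
    let (mut hashed, mut unhashed) = (Vec::new(), Vec::new());
    let mut iter = args.iter();
    while let Some(arg) = iter.next() {
        if UNHASHED_OUTPUT_FLAGS.contains(&arg.as_str()) {
            // The flag and its file-name value still appear on the command
            // line; they just don't contribute to the computed hash.
            unhashed.push(arg.clone());
            if let Some(value) = iter.next() {
                unhashed.push(value.clone());
            }
        } else {
            hashed.push(arg.clone());
        }
    }
    (hashed, unhashed)
}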

cicc also requires the .module_id file generated by cudafe++ as an input. This is available when performing a local compile, but requires adding an extra_dist_files list to the ParsedArguments and CInputsPackager structs.

Making sccache reentrant

In order for the Nvcc compiler to generate a CompileCommand that can load-or-compile-and-cache the underlying ptx and cubins, a mechanism for recursively calling sccache with additional compile commands needs to exist.

Theoretically we could re-invoke the sccache client binary as a subprocess, passing the sub-compiler as arguments, e.g. sccache cicc ... (or make a request to the sccache server that communicates the same information), but this is a non-starter due to jobserver limitations. If each nvcc invocation spawns sub-invocations that are processed by the sccache server's Command::Compile matcher, a new jobserver slot is reserved for each. The outer nvcc slot is not released, so once all the jobserver slots are reserved, the server deadlocks: nvcc jobs hold slots while waiting for sub-compiler invocations that never start because they are blocked behind other nvcc jobs.

The other way to re-enter sccache with additional compile commands seems to be via SccacheService<T>. By providing the SccacheService instance to CompileCommand::execute(), Nvcc's CompileCommand implementation should be able to call it as necessary for each sub-command.

I refactored start_compile_task() into an async function that returns a Future<Result<CompileFinished>>, and refactored check_compiler() to spawn + join a tokio task for the start_compile_task() future. Then if we make both compiler_info() and start_compile_task() public, the Nvcc CompileCommand implementation can mimic the top-level handle_compile() logic without spawning additional tasks.
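
A heavily simplified sketch of the shape of that refactor, with stand-in types and signatures (it assumes tokio and anyhow, which sccache already uses; the real functions take many more parameters):

use std::sync::Arc;

// Stand-ins for the real sccache types; only the shape matters here.
struct SccacheService;
struct CompileFinished;
type Result<T> = anyhow::Result<T>;

impl SccacheService {
    // start_compile_task() is now an async fn that returns the compile result
    // directly (hashing, cache lookup, and the local or distributed compile
    // are elided here)...
    async fn start_compile_task(self: Arc<Self>) -> Result<CompileFinished> {
        Ok(CompileFinished)
    }

    // ...so check_compiler() can spawn it as a tokio task and join it, while
    // a re-entrant caller (the Nvcc CompileCommand) can instead `.await` the
    // future inline without spawning additional tasks.
    async fn check_compiler(self: Arc<Self>) -> Result<CompileFinished> {
        let task = tokio::spawn(self.clone().start_compile_task());
        task.await?
    }
}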

Customizing CompileCommand

After making the above changes, the last step is to refactor CompileCommand into a trait that supports different implementations of its logic. This was straightforward, and I modeled the code after the relationship between Compiler/CCompiler/CCompilerImpl.

  • CompileCommand<T> is a new trait with a constraint on CommandCreatorSync (because traits with generic functions are not "object safe")
  • CompileCommandImpl is a new trait with a generic execute<T>(service: &SccacheService<T>, creator: &T) -> Result<process::Output>
  • CCompileCommand<I> is a CompileCommand<T> implementation that owns a concrete CompileCommandImpl

There are two CompileCommandImpl implementations: SingleCompileCommand which is exactly the same as the original CompileCommand struct, and NvccCompileCommand which implements the additional logic for parsing nvcc --dryrun and transforming/calling the subcommands.
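
For orientation, a condensed sketch of that relationship (bounds and arguments trimmed down, sketched here with the async_trait crate; the real definitions carry more parameters and stricter bounds):

use async_trait::async_trait;

// Stand-ins for the real sccache types.
struct SccacheService<T>(std::marker::PhantomData<T>);
trait CommandCreatorSync: Send + Sync + 'static {}
type Result<T> = anyhow::Result<T>;

// Object-safe entry point stored by the rest of the compile pipeline.
#[async_trait]
trait CompileCommand<T: CommandCreatorSync>: Send + Sync {
    async fn execute(
        &self,
        service: &SccacheService<T>,
        creator: &T,
    ) -> Result<std::process::Output>;
}

// The per-kind logic; generic over the command creator so a single impl
// (e.g. NvccCompileCommand) can be reused with any creator type.
#[async_trait]
trait CompileCommandImpl: Send + Sync {
    async fn execute<T: CommandCreatorSync>(
        &self,
        service: &SccacheService<T>,
        creator: &T,
    ) -> Result<std::process::Output>;
}

// Bridges the two: owns a concrete impl (SingleCompileCommand or
// NvccCompileCommand) and exposes it through the object-safe trait.
struct CCompileCommand<I: CompileCommandImpl> {
    cmd: I,
}

#[async_trait]
impl<T: CommandCreatorSync, I: CompileCommandImpl> CompileCommand<T> for CCompileCommand<I> {
    async fn execute(
        &self,
        service: &SccacheService<T>,
        creator: &T,
    ) -> Result<std::process::Output> {
        self.cmd.execute(service, creator).await
    }
}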

Final thoughts

I'm not a huge fan of the changes to SccacheService<T> and CompileCommand.

I don't feel great about passing around references to a global just so CompileCommand can re-enter the sccache compilation flow. I understand SccacheService<T>::compiler_info() requires mutable state in SccacheService, but needing to pass the SccacheService reference to CompileCommand (+ mock it in CompileCommand tests) doesn't pass my personal smell-test.

Comparative build times

Here are two fully uncached NVIDIA RAPIDS builds using sccache v0.7.7 and the version in this PR, with total build time dropping from 4h 21m to 2h 18m 🎉:

sccache v0.7.7 local
$ build-all -j64
build times:
librmm:                    1m  5s
libucxx:                   0m  8s
libKvikIO:                 0m  6s
libcudf:                  27m 55s
libcudf_kafka:             0m  8s
libraft:                  64m  7s
libcuvs:                  36m 18s
libcumlprims_mg:           1m  3s
libcuml:                  33m 34s
libcugraph-ops:           26m 19s
libcugraph-ops-internal:   0m 54s
libwholegraph:             4m 48s
libcugraph:               58m 48s
libcugraph_etl:            1m  5s
libcuspatial:              5m  4s
                total: 4h 21m 22s
$ sccache -s
Compile requests                     4970
Compile requests executed            4970
Cache hits                            111
Cache hits (C/C++)                     83
Cache hits (CUDA)                      28
Cache misses                         4825
Cache misses (C/C++)                 2314
Cache misses (CUDA)                  2511
Cache timeouts                          0
Cache read errors                       0
Forced recaches                         0
Cache write errors                      0
Compilation failures                    6
Cache errors                           28
Cache errors (C/C++)                   28
Non-cacheable compilations              0
Non-cacheable calls                     0
Non-compilation calls                   0
Unsupported compiler calls              0
Average cache write                 0.002 s
Average compiler                  102.657 s
Average cache read hit              0.000 s
Failed distributed compilations         0
Version (client)                0.7.7
sccache v0.8.1 distributed
$ build-all -j128
build times:
librmm:                      0m 40.736s
libucxx:                     0m 10.899s
libKvikIO:                   0m  9.123s
libcudf:                    14m 51.700s
libcudf_kafka:               0m 14.190s
libraft:                    33m 22.507s
libcuvs:                    18m 28.300s
libcumlprims_mg:             0m 44.271s
libcuml:                    12m 58.909s
libcugraph-ops:             13m 58.254s
libcugraph-ops-internal:     0m 35.547s
libwholegraph:               2m 25.598s
libcugraph:                 36m 20.686s
libcugraph_etl:              0m 44.732s
libcuspatial:                3m  6.870s
                  total: 2h 18m 1s
$ sccache -s
Compile requests                    4970
Compile requests executed          32551
Cache hits                           497
Cache hits (C/C++)                    79
Cache hits (CUBIN)                   385
Cache hits (CUDA)                     28
Cache hits (PTX)                       5
Cache misses                       32020
Cache misses (C/C++)                4824
Cache misses (CUBIN)               12150
Cache misses (CUDA)                 2511
Cache misses (PTX)                 12535
Cache hits rate                     1.53 %
Cache hits rate (C/C++)             1.61 %
Cache hits rate (CUBIN)             3.07 %
Cache hits rate (CUDA)              1.10 %
Cache hits rate (PTX)               0.04 %
Cache timeouts                         0
Cache read errors                      0
Forced recaches                        0
Cache write errors                     0
Compilation failures                   6
Cache errors                          28
Cache errors (C/C++)                  28
Non-cacheable compilations             0
Non-cacheable calls                    0
Non-compilation calls                  0
Unsupported compiler calls             0
Average cache write                0.066 s
Average compiler                  27.166 s
Average cache read hit             0.057 s
Failed distributed compilations        0
Successful distributed compiles
  192.168.1.175:10501              16032
  192.168.1.147:10501              13477
Version (client)                0.8.1

Closes #2238

@sylvestre (Collaborator) commented:

@trxcllnt I am off for two weeks

@@ -1078,7 +1078,7 @@ mod client {
use super::urls;
use crate::errors::*;

-const REQUEST_TIMEOUT_SECS: u64 = 600;
+const REQUEST_TIMEOUT_SECS: u64 = 1200;

trxcllnt (Contributor, author) commented:

5 minutes is too short to compile some complex kernels, so I bumped to 10 minutes here. It would be good to make this configurable.

@@ -118,6 +119,7 @@ object = "0.32"
rouille = { version = "3.6", optional = true, default-features = false, features = [
"ssl",
] }
shlex = "=1.3.0"

Collaborator commented:

please document why "=1.3.0"

trxcllnt (Contributor, author) replied:

I think this is just what cargo selected when I did cargo add shlex. How do you suggest I update it?

@sylvestre (Collaborator) commented:

could you please document NVCC_PREPEND_FLAGS and NVCC_APPEND_FLAGS ?
Also, this is a big change with 47 commits which can't be merged as it.

Could you please split the work into smaller PR?

@@ -390,11 +402,38 @@ where
arg
);
}

let use_preprocessor_cache_mode = {

Collaborator commented:

for example, this change seems to be independent from nvcc

@trxcllnt (Contributor, author) replied on Sep 2, 2024:

It may seem that way, but unfortunately it is not.

The generated subcommands are run with SCCACHE_DIRECT set to false, because their preprocessed output is either impossible or unnecessary to cache, so attempting to cache it just wastes cycles and clutters the logs.

There didn't seem to be a good way to disable preprocessor caching other than via the envvar, but that envvar was only read at startup rather than per file, so this change ensures it's read per file.
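
Conceptually (a simplified sketch, not the exact diff), the check moves from a process-global read at startup to a lookup against the env vars that accompany each compile request:

// Simplified sketch: the preprocessor-cache decision consults the env vars
// sent with each compile request rather than a flag read once at server
// startup. The nvcc-generated subcommands set SCCACHE_DIRECT=false in their
// request env, so only they skip direct mode.
use std::ffi::OsString;

fn use_preprocessor_cache_mode(env_vars: &[(OsString, OsString)]) -> bool {
    env_vars
        .iter()
        .find(|(k, _)| k.as_os_str() == "SCCACHE_DIRECT")
        .map(|(_, v)| v.as_os_str() != "false")
        .unwrap_or(true)
}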

)
)?;

let force_no_cache = env_vars

Collaborator commented:

same here

trxcllnt (Contributor, author) replied:

Unfortunately this is also related to this feature, and I left an explanation in this comment.

This behavior ensures we don't cache the same object file twice under two different hashes.

In the case where a call like nvcc -c x.cu -o x.cu.o is cached, x.cu.o is stored at a hash computed from the inputs.

In the case that it's not cached, we generate a list of commands to produce x.cu.o, where the final command is the host compiler call that combines the fatbin + host code, i.e.:

<all other cicc, ptxas, fatbinary etc. calls>
gcc "$tmp/x.cudafe1.cpp" -o "x.cu.o"

We do want to run this host compiler call through sccache, so we can take advantage of distributed compilation, but doing that would normally cache x.cu.o with a hash computed from the gcc compilation flow.

Since this is the object we're going to cache under the original hash computed for the outer nvcc call, and these objects can be large since they include the device code for all archs, we should avoid caching it twice.
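
Sketched in Rust for illustration (the envvar value convention shown here is an assumption, not the PR's exact code), the effect is just to skip the cache write for that inner host-compiler invocation while still producing the object:

// Illustrative only: when the Nvcc CompileCommand re-enters sccache for the
// final host-compiler call, it sets SCCACHE_NO_CACHE in that call's env so
// x.cu.o is not stored a second time under a gcc-derived hash; the outer
// nvcc entry caches it under the original nvcc hash instead.
use std::ffi::OsString;

fn should_cache_output(env_vars: &[(OsString, OsString)]) -> bool {
    !env_vars
        .iter()
        .any(|(k, v)| k.as_os_str() == "SCCACHE_NO_CACHE" && v.as_os_str() == "true")
}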

@trxcllnt (Contributor, author) commented Sep 2, 2024

could you please document NVCC_PREPEND_FLAGS and NVCC_APPEND_FLAGS ?

Sure, these envvars are documented in the CUDA compiler docs. Since they can potentially affect the hash (and the nvcc-generated subcommands), sccache needs to intercept and prepend/append them manually.
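
For example, a simplified sketch (not the PR's code; it assumes the shlex crate for splitting the variable values) of how the effective argument list could be assembled before hashing and before generating the --dryrun commands:

// Fold NVCC_PREPEND_FLAGS / NVCC_APPEND_FLAGS into the argument list before
// hashing and before generating the --dryrun commands, mirroring how nvcc
// itself treats these variables.
fn effective_nvcc_args(cli_args: &[String]) -> Vec<String> {
    let split = |var: &str| {
        std::env::var(var)
            .ok()
            .and_then(|v| shlex::split(&v))
            .unwrap_or_default()
    };
    let mut args = split("NVCC_PREPEND_FLAGS");
    args.extend(cli_args.iter().cloned());
    args.extend(split("NVCC_APPEND_FLAGS"));
    args
}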

Could you please split the work into smaller PR?

This was my initial goal, but sticking to it was difficult. If I had to break it down to relatively isolated chunks, they could be:

  1. Server changes:
    • Refactor start_compile_task() into an async function
    • spawn the async task and do the stack unwinding in check_compiler()
    • refactor the Response body to be a future, and join the tokio task with the outer stream
      • This isn't strictly necessary; it was borne out of an attempt to solve the memory leak that occurs when sccache is built for the default Linux target using glibc. Only later did I realize the leak does not occur with the unknown-linux-musl target.
        I left it in because it does seem like the "correct" thing to do, and in the context of this PR, it does reduce the amount of memory that leaks.
  2. Add an SCCACHE_NO_CACHE envvar
  3. Check the SCCACHE_DIRECT envvar per file
  4. Refactor CompileCommand into a trait
    • add a SingleCompileCommand impl
    • update all the generate_compile_commands() functions to return the boxed trait
    • make CompileCommand::execute() accept a reference to SccacheService<T>
  5. The rest of this PR

The challenge I had was that if broken into separate PRs, none of these changes make much sense on their own, and (more importantly) may not even be the way they should be implemented.

For example I anticipate reviewer questions like, "why make CompileCommand into a trait when there's only one implementation?" Or, "why change CompileCommand::execute() to accept a reference to the SccacheService that isn't used?" The only answers would be, "this is necessary for the nvcc PR," but this may also not be the right way to implement this functionality.

Similarly for the SCCACHE_DIRECT and SCCACHE_NO_CACHE envvars, it's possible these changes aren't desired and the functionality would be better implemented e.g. by refactoring the compile pipeline to accept additional configuration options.

So in conclusion, my main goal was to get the feature fully implemented and PR'd, then discuss details like these with the context of why they're necessary. I am 100% OK with splitting this out into separate PRs if that makes things easier to merge, I just want to make sure there's agreement on the implementation details first.

@glandium (Collaborator) commented:

Hi, thank you so much for the thorough work you’ve put into this PR! 🙌 The level of detail and the description you provided in your comments are super helpful to understand the scope and rationale behind the changes. Really appreciate the thought you’ve given to this!

My first thought here would be that it would help with the review process if the changes were broken down into separate commits based on the different features and fixes you’ve described. This doesn’t necessarily mean separate PRs, but organizing each distinct change into its own commit and force-pushing onto your branch would make it easier to go through and review each part more methodically. As it stands, the 47 commits make it a bit tricky to trace logical units of changes, and reviewing the full diff in one go is a tad overwhelming. Splitting these would definitely make the review process smoother.

That being said, coming from a place where I know nothing of nvcc nor why support for it was added, my understanding from what you wrote (and thanks again for the great description) is that nvcc seems more akin to a build system like make or cargo than a compiler like gcc or rustc. Given that, one question I have is whether it makes sense for sccache to support nvcc at all, considering it doesn't wrap other build systems like make or cargo. Just to clarify, I'm not suggesting that the individual tools nvcc invokes shouldn't be supported. I'm more curious about nvcc itself and whether wrapping it makes sense for sccache. Are there specific features of nvcc that aren't already handled by the commands it launches that would justify sccache wrapping it? It seems like nvcc invokes gcc from $PATH, allowing /some/ form of wrapping by altering $PATH; does it do the same for the other tools, or does it derive their path from where nvcc is?

@trxcllnt (Contributor, author) commented Sep 20, 2024

organizing each distinct change into its own commit and force-pushing onto your branch

Sure, I don't mind squashing the intermediate commits into logical units.

coming from a place where I know nothing of nvcc nor why support for it was added
nvcc seems more akin to a build system like make or cargo than a compiler like gcc or rustc

nvcc is the NVIDIA CUDA Compiler Driver, one of two (official) ways to compile CUDA C/C++ sources to binaries that run on NVIDIA GPUs.

nvcc is definitely not a build system, though I can understand why it might seem like one.

nvcc's design is spiritually similar to other compiler drivers like gcc/g++:

  • gcc delegates compilation to cc1, assembly to as, and linking to ld.
  • nvcc delegates GPU device compilation to cicc, device assembly to ptxas, and device linking to fatbinary.

nvcc also relies on a host compiler (gcc, clang, msvc, nvc, etc.) to compile the host CPU code into a single object file that contains the embedded device code (conceptually, this is just another step in the driver's compilation DAG).

Just like other compiler drivers, nvcc's basic run mode is source input -> binary output, and is thus compatible with compiler caching via ccache/sccache.

I'm not suggesting that the individual tools nvcc invokes shouldn’t be supported. I'm more curious about nvcc itself and whether wrapping it makes sense for sccache. Are there specific features of nvcc that aren’t already handled by the commands it launches that would justify sccache wrapping it?

Just like with gcc and g++, users don't typically invoke the underlying device compiler/assembler/linker directly (and nvcc is the only tool that can reliably generate valid calls to them for any given CUDA toolkit version).


I hope those answers help your understanding, but from your line of questioning I sense there's some deeper context I can provide as justification for this PR.

I may not have made this clear from my issue and PR description, but the fundamental reason sccache-dist can distribute compilations for gcc/clang, but not nvcc, is because the former support compiling preprocessed input.

sccache-dist is based on the assumption that it can preprocess a source file on the client, then send and compile the preprocessed file on a machine in the build cluster.

nvcc does not support this run mode, because there is no single -E output it can produce that is valid for all possible GPU architectures the user may want to target. It must preprocess the source file individually for each architecture, then compile and assemble that into the arch-specific binary.

A very rough analogy: it's as if gcc supported compiling universal multiarch binaries for x86/ARM/RISCV. The preprocessor output would be different for each architecture, so gcc ... -E x.c | gcc -c -o x.o - wouldn't make any sense.

So if sccache depends on the ability to compile preprocessed input, and nvcc literally cannot do that, we're left with two options:

  1. sccache can only be used to cache nvcc compilations, but must always compile locally (the status quo)
  2. sccache can decompose the nvcc compiler driver into its constituent compilation and assembly stages, which do conform to sccache's requirements for distributed compilation (this PR)

The past few years we (the NVIDIA RAPIDS team) have seen huge improvements from shared build caches just using sccache + nvcc in local mode. Now we're trying to take it to the next level by deploying a distributed build cluster for devs/CI, and sccache-dist lacking support for nvcc is the main blocker to achieving that.

@glandium (Collaborator) commented:

but the fundamental reason sccache-dist can distribute compilations for gcc/clang, but not nvcc, is because the former support compiling preprocessed input.

The fundamental reason sccache-dist preprocesses input is that it's the most efficient way to ensure everything required for compilation, aside from the toolchain, is available on the host where the job is distributed. If we could assume that the toolchain and all system headers were available there, at the right location, preprocessing wouldn't be necessary in the first place. That nvcc doesn't preprocess isn't the key issue. What's more relevant, though, is that this same constraint likely applies to some if not all of the commands it runs under the hood.

Preprocessing is actually more crucial to the caching part, and the way nvcc seems to operate makes me wonder if the current caching sccache does for nvcc is fully sound. Without looking into the exact implementation, I'm concerned there could be flaws, for example, when modifying headers that nvcc-invoked gcc calls rely on. Or when the gcc version changes. Such hidden issues might not surface during typical development cycles, but could lead to unexpected results at the most unexpected moments.

This is the first thing that worries me; entrenching the wrapping of nvcc, and the consequences that follow (the need for re-entrancy for your use case), is not really enticing.

The second thing that worries me is that relying on the output from nvcc --dry-run to infer what commands it would run sounds brittle. Does it handle escaping reliably? Will sccache be able to parse those commands correctly, or could edge cases slip through due to inadequate or missing escaping?

Which brings me back to earlier questions: does nvcc also do anything unique itself, or is everything handled through the subcommands it executes? Does it call all its subcommands via $PATH or via absolute paths it derives from its own location?

@trxcllnt (Contributor, author) commented Sep 24, 2024

The fundamental reason sccache-dist preprocesses input is that it's the most efficient way to ensure everything required for compilation, aside from the toolchain, is available on the host where the job is distributed. If we could assume that the toolchain and all system headers were available there, at the right location, preprocessing wouldn't be necessary in the first place.

Yes, I am aware of how sccache-dist works and why.

That nvcc doesn't preprocess isn't the key issue.

It is impractical, undesired, and out of scope of this PR to remove the client-preprocessing step from sccache-dist. For all intents and purposes, this is the reason nvcc compilations cannot presently be distributed.

What's more relevant, though, is that this same constraint likely applies to some if not all of the commands it runs under the hood.

Please review the steps in the PR description above the mermaid diagram, specifically steps 2, 3, and 4. These represent the internal host compiler and CUDA front-end (cudafe++) preprocessing steps. The implication of these commands is that the input to cicc is the device-architecture-specific preprocessed form of the source file, and is suitable both for computing the PTX file's hash, and for distributing the cicc call. Similarly, the result PTX file can be used to compute the hash for the assembled cubin, and for distributing the ptxas call.

Preprocessing is actually more crucial to the caching part, and the way nvcc seems to operate makes me wonder if the current caching sccache does for nvcc is fully sound.

Current sccache (master branch) nvcc caching is almost fully sound, with one possible edge case that to my knowledge we've never hit in practice. That said, the changes in this branch could actually resolve that too, which I will describe below.

Without looking into the exact implementation, I'm concerned there could be flaws, for example, when modifying headers that nvcc-invoked gcc calls rely on.

gcc headers are included in the nvcc -E output since they are used for the host-side (CPU) compilation, and thus will be considered in the object hash.

Or when the gcc version changes.

This is the edge case. Technically sccache should consider both nvcc --version and <host compiler> --version when computing the hash for an nvcc compilation.

It currently doesn't, because that requires predicting which host compiler nvcc would choose when the -ccbin flag is omitted. The most reliable way for sccache to do this would be executing nvcc --dryrun like in this PR, but I assume that wasn't done because it was more effort than it's worth.

Practically speaking, changing host compiler version either involves changing a flag (nvcc -ccbin), or results in the host compiler's headers being different from other versions, both of which lead to different computed hashes.

However, this PR can ensure that never happens. By decomposing the nvcc call into its constituent commands, the final host compilation can run through sccache like any other compilation, at which point sccache considers its version in the computed hash.

This is the first thing that worries me, and entrenching the wrapping of nvcc, and the consequences that follow (need for re-entrancy for your use case) are not really enticing.

Here we agree. This approach was neither pleasant to consider nor implement, but in my analysis represented the smallest ratio of sccache changes to new features.

As a quick reminder, these are the new user-facing features enabled by this PR:

  1. Caching individual PTX/cubin compilations
  2. Parallel device compilations are now scheduled via the jobserver (when executed via sccache nvcc ...)
  3. Distributed nvcc compilations via sccache-dist

The first two are available to sccache clients running in local-compilation mode, and were consequences of the changes necessary to distribute nvcc compilations. However, they are huge quality-of-life improvements for relatively common scenarios encountered by users, even without distributed compilation.

Integrating software that wasn't designed with my use-case in mind often involves compromises, and selecting the "least-bad" option from a set of bad options is sometimes necessary. I am absolutely open to suggestions on alternative implementations if you think there's something I've missed.

The second thing that worries me is that relying on the output from nvcc --dry-run to infer what commands it would run sounds brittle.

Yes, it is brittle. However, other tools rely on the stability of the --dryrun output, and we have tested that the logic in this PR is valid from CUDA Toolkit v9.0 to the present.

I am working with the NVIDIA compiler team to add a feature to future nvcc versions to produce its compilation DAG in a well-known format (e.g. graphviz dot) rather than shell format. When that lands, we can revisit the implementation here and use the nvcc DAG directly, rather than parsing one from the --dryrun output.

Does it handle escaping reliably?

Yes.

Will sccache be able to parse those commands correctly, or could edge cases slip through due to inadequate or missing escaping?

That depends on the parser. I chose shlex, which allows us to error early if the lines cannot be parsed.
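
A minimal illustration of that error-early behavior (using the shlex crate's split, which returns None for input it cannot tokenize):

fn main() {
    // A well-formed --dryrun command line tokenizes cleanly:
    assert_eq!(
        shlex::split(r#"ptxas -arch=sm_80 -m64 "x.compute_80.ptx""#),
        Some(vec![
            "ptxas".to_string(),
            "-arch=sm_80".to_string(),
            "-m64".to_string(),
            "x.compute_80.ptx".to_string(),
        ])
    );
    // Unbalanced quoting yields None, so sccache can fail fast instead of
    // guessing at the command nvcc meant to run:
    assert_eq!(shlex::split(r#"gcc "unterminated"#), None);
}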

does nvcc also do anything unique itself, or is everything handled through the subcommands it executes?

I believe invoking nvcc with the --threads option may instruct the underlying commands to run multi-threaded, or the CUDA compiler team reserves the right to make that change in the future, but I am not aware of anything else it does that isn't captured in the --dryrun output.

I briefed the CUDA compiler team on our plans to implement this feature, and they didn't raise any issues with the current approach. They are also aware of this PR, and we are working on plans to ensure these features are both stable and easier to achieve (e.g. enabling nvcc to output a DAG in a well-known format).

Does it call all its subcommands via $PATH or via absolute paths it derives from its own location?

How nvcc locates its binaries depends on many factors; however, the source of truth is the --dryrun output. This is represented in step 1 in the PR description above the mermaid diagram. nvcc --dryrun prints a list of environment variables that need to be set in order to execute the commands that follow, and this PR parses those lines to add or update the list of environment variables used when re-entering sccache for each subcommand.
