Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use dynamic CUDA wheels on CUDA 11 #2548

Merged
merged 3 commits into from
Jan 22, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 9 additions & 18 deletions ci/build_wheel.sh
Original file line number Diff line number Diff line change
Expand Up @@ -21,24 +21,15 @@ rapids-generate-version > ./VERSION

cd "${package_dir}"

case "${RAPIDS_CUDA_VERSION}" in
12.*)
EXCLUDE_ARGS=(
--exclude "libcublas.so.12"
--exclude "libcublasLt.so.12"
--exclude "libcurand.so.10"
--exclude "libcusolver.so.11"
--exclude "libcusparse.so.12"
--exclude "libnvJitLink.so.12"
--exclude "libucp.so.0"
)
;;
11.*)
EXCLUDE_ARGS=(
--exclude "libucp.so.0"
)
;;
esac
EXCLUDE_ARGS=(
--exclude "libcublas.so.*"
--exclude "libcublasLt.so.*"
--exclude "libcurand.so.*"
--exclude "libcusolver.so.*"
--exclude "libcusparse.so.*"
--exclude "libnvJitLink.so.*"
--exclude "libucp.so.*"
)

if [[ ${package_name} != "libraft" ]]; then
EXCLUDE_ARGS+=(
Expand Down
11 changes: 0 additions & 11 deletions ci/build_wheel_libraft.sh
Original file line number Diff line number Diff line change
Expand Up @@ -28,16 +28,5 @@ export PIP_NO_BUILD_ISOLATION=0

RAPIDS_PY_CUDA_SUFFIX="$(rapids-wheel-ctk-name-gen ${RAPIDS_CUDA_VERSION})"

case "${RAPIDS_CUDA_VERSION}" in
12.*)
EXTRA_CMAKE_ARGS="-DUSE_CUDA_MATH_WHEELS=ON"
;;
11.*)
EXTRA_CMAKE_ARGS="-DUSE_CUDA_MATH_WHEELS=OFF"
;;
esac

export SKBUILD_CMAKE_ARGS="${EXTRA_CMAKE_ARGS}"

ci/build_wheel.sh libraft ${package_dir} cpp
ci/validate_wheel.sh ${package_dir} final_dist libraft
15 changes: 0 additions & 15 deletions ci/validate_wheel.sh
Original file line number Diff line number Diff line change
Expand Up @@ -9,27 +9,12 @@ package_name=$3

RAPIDS_CUDA_MAJOR="${RAPIDS_CUDA_VERSION%%.*}"

# some packages are much larger on CUDA 11 than on CUDA 12
PYDISTCHECK_ARGS=()
if [[ "${package_name}" == "libraft" ]]; then
if [[ "${RAPIDS_CUDA_MAJOR}" == "11" ]]; then
PYDISTCHECK_ARGS+=(
--max-allowed-size-compressed '750M'
)
else
PYDISTCHECK_ARGS+=(
--max-allowed-size-compressed '100M'
)
fi
fi

cd "${package_dir}"

rapids-logger "validate packages with 'pydistcheck'"

pydistcheck \
--inspect \
"${PYDISTCHECK_ARGS[@]}" \
"$(echo ${wheel_dir_relative_path}/*.whl)"

rapids-logger "validate packages with 'twine'"
Expand Down
5 changes: 4 additions & 1 deletion dependencies.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -358,11 +358,14 @@ dependencies:
- nvidia-curand-cu12
- nvidia-cusolver-cu12
- nvidia-cusparse-cu12
# CUDA 11 does not provide wheels, so use the system libraries instead
- matrix:
cuda: "11.*"
use_cuda_wheels: "true"
packages:
- nvidia-cublas-cu11
- nvidia-curand-cu11
- nvidia-cusolver-cu11
- nvidia-cusparse-cu11
# if use_cuda_wheels=false is provided, do not add dependencies on any CUDA wheels
# (e.g. for DLFW and pip devcontainers)
- matrix:
Expand Down
31 changes: 11 additions & 20 deletions python/libraft/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -22,8 +22,6 @@ project(
LANGUAGES CXX
)

option(USE_CUDA_MATH_WHEELS "Use the CUDA math wheels instead of the system libraries" OFF)

# Check if raft is already available. If so, it is the user's responsibility to ensure that the
# CMake package is also available at build time of the Python raft package.
find_package(raft "${RAPIDS_VERSION}")
Expand All @@ -35,14 +33,8 @@ endif()
unset(raft_FOUND)

# --- CUDA --- #
find_package(CUDAToolkit REQUIRED)
set(CUDA_STATIC_RUNTIME ON)
set(CUDA_STATIC_MATH_LIBRARIES ON)
if(CUDAToolkit_VERSION VERSION_GREATER_EQUAL 12.0)
set(CUDA_STATIC_MATH_LIBRARIES OFF)
elseif(USE_CUDA_MATH_WHEELS)
message(FATAL_ERROR "Cannot use CUDA math wheels with CUDA < 12.0")
endif()
set(CUDA_STATIC_MATH_LIBRARIES OFF)

# --- RAFT ---#
set(BUILD_TESTS OFF)
Expand All @@ -52,14 +44,13 @@ set(RAFT_COMPILE_LIBRARY ON)

add_subdirectory(../../cpp raft-cpp)

if(NOT CUDA_STATIC_MATH_LIBRARIES AND USE_CUDA_MATH_WHEELS)
set_property(
TARGET raft_lib
PROPERTY INSTALL_RPATH
"$ORIGIN/../nvidia/cublas/lib"
"$ORIGIN/../nvidia/curand/lib"
"$ORIGIN/../nvidia/cusolver/lib"
"$ORIGIN/../nvidia/cusparse/lib"
"$ORIGIN/../nvidia/nvjitlink/lib"
)
endif()
# assumes libraft.so is installed 2 levels deep, e.g. site-packages/libraft/lib64/libraft.so
set_property(
TARGET raft_lib
PROPERTY INSTALL_RPATH
"$ORIGIN/../../nvidia/cublas/lib"
"$ORIGIN/../../nvidia/curand/lib"
"$ORIGIN/../../nvidia/cusolver/lib"
"$ORIGIN/../../nvidia/cusparse/lib"
"$ORIGIN/../../nvidia/nvjitlink/lib"
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice I've added an additional ../ in here... these RPATHs were wrong in #2531 (confirmed by inspecting the paths more closely tonight).

Also confirmed these will work for both CUDA 11 and CUDA 12 wheels.

I suspect we didn't notice this in CI before because the corresponding system-installed libraries were being found instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to ensure CI will fail in the future if these are incorrect?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about that... there are options but I don't love any of them.

Ways I can think of to test that an RPATH actually ends up getting used:

  • set LD_DEBUG=libs LD_DEBUG_OUTPUT=/tmp/logs or something to generate debug logs from ld, then grep those to figure out if libraries were loaded from the expected place
  • ensure there aren't any more of these libraries on the system by deleting them (we used to do something similar for UCX)... if loading works, then it must be that the one remaining copy of e.g. libcublas.so was used
  • hide the those system locations by manipulating LD_LIBRARY_PATH, forcing ld to either find the wheel-installed libraries or raise an error
  • start a long-lived process that loads the library, then use lsof to check that the one from site-packages was loaded (https://superuser.com/a/310205/1798685

I definitely don't think we should hold up this PR over this question, but it would be helpful to find a way to catch things like this in CI if we could find a way that isn't too complex.

Copy link
Contributor

@bdice bdice Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might be able to try building CI images without any system CUDA libraries? If we migrate all of RAPIDS to use CUDA wheels, that could be a step in the right direction, but it might also require other non-RAPIDS packages like cupy to use CUDA wheels, too. 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we first put together the wheels and the images, I actually pushed for us to base the citestwheel images on a base image that did not contain the CUDA libraries. I don't know if that's easily discoverable in the history anywhere. In any case, @bdice's suspicion here is correct; this wound up not working because you couldn't use cupy (or numba) in those images since they require the system libraries.

4 changes: 3 additions & 1 deletion python/libraft/pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -110,6 +110,8 @@ matrix-entry = "cuda_suffixed=true;use_cuda_wheels=true"

[tool.pydistcheck]
select = [
# NOTE: size threshold is managed via CLI args in CI scripts
"distro-too-large-compressed",
]

# PyPI limit is 100 MiB, fail CI before we get too close to that
max_allowed_size_compressed = '75M'
Loading