Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dockerfile.ubi: use CUDA 12.1 #130

Closed
wants to merge 6 commits into from

Conversation

dtrifiro
Copy link

  • Dockerfile.ubi: use CUDA 12.1 instead of 12.4

Copy link

openshift-ci bot commented Aug 13, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link

openshift-ci bot commented Aug 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dtrifiro

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dtrifiro
Copy link
Author

/test all

@dtrifiro dtrifiro mentioned this pull request Aug 13, 2024
@dtrifiro
Copy link
Author

@maxdebayser as you can see from CI, the build fails with CUDA 12.1, here's the relevant message from the logs:

FAILED: CMakeFiles/_C.dir/csrc/quantization/cutlass_w8a8/scaled_mm_c2x.cu.o 
ccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DPy_LIMITED_API=3 -DTORCH_EXTENSION_NAME=_C -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -D_C_EXPORTS -I/workspace/csrc -I/workspace/build/temp.linux-x86_64-cpython-311/_deps/cutlass-src/include -isystem /usr/include/python3.11 -isystem /opt/vllm/lib64/python3.11/site-packages/torch/include -isystem /opt/vllm/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -DONNX_NAMESPACE=onnx_c2 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++17 "--generate-code=arch=compute_70,code=[sm_70]" "--generate-code=arch=compute_75,code=[sm_75]" "--generate-code=arch=compute_80,code=[sm_80]" "--generate-code=arch=compute_86,code=[sm_86]" "--generate-code=arch=compute_89,code=[sm_89]" "--generate-code=arch=compute_90,code=[sm_90]" "--generate-code=arch=compute_90,code=[compute_90]" -Xcompiler=-fPIC --expt-relaxed-constexpr -DENABLE_FP8 --threads=2 -D_GLIBCXX_USE_CXX11_ABI=0 -MD -MT CMakeFiles/_C.dir/csrc/quantization/cutlass_w8a8/scaled_mm_c2x.cu.o -MF CMakeFiles/_C.dir/csrc/quantization/cutlass_w8a8/scaled_mm_c2x.cu.o.d -x cu -c /workspace/csrc/quantization/cutlass_w8a8/scaled_mm_c2x.cu -o CMakeFiles/_C.dir/csrc/quantization/cutlass_w8a8/scaled_mm_c2x.cu.o
[27/32] Building CUDA object CMakeFiles/_C.dir/csrc/quantization/cutlass_w8a8/scaled_mm_entry.cu.o
[28/32] Building CUDA object CMakeFiles/_C.dir/csrc/attention/attention_kernels.cu.o
[29/32] Building CUDA object CMakeFiles/_C.dir/csrc/quantization/gptq_marlin/gptq_marlin.cu.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/workspace/setup.py", line 456, in <module>
    setup(
  File "/opt/vllm/lib64/python3.11/site-packages/setuptools/__init__.py", line 87, in setup
    return distutils.core.setup(**attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.11/site-packages/setuptools/_distutils/core.py", line 185, in setup
    return run_commands(dist)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.11/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
    dist.run_commands()
  File "/opt/vllm/lib64/python3.11/site-packages/setuptools/_distutils/dist.py", line 968, in run_commands
    self.run_command(cmd)
  File "/opt/vllm/lib64/python3.11/site-packages/setuptools/dist.py", line 1217, in run_command
    super().run_command(command)
  File "/opt/vllm/lib64/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
    cmd_obj.run()
  File "/opt/vllm/lib64/python3.11/site-packages/wheel/_bdist_wheel.py", line 378, in run
    self.run_command("build")
  File "/opt/vllm/lib64/python3.11/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
    self.distribution.run_command(command)
  File "/opt/vllm/lib64/python3.11/site-packages/setuptools/dist.py", line 1217, in run_command
    super().run_command(command)
  File "/opt/vllm/lib64/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
    cmd_obj.run()
  File "/opt/vllm/lib64/python3.11/site-packages/setuptools/_distutils/command/build.py", line 132, in run
    self.run_command(cmd_name)
  File "/opt/vllm/lib64/python3.11/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
    self.distribution.run_command(command)
  File "/opt/vllm/lib64/python3.11/site-packages/setuptools/dist.py", line 1217, in run_command
    super().run_command(command)
  File "/opt/vllm/lib64/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
    cmd_obj.run()
  File "/opt/vllm/lib64/python3.11/site-packages/setuptools/command/build_ext.py", line 84, in run
    _build_ext.run(self)
  File "/opt/vllm/lib64/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
    self.build_extensions()
  File "/workspace/setup.py", line 231, in build_extensions
    subprocess.check_call(["cmake", *build_args], cwd=self.build_temp)
  File "/usr/lib64/python3.11/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)

@dtrifiro
Copy link
Author

As a note:
Upstream dockerfile uses 12.4: https://github.com/opendatahub-io/vllm/blob/main/Dockerfile#L8
although this is overridden for release wheels https://github.com/opendatahub-io/vllm/blob/main/.buildkite/release-pipeline.yaml#L6

@openshift-merge-robot
Copy link

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link

openshift-ci bot commented Sep 3, 2024

@dtrifiro: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/smoke-test 0703e45 link true /test smoke-test
ci/prow/pr-image-mirror 0703e45 link true /test pr-image-mirror
ci/prow/images 0703e45 link true /test images
ci/prow/rocm-pr-image-mirror 0703e45 link true /test rocm-pr-image-mirror
ci/prow/cuda-pr-image-mirror 0703e45 link true /test cuda-pr-image-mirror

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@dtrifiro
Copy link
Author

Closing as stale

@dtrifiro dtrifiro closed this Sep 12, 2024
@dtrifiro dtrifiro deleted the set-cuda-to-12.1 branch September 12, 2024 09:32
prarit pushed a commit to prarit/vllm that referenced this pull request Oct 18, 2024
* First version

* Revert error.

While there, add missing finalize.

* Use the correct defaults for ROCm.

Increase sampling area to capture crossover.

* Scope end_sync as well.

* Guard only volatile keyword for ifndef USE_ROCM

* Document crossover
prarit pushed a commit to prarit/vllm that referenced this pull request Oct 18, 2024
* Per @iotamudelta suggestion until the deadlocks issue is better understood
Revert "Make CAR ROCm 6.1 compatible. (opendatahub-io#137)"

This reverts commit 4d2dda6.

* Per @iotamudelta suggestion until the deadlocks issue is better understood
Revert "Optimize custom all reduce (opendatahub-io#130)"

This reverts commit 636ff01.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants