
[Bugfix] Allow vllm to still work if triton is not installed. #6786

Merged
merged 18 commits into vllm-project:main Jul 29, 2024

Conversation

@tdoublep (Member) commented Jul 25, 2024

We currently need to add triton as a dependency for all of the non-CUDA backends, because triton is still imported in various places throughout the library regardless of the backend.

This PR adds a function maybe_import_triton that checks whether Triton is available in the environment. If it is not, it replaces Triton with a mocked-up version that allows all the vLLM code to be imported.

An alternative approach might be to make the import conditional on the device. I have a feeling that would introduce a fair amount of additional complexity (e.g., we would need to import triton only after the engine has been constructed), but I haven't worked through it in full.
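
A minimal sketch of what such a helper might look like is below. This is an illustration for this write-up, not the exact code from the PR; the mock's attribute set and the reported 0.0.0 version are assumptions about what gets touched at import time.

```python
# Hypothetical sketch of maybe_import_triton (not the exact PR code).
# If triton is installed, return the real modules; otherwise return a
# lightweight stand-in whose decorators are no-ops, so modules that define
# @triton.jit kernels can still be imported on Triton-less backends.
import types
from importlib.util import find_spec


def maybe_import_triton():
    if find_spec("triton") is not None:
        import triton
        import triton.language as tl
        return triton, tl

    # Mocked-up Triton: reports version 0.0.0 and turns kernel decorators
    # into pass-throughs so that importing kernel modules does not fail.
    mock_tl = types.SimpleNamespace(constexpr=int, dtype=type)
    mock_triton = types.SimpleNamespace(
        __version__="0.0.0",
        language=mock_tl,
        jit=lambda fn=None, **kw: fn if fn is not None else (lambda f: f),
        autotune=lambda *a, **kw: (lambda f: f),
        heuristics=lambda *a, **kw: (lambda f: f),
        Config=lambda *a, **kw: None,
    )
    return mock_triton, mock_tl


triton, tl = maybe_import_triton()
```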


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI is run, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, as it is required for merging (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@comaniac (Collaborator) commented:

I feel this approach is a bit hacky. In general, we should avoid importing triton kernels when triton is not available. If we really have to mock triton, is it possible to keep the API compatible so that we don't need the version != 0.0.0 guard?

@tdoublep (Member, Author) commented Jul 25, 2024

If we really have to mock triton, is it possible to keep the API compatible so that we don't need version != 0.0.0 guard?

@comaniac that guard is actually there to keep the code path enabled when using the mock (or triton >= 2.1.0); the mock is already compatible. We could just remove the guard entirely imo, because the triton version should be controlled via requirements.txt.

I will try a few things to see what it would take to do it without the mock entirely. My initial thought was that the mock would be the least-intrusive way to achieve this.

@tdoublep (Member, Author) commented:

@comaniac I did another pass through this. Changes are:

  • There is no longer any need to mock Triton.
  • Instead, I just took care to import the modules that contain Triton code only if Triton is available (a sketch of the pattern is shown after this list).
  • I decided to factor Fp8MoEMethod into its own file to make the conditional import a bit cleaner. It could be kept inside fp8.py, but it would then need to sit inside an if HAS_TRITON block, which creates ugly indentation imo.
  • Aside from that, all changes are minimal.
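
Roughly, the conditional-import pattern is the following (an illustrative sketch: the HAS_TRITON flag name comes from the comment above, but the exact module path and where the flag is defined are assumptions, not the PR's actual code):

```python
# Illustrative conditional-import pattern (the imported path is an example,
# not necessarily one of the exact modules touched by this PR).
from importlib.util import find_spec

HAS_TRITON = find_spec("triton") is not None

if HAS_TRITON:
    # The fused_moe kernel module (which imports triton at module level) is
    # only pulled in when Triton is actually present in the environment.
    from vllm.model_executor.layers.fused_moe import fused_moe  # noqa: F401
```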

@tdoublep (Member, Author) commented:

/ready

@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jul 26, 2024
@comaniac (Collaborator) left a comment:

It looks much cleaner to me. Thanks!

@@ -239,188 +241,6 @@ def apply(self,
use_per_token_if_dynamic=False)


class Fp8MoEMethod(FusedMoEMethodBase):
Collaborator:

I'd prefer to keep Fp8MoEMethod in fp8.py instead of creating another file. Can we just lazy import fused_moe like UnquantizedFusedMoEMethod (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L91) does?

Member (Author):

Yeah, good point. I thought it was more complicated because we need to inherit from FusedMoEMethodBase, but that base class doesn't actually involve any Triton import. Have pushed the change.
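
For reference, the lazy-import pattern discussed in this thread looks roughly like the following (a sketch only: the method signature and the kernel call are simplified placeholders, not the actual fp8.py code):

```python
# Sketch of the lazy-import pattern: the Triton-backed kernel module is
# imported inside the method that needs it, so importing fp8.py itself never
# requires Triton to be installed.
from vllm.model_executor.layers.fused_moe.layer import FusedMoEMethodBase


class Fp8MoEMethod(FusedMoEMethodBase):

    def apply(self, layer, x, router_logits, top_k):
        # Deferred import: only evaluated when the MoE layer actually runs,
        # i.e. on a backend where Triton is available.
        from vllm.model_executor.layers.fused_moe import fused_moe

        # Simplified call; the real kernel path takes FP8 scales and more
        # configuration than shown here.
        return fused_moe(x, layer.w13_weight, layer.w2_weight, router_logits,
                         top_k, renormalize=True)
```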

MAX_TRITON_N_COLS = 131072


def get_num_triton_sampler_splits(n_cols: int) -> int:
Collaborator:

Off topic: @Yard1 I feel this should be a general function for all triton kernels instead of just the sampler. Do you think it makes sense to rename it to get_num_triton_input_chunks or something similar, and use it here as well?
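
For illustration, the generalized helper being proposed could look something like this (the body is an assumption based on what the sampler helper is for, namely ceil-dividing the column count by Triton's limit; it is not code from this PR):

```python
import math

# Triton kernels are limited in how many columns they can process per launch,
# so wide inputs are handled in multiple chunks/splits.
MAX_TRITON_N_COLS = 131072


def get_num_triton_input_chunks(n_cols: int) -> int:
    # Hypothetical generalized form of get_num_triton_sampler_splits: the
    # number of chunks needed so that each chunk has at most
    # MAX_TRITON_N_COLS columns.
    return math.ceil(n_cols / MAX_TRITON_N_COLS)
```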

Collaborator:

I think that would make sense

Member (Author):

Just to clarify: is this something you'd like to have addressed in this PR?

Collaborator:

No, it's not necessary. We can merge this PR first.

@comaniac (Collaborator) left a comment.

@comaniac merged commit 9a7e2d0 into vllm-project:main Jul 29, 2024
72 checks passed
tjohnson31415 added a commit to tjohnson31415/vllm that referenced this pull request Jul 30, 2024
* upstream/main: (66 commits)
  [Bugfix] Fix PaliGemma MMP (vllm-project#6930)
  [TPU] Fix greedy decoding (vllm-project#6933)
  [Kernel] Tuned int8 kernels for Ada Lovelace (vllm-project#6848)
  [Kernel] Fix marlin divide-by-zero warnings (vllm-project#6904)
  [ci] GHA workflow to remove ready label upon "/notready" comment (vllm-project#6921)
  [Kernel] Remove unused variables in awq/gemm_kernels.cu (vllm-project#6908)
  [Frontend] New `allowed_token_ids` decoding request parameter (vllm-project#6753)
  [Bugfix] Allow vllm to still work if triton is not installed. (vllm-project#6786)
  [TPU] Support tensor parallelism in async llm engine (vllm-project#6891)
  [Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel (vllm-project#6901)
  [Core] Reduce unnecessary compute when logprobs=None (vllm-project#6532)
  [Kernel] Tuned FP8 Kernels for Ada Lovelace (vllm-project#6677)
  [Model] Initialize support for InternVL2 series models (vllm-project#6514)
  [Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 (vllm-project#6871)
  Add Nemotron to PP_SUPPORTED_MODELS (vllm-project#6863)
  [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (vllm-project#6795)
  [TPU] Reduce compilation time & Upgrade PyTorch XLA version  (vllm-project#6856)
  [Docs] Add RunLLM chat widget (vllm-project#6857)
  [Model] Initial support for BLIP-2 (vllm-project#5920)
  [CI/Build][Doc] Update CI and Doc for VLM example changes (vllm-project#6860)
  ...
Duyi-Wang pushed a commit to Duyi-Wang/vllm that referenced this pull request Aug 1, 2024
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
Labels
ready ONLY add when PR is ready to merge/full CI is needed
3 participants