Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bugfix] Allow vllm to still work if triton is not installed. #6786

Merged
merged 18 commits into from
Jul 29, 2024

Conversation

tdoublep
Copy link
Member

@tdoublep tdoublep commented Jul 25, 2024

We are currently needing to add triton as a dependency to all of the non-CUDA backends. This is because importing triton is still performed in various places throughout the library regardless of the backend.

This PR adds a function maybe_import_triton which will check to see if Triton is available in the environment. If it is not, it will replace Triton with a mocked up version that allows all the vLLM code to be imported.

An alternative approach might be to try to make the import conditional on the device. I have a feeling that would introduce a fair amount of additional complexity (e.g., we would need to only import triton after the engine has been constructed) but haven't worked through it in full.

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Copy link

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which consists a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of default ones by unblocking the steps in your fast-check build on Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337 DarkLight1337 requested a review from youkaichao July 25, 2024 12:59
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@comaniac
Copy link
Collaborator

I feel this approach is a bit hacky. In general, we should avoid import triton kernels when triton is not available. If we really have to mock triton, is it possible to keep the API compatible so that we don't need version != 0.0.0 guard?

@tdoublep
Copy link
Member Author

tdoublep commented Jul 25, 2024

If we really have to mock triton, is it possible to keep the API compatible so that we don't need version != 0.0.0 guard?

@comaniac that guard is actually there to keep the code in when using the mock (or triton >= 2.1.0). the mock is already compatible. we could just remove the guard entirely imo because the triton version should be controlled via the requirement.txt

I will try a few things to see what it would take to do it without the mock entirely. My initial thought was that the mock would be the least-intrusive way to achieve this.

tdoublep added 2 commits July 26, 2024 04:18
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@tdoublep tdoublep force-pushed the fix-triton-import branch from 90fe0bf to f778898 Compare July 26, 2024 08:18
tdoublep added 2 commits July 26, 2024 04:21
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@tdoublep
Copy link
Member Author

@comaniac I did another pass through this. Changes are:

  • There is no longer any need to mock Triton.
  • Instead, I just took care to import the modules that contain Triton code only if Triton is available.
  • I decided to factor the Fp8MoEMethod into its own file to make the conditional import a bit cleaner. It can be kept inside fp8.py but then would need to be inside an if HAS_TRITON block which creates ugly indentation imo.
  • Aside from that all changes are minimal

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@tdoublep
Copy link
Member Author

/ready

@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 26, 2024
tdoublep added 3 commits July 26, 2024 06:33
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Copy link
Collaborator

@comaniac comaniac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks much clean to me. Thanks!

@@ -239,188 +241,6 @@ def apply(self,
use_per_token_if_dynamic=False)


class Fp8MoEMethod(FusedMoEMethodBase):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep Fp8MoEMethod in fp8.py instead of creating another file. Can we just lazy import fused_moe like UnquantizedFusedMoEMethod (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L91) does?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah good point. i thought it was more complicated because we need to inherit from FusedMoEMethodBase but actually that base class doesn't involve any Triton import. have pushed the change

MAX_TRITON_N_COLS = 131072


def get_num_triton_sampler_splits(n_cols: int) -> int:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off topic: @Yard1 I feel this should be a general function for all triton kernels instead of just sampler. Do you think it makes sense to rename it to get_num_triton_input_chunks so something similar, and use it here as well?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would make sense

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify: is this something you'd like to have addressed in this PR?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it's not necessary. We can merge this PR first.

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Copy link
Collaborator

@comaniac comaniac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. cc @robertgshaw2-neuralmagic

tdoublep added 5 commits July 26, 2024 13:59
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@comaniac comaniac merged commit 9a7e2d0 into vllm-project:main Jul 29, 2024
72 checks passed
tjohnson31415 added a commit to tjohnson31415/vllm that referenced this pull request Jul 30, 2024
* upstream/main: (66 commits)
  [Bugfix] Fix PaliGemma MMP (vllm-project#6930)
  [TPU] Fix greedy decoding (vllm-project#6933)
  [Kernel] Tuned int8 kernels for Ada Lovelace (vllm-project#6848)
  [Kernel] Fix marlin divide-by-zero warnings (vllm-project#6904)
  [ci] GHA workflow to remove ready label upon "/notready" comment (vllm-project#6921)
  [Kernel] Remove unused variables in awq/gemm_kernels.cu (vllm-project#6908)
  [Frontend] New `allowed_token_ids` decoding request parameter (vllm-project#6753)
  [Bugfix] Allow vllm to still work if triton is not installed. (vllm-project#6786)
  [TPU] Support tensor parallelism in async llm engine (vllm-project#6891)
  [Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel (vllm-project#6901)
  [Core] Reduce unnecessary compute when logprobs=None (vllm-project#6532)
  [Kernel] Tuned FP8 Kernels for Ada Lovelace (vllm-project#6677)
  [Model] Initialize support for InternVL2 series models (vllm-project#6514)
  [Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 (vllm-project#6871)
  Add Nemotron to PP_SUPPORTED_MODELS (vllm-project#6863)
  [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (vllm-project#6795)
  [TPU] Reduce compilation time & Upgrade PyTorch XLA version  (vllm-project#6856)
  [Docs] Add RunLLM chat widget (vllm-project#6857)
  [Model] Initial support for BLIP-2 (vllm-project#5920)
  [CI/Build][Doc] Update CI and Doc for VLM example changes (vllm-project#6860)
  ...
Duyi-Wang pushed a commit to Duyi-Wang/vllm that referenced this pull request Aug 1, 2024
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
…roject#6786)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Alvant <alvasian@yandex.ru>
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants