[Model]: Add transformers backend support #11330

Merged: 105 commits, merged Feb 3, 2025

Commits
0bb5519
Merge
ArthurZucker Dec 19, 2024
8e238f7
Merge branch 'main' into transformers-backend
ArthurZucker Dec 19, 2024
6d8f1fd
revert some changes
ArthurZucker Dec 19, 2024
fb37617
changes are now merged with main of transformers
ArthurZucker Dec 19, 2024
2d0c128
revert more changes
ArthurZucker Dec 19, 2024
31c16a1
Merge remote-tracking branch 'upstream/main' into fix-history
hmellor Jan 9, 2025
a49aa81
Undo whitespace changes
hmellor Jan 9, 2025
ae2e1cf
Merge remote-tracking branch 'upstream/main' into fix-history
hmellor Jan 13, 2025
ff19ade
Update transformers pin
hmellor Jan 13, 2025
038604b
Remove unreachable code
hmellor Jan 13, 2025
882ef81
Remove dead code
hmellor Jan 13, 2025
f254f2c
Update to latest attention interface
hmellor Jan 13, 2025
5a1a833
Always try to load `TransformersModel` if model isn't explicitly supp…
hmellor Jan 13, 2025
b7de34d
Temporarily remove Llama from registry
hmellor Jan 13, 2025
49c4616
Deduplicate registry code slightly
hmellor Jan 16, 2025
071246d
Fix profiling of Attentions
hmellor Jan 16, 2025
6190591
Run `./format.sh` on `transformers.py`
hmellor Jan 16, 2025
7ae8262
Fix spelling
hmellor Jan 16, 2025
988586d
Undo changes to `chat.py`
hmellor Jan 16, 2025
5313551
tests + md
ArthurZucker Jan 17, 2025
f127a03
test helium
ArthurZucker Jan 17, 2025
9baefd2
fix dtype issue
ArthurZucker Jan 17, 2025
aff205a
Make model implementation configurable
hmellor Jan 17, 2025
4efcac8
FIx previous commit
hmellor Jan 17, 2025
5d3afac
`format.sh`
hmellor Jan 17, 2025
20f4d48
Handle alternative vocab embed layer names
hmellor Jan 17, 2025
013f880
Undo removel of `LlamaForCausalLM`
hmellor Jan 17, 2025
19dc1f8
Add `RMSNorm` replacement
hmellor Jan 17, 2025
7b5f146
bnb and `SupportsLoRA`
ArthurZucker Jan 17, 2025
e1d1e33
:Merge branch 'fix-history' of github.com:ArthurZucker/vllm into fix-…
ArthurZucker Jan 17, 2025
c805f9d
Change log
hmellor Jan 17, 2025
aadfb1b
Formatting
hmellor Jan 17, 2025
544ba2d
Disable vLLM RMS Norm implementation for now
hmellor Jan 17, 2025
06347f8
Only throw TP error if user is trying to use TP
hmellor Jan 17, 2025
3fe40d1
Add some tests for TransformersModel
hmellor Jan 17, 2025
d37fd9b
remove replace norm, cleanup
ArthurZucker Jan 20, 2025
86dc357
Merge branch 'fix-history' of github.com:ArthurZucker/vllm into fix-h…
ArthurZucker Jan 20, 2025
4cbea32
linting and test mark
Isotr0py Jan 20, 2025
96f0a3a
revert example modification
Isotr0py Jan 20, 2025
91e6037
fix wrong llm.model
Isotr0py Jan 20, 2025
754124a
Merge remote-tracking branch 'upstream/main' into fix-history
Isotr0py Jan 20, 2025
554df59
use apply_model
Isotr0py Jan 20, 2025
319cf97
Update docs/source/models/supported_models.md
ArthurZucker Jan 20, 2025
88d679a
Merge branch 'main' into fix-history
ArthurZucker Jan 23, 2025
d346637
Update docs/source/models/supported_models.md
ArthurZucker Jan 23, 2025
0f15f09
move the check to normalized arch
ArthurZucker Jan 24, 2025
f4c41eb
Merge branch 'fix-history' of github.com:ArthurZucker/vllm into fix-h…
ArthurZucker Jan 24, 2025
2a4fc4f
fix
ArthurZucker Jan 24, 2025
ceabb51
revert try inspect changes
ArthurZucker Jan 28, 2025
50b218a
Update test
ArthurZucker Jan 28, 2025
c8aac87
style
ArthurZucker Jan 28, 2025
f6cb8fe
Merge branch 'main' into fix-history
ArthurZucker Jan 28, 2025
1896af7
style update
ArthurZucker Jan 28, 2025
b42e464
Merge branch 'fix-history' of github.com:ArthurZucker/vllm into fix-h…
ArthurZucker Jan 28, 2025
1983511
Merge branch 'main' of https://github.com/vllm-project/vllm into fix-…
ArthurZucker Jan 28, 2025
869934a
fix normalize arch
ArthurZucker Jan 28, 2025
df1c8b2
update test, fix gpu marker and remove trust remote as it's True by d…
ArthurZucker Jan 28, 2025
ffd6dce
update test
ArthurZucker Jan 28, 2025
9704287
for now use `model_config.hf_config.auto_map["AutoModel"]`
ArthurZucker Jan 28, 2025
9a871af
fix remote models
ArthurZucker Jan 28, 2025
4f33ff8
nits
ArthurZucker Jan 28, 2025
fc6a7e9
remove unused kwarg class
ArthurZucker Jan 28, 2025
0ab2f82
fix weight loading
ArthurZucker Jan 28, 2025
44f78ef
fix test
ArthurZucker Jan 28, 2025
4847836
update test!
ArthurZucker Jan 28, 2025
0b348e4
Nits
ArthurZucker Jan 28, 2025
2132dcf
update
ArthurZucker Jan 28, 2025
20bc901
remove print
ArthurZucker Jan 28, 2025
62540f2
update
ArthurZucker Jan 28, 2025
cfeaaae
Fix fallback, dict keys != attrs
hmellor Jan 28, 2025
ecf2990
cleanup
ArthurZucker Jan 28, 2025
8cbd02e
Merge branch 'fix-history' of github.com:ArthurZucker/vllm into fix-h…
ArthurZucker Jan 28, 2025
e30000d
pre-commit
ArthurZucker Jan 28, 2025
7fd638f
nit
ArthurZucker Jan 28, 2025
57c5dbf
Merge remote-tracking branch 'origin/main' into fix-history
hmellor Jan 28, 2025
e714c05
pre-commit
hmellor Jan 28, 2025
fc62d7d
Remove unused line
hmellor Jan 28, 2025
5475b5b
Remove `kv_caches` and update scale if it's passed
hmellor Jan 28, 2025
255ed6c
eager tests do work for now
ArthurZucker Jan 28, 2025
be6f244
Merge branch 'fix-history' of github.com:ArthurZucker/vllm into fix-h…
ArthurZucker Jan 28, 2025
4a855ea
Respond to comments
hmellor Jan 28, 2025
e416227
fix failing test on phi: not all remote code have AutoModel
ArthurZucker Jan 29, 2025
8e92304
Merge branch 'fix-history' of github.com:ArthurZucker/vllm into fix-h…
ArthurZucker Jan 29, 2025
7758ea2
Merge branch 'main' of https://github.com/vllm-project/vllm into fix-…
ArthurZucker Jan 29, 2025
b74886e
remove enforce eager for CI test
ArthurZucker Jan 29, 2025
5dabda8
remove BNB and LORA
ArthurZucker Jan 29, 2025
a1bd892
remove quantized test
ArthurZucker Jan 29, 2025
15327e3
update buildkite to run transformers test
ArthurZucker Jan 29, 2025
3fad390
Update vllm/model_executor/model_loader/utils.py
ArthurZucker Jan 30, 2025
5663a0c
fix pre-commit
ArthurZucker Jan 30, 2025
17c6e02
Merge branch 'fix-history' of github.com:ArthurZucker/vllm into fix-h…
ArthurZucker Jan 30, 2025
5679d4d
update
ArthurZucker Jan 30, 2025
03f1844
Fix failing registry test
hmellor Jan 30, 2025
d001748
temp: run transformers tests first
hmellor Jan 30, 2025
4741ab2
Update transformers pin in `requirements-test.txt`
hmellor Jan 30, 2025
9a29e46
update deps
ArthurZucker Jan 31, 2025
073ac5e
Merge branch 'fix-history' of github.com:ArthurZucker/vllm into fix-h…
ArthurZucker Jan 31, 2025
5f6668f
make v1 work
Isotr0py Feb 1, 2025
90be3b9
Merge branch 'main' into fix-history
Isotr0py Feb 1, 2025
95c1916
fix custom model test
Isotr0py Feb 1, 2025
2906626
fix incorrect backend fallback
Isotr0py Feb 1, 2025
8c33bd6
fix oot registration test
Isotr0py Feb 1, 2025
ccbff79
add transformers tp test
Isotr0py Feb 1, 2025
3647766
Update vllm/model_executor/model_loader/utils.py
Isotr0py Feb 2, 2025
f68af01
clean up
Isotr0py Feb 2, 2025
1 change: 1 addition & 0 deletions .buildkite/test-pipeline.yaml
@@ -349,6 +349,7 @@ steps:
- vllm/
- tests/models
commands:
- pytest -v -s models/test_transformers.py
- pytest -v -s models/test_registry.py
- pytest -v -s models/test_initialization.py

76 changes: 76 additions & 0 deletions docs/source/models/supported_models.md
@@ -40,6 +40,82 @@ If vLLM successfully returns text (for generative models) or hidden states (for
Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.

### Transformers fallback

After the merge of <gh-pr:11330>, `vllm` can fall back to models that are available in `transformers`. This does not work for every model yet, but most decoder-only language models are supported, and vision-language model support is planned!

To check whether the backend in use is `transformers`, you can run the following:

```python
from vllm import LLM
llm = LLM(model=..., task="generate") # Name or path of your model
llm.apply_model(lambda model: print(model.__class__))
```

If the printed class is `TransformersModel`, the model is running on the `transformers` backend!
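
You can also skip the automatic detection and force a particular implementation with the `model_impl` argument introduced by this PR. A minimal sketch (the model name is simply the one used in the tests below; substitute your own):

```python
from vllm import LLM

# "auto" (the default) tries the vLLM implementation first and falls back to
# Transformers; "transformers" forces the Transformers implementation;
# "vllm" forces the vLLM one.
llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",  # any supported name or path
    task="generate",
    model_impl="transformers",
)
llm.apply_model(lambda model: print(model.__class__))
```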

#### Supported features

##### LoRA and quantization

Neither is supported yet! Feel free to open an issue and we'll work on it together with the `transformers` team!

Usually, `transformers` models load adapter weights via the `load_adapter` API, which depends on PEFT. Some work is needed to either use this API (for now it would leave some weights unmarked as loaded) or to replace the relevant modules accordingly.

A hint of what this could look like:

```python
class TransformersModel(nn.Module, SupportsLoRA):

    def __init__(self, *, vllm_config, prefix=""):
        ...
        # Load the adapter weights on top of the base model via PEFT.
        self.model.load_adapter(
            vllm_config.load_config.model_loader_extra_config[
                "qlora_adapter_name_or_path"])
```

The blocker is that you currently need to specify the supported LoRA layers (see the sketch below for what that looks like today), whereas ideally we would load whatever is inside the checkpoint!
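
For context, existing vLLM models that implement `SupportsLoRA` declare their LoRA-capable layers up front. The sketch below mirrors that pattern; the class name is made up and the exact attribute set is illustrative rather than a finalized interface for `TransformersModel`, but it shows the hard-coding we would like to avoid:

```python
from torch import nn

from vllm.model_executor.models.interfaces import SupportsLoRA


class MyVllmModel(nn.Module, SupportsLoRA):
    # Today, LoRA-capable vLLM models enumerate their target layers by hand.
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "gate_up_proj": ["gate_proj", "up_proj"],
    }
    supported_lora_modules = [
        "qkv_proj",
        "o_proj",
        "gate_up_proj",
        "down_proj",
    ]
    embedding_modules = {}
    embedding_padding_modules = []
```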

##### Remote code

This fallback also means that any model on the Hub that works in `transformers` with `trust_remote_code=True` and correctly implements attention can be used in production!

```python
from vllm import LLM
llm = LLM(model=..., task="generate", trust_remote_code=True) # Name or path of your model
llm.apply_model(lambda model: print(model.__class__))
```

A model just needs the following two things:

```python
from torch import nn
from transformers import PreTrainedModel
# `ALL_ATTENTION_FUNCTIONS` lives in `transformers.modeling_utils`
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


class MyAttention(nn.Module):

    def forward(self, hidden_states, **kwargs):  # <- kwargs are required
        ...
        attention_interface = ALL_ATTENTION_FUNCTIONS[
            self.config._attn_implementation]
        attn_output, attn_weights = attention_interface(
            self,
            query_states,
            key_states,
            value_states,
            **kwargs,
        )
        ...


class MyModel(PreTrainedModel):
    _supports_attention_backend = True
```

Here is what happens in the background:

1. The config is loaded.
2. The `MyModel` Python class is loaded from the `auto_map`, and we check that the model sets `_supports_attention_backend`.
3. The `TransformersModel` backend is used. See `vllm/model_executor/models/transformers.py`, which leverages `self.config._attn_implementation = "vllm"`; this is why `ALL_ATTENTION_FUNCTIONS` is needed (a sketch of the registration pattern follows below).

That's it!
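
To make step 3 concrete, here is a minimal, self-contained sketch of how the attention-interface registry works in recent `transformers` releases. The `"my_backend"` name and the toy attention function are purely illustrative (vLLM registers its own `"vllm"` function internally); real implementations also handle masking, GQA, dropout, and the tensor layout expected by the calling module:

```python
import torch
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


def my_attention(module, query, key, value, attention_mask=None, **kwargs):
    # Toy scaled dot-product attention matching the
    # (attn_output, attn_weights) contract used by `MyAttention` above.
    scale = query.shape[-1]**-0.5
    scores = torch.matmul(query, key.transpose(-1, -2)) * scale
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, value), None


# Register the function under a custom name; a module written like
# `MyAttention` above will pick it up whenever the config requests it,
# e.g. via `config._attn_implementation = "my_backend"`.
ALL_ATTENTION_FUNCTIONS["my_backend"] = my_attention
```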

### ModelScope

To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFace Hub, set an environment variable:
2 changes: 1 addition & 1 deletion requirements-common.txt
@@ -5,7 +5,7 @@ requests >= 2.26.0
tqdm
blake3
py-cpuinfo
transformers >= 4.45.2 # Required for Llama 3.2 and Qwen2-VL.
transformers >= 4.48.2 # Required for Transformers model.
tokenizers >= 0.19.1 # Required for Llama 3.
protobuf # Required by LlamaTokenizer.
fastapi >= 0.107.0, < 0.113.0; python_version < '3.9'
2 changes: 1 addition & 1 deletion requirements-test.txt
@@ -617,7 +617,7 @@ tqdm==4.66.6
# transformers
tqdm-multiprocess==0.0.11
# via lm-eval
transformers==4.47.0
transformers==4.48.2
# via
# genai-perf
# lm-eval
5 changes: 5 additions & 0 deletions tests/models/registry.py
@@ -279,12 +279,17 @@ def check_available_online(
speculative_model="ibm-fms/llama-160m-accelerator"), # noqa: E501
}

_FALLBACK_MODEL = {
    "TransformersModel": _HfExamplesInfo("ArthurZ/Ilama-3.2-1B", trust_remote_code=True), # noqa: E501
}

_EXAMPLE_MODELS = {
    **_TEXT_GENERATION_EXAMPLE_MODELS,
    **_EMBEDDING_EXAMPLE_MODELS,
    **_CROSS_ENCODER_EXAMPLE_MODELS,
    **_MULTIMODAL_EXAMPLE_MODELS,
    **_SPECULATIVE_DECODING_EXAMPLE_MODELS,
    **_FALLBACK_MODEL,
}


75 changes: 75 additions & 0 deletions tests/models/test_transformers.py
@@ -0,0 +1,75 @@
"""Test the functionality of the Transformers backend.
Run `pytest tests/models/test_transformers.py`.
"""
from contextlib import nullcontext
from typing import Type

import pytest

from ..conftest import HfRunner, VllmRunner
from ..utils import multi_gpu_test
from .utils import check_logprobs_close


def check_implementation(
hf_runner: Type[HfRunner],
vllm_runner: Type[VllmRunner],
example_prompts: list[str],
model: str,
**kwargs,
):
max_tokens = 32
num_logprobs = 5

with vllm_runner(model, **kwargs) as vllm_model:
vllm_outputs = vllm_model.generate_greedy_logprobs(
example_prompts, max_tokens, num_logprobs)

with hf_runner(model) as hf_model:
hf_outputs = hf_model.generate_greedy_logprobs_limit(
example_prompts, max_tokens, num_logprobs)

check_logprobs_close(
outputs_0_lst=hf_outputs,
outputs_1_lst=vllm_outputs,
name_0="hf",
name_1="vllm",
)


@pytest.mark.parametrize(
"model,model_impl",
[
("meta-llama/Llama-3.2-1B-Instruct", "transformers"),
("openai-community/gpt2", "transformers"),
("ArthurZ/Ilama-3.2-1B", "auto"), # CUSTOM CODE
("meta-llama/Llama-3.2-1B-Instruct", "auto"),
]) # trust_remote_code=True by default
def test_models(hf_runner, vllm_runner, example_prompts, model,
model_impl) -> None:

maybe_raises = nullcontext()
if model == "openai-community/gpt2" and model_impl == "transformers":
# Model is not backend compatible
maybe_raises = pytest.raises(
ValueError,
match="The Transformers implementation.*not compatible with vLLM")

with maybe_raises:
check_implementation(hf_runner,
vllm_runner,
example_prompts,
model,
model_impl=model_impl)


@multi_gpu_test(num_gpus=2)
def test_distributed(
hf_runner,
vllm_runner,
example_prompts,
):
kwargs = {"model_impl": "transformers", "tensor_parallel_size": 2}
check_implementation(hf_runner, vllm_runner, example_prompts,
"meta-llama/Llama-3.2-1B-Instruct", **kwargs)
14 changes: 14 additions & 0 deletions vllm/config.py
@@ -81,6 +81,12 @@ def compute_hash(self) -> str:
...


class ModelImpl(str, enum.Enum):
    AUTO = "auto"
    VLLM = "vllm"
    TRANSFORMERS = "transformers"


class ModelConfig:
    """Configuration for the model.

@@ -165,6 +171,12 @@ class ModelConfig:
            `logits_processors` extra completion argument. Defaults to None,
            which allows no processors.
        generation_config: Configuration parameter file for generation.
        model_impl: Which implementation of the model to use:
            "auto" will try to use the vLLM implementation if it exists and
            fall back to the Transformers implementation if no vLLM
            implementation is available.
            "vllm" will use the vLLM model implementation.
            "transformers" will use the Transformers model implementation.
        override_generation_config: Override the generation config with the
            given config.
    """
@@ -228,6 +240,7 @@ def __init__(
        generation_config: Optional[str] = None,
        enable_sleep_mode: bool = False,
        override_generation_config: Optional[Dict[str, Any]] = None,
        model_impl: Union[str, ModelImpl] = ModelImpl.AUTO,
    ) -> None:
        self.model = model
        self.tokenizer = tokenizer
@@ -239,6 +252,7 @@ def __init__(
        self.code_revision = code_revision
        self.rope_scaling = rope_scaling
        self.rope_theta = rope_theta
        self.model_impl = model_impl

        if hf_overrides is None:
            hf_overrides = {}
22 changes: 18 additions & 4 deletions vllm/engine/arg_utils.py
@@ -11,10 +11,10 @@
from vllm.config import (CacheConfig, CompilationConfig, ConfigFormat,
DecodingConfig, DeviceConfig, HfOverrides,
KVTransferConfig, LoadConfig, LoadFormat, LoRAConfig,
ModelConfig, ObservabilityConfig, ParallelConfig,
PoolerConfig, PromptAdapterConfig, SchedulerConfig,
SpeculativeConfig, TaskOption, TokenizerPoolConfig,
VllmConfig)
ModelConfig, ModelImpl, ObservabilityConfig,
ParallelConfig, PoolerConfig, PromptAdapterConfig,
SchedulerConfig, SpeculativeConfig, TaskOption,
TokenizerPoolConfig, VllmConfig)
from vllm.executor.executor_base import ExecutorBase
from vllm.logger import init_logger
from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS
@@ -197,6 +197,7 @@ class EngineArgs:
    generation_config: Optional[str] = None
    override_generation_config: Optional[Dict[str, Any]] = None
    enable_sleep_mode: bool = False
    model_impl: str = "auto"

    calculate_kv_scales: Optional[bool] = None

@@ -376,6 +377,18 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
            'qualified names that can be passed with the `logits_processors` '
            'extra completion argument. Defaults to None, which allows no '
            'processors.')
        parser.add_argument(
            '--model-impl',
            type=str,
            default=EngineArgs.model_impl,
            choices=[f.value for f in ModelImpl],
            help='Which implementation of the model to use.\n\n'
            '* "auto" will try to use the vLLM implementation if it exists '
            'and fall back to the Transformers implementation if no vLLM '
            'implementation is available.\n'
            '* "vllm" will use the vLLM model implementation.\n'
            '* "transformers" will use the Transformers model '
            'implementation.\n')
        # Parallel arguments
        parser.add_argument(
            '--distributed-executor-backend',
@@ -1016,6 +1029,7 @@ def create_model_config(self) -> ModelConfig:
            generation_config=self.generation_config,
            override_generation_config=self.override_generation_config,
            enable_sleep_mode=self.enable_sleep_mode,
            model_impl=self.model_impl,
        )

    def create_load_config(self) -> LoadConfig:
47 changes: 45 additions & 2 deletions vllm/model_executor/model_loader/utils.py
@@ -1,17 +1,22 @@
"""Utilities for selecting and loading models."""
import contextlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Type
from typing import Dict, List, Optional, Tuple, Type

import torch
import transformers
from torch import nn
from transformers.dynamic_module_utils import get_class_from_dynamic_module

from vllm.config import ModelConfig
from vllm.config import ModelConfig, ModelImpl
from vllm.logger import init_logger
from vllm.model_executor.models import ModelRegistry
from vllm.model_executor.models.adapters import (as_classification_model,
                                                  as_embedding_model,
                                                  as_reward_model)

logger = init_logger(__name__)


@contextlib.contextmanager
def set_default_torch_dtype(dtype: torch.dtype):
@@ -22,6 +27,16 @@ def set_default_torch_dtype(dtype: torch.dtype):
    torch.set_default_dtype(old_dtype)


def is_transformers_impl_compatible(
        arch: str,
        module: Optional[transformers.PreTrainedModel] = None) -> bool:
    mod = module if module is not None else getattr(transformers, arch)
    if hasattr(mod, "supports_backend"):
        return mod.is_backend_compatible()
    else:
        return mod._supports_flex_attn


def get_model_architecture(
        model_config: ModelConfig) -> Tuple[Type[nn.Module], str]:
    architectures = getattr(model_config.hf_config, "architectures", [])
@@ -37,6 +52,34 @@ def get_model_architecture(
and "MixtralForCausalLM" in architectures):
architectures = ["QuantMixtralForCausalLM"]

vllm_supported_archs = ModelRegistry.get_supported_archs()
for i, arch in enumerate(architectures):
if arch == "TransformersModel":
continue
custom_module = None
auto_map = getattr(model_config.hf_config, "auto_map", None)
if auto_map is not None and hasattr(auto_map, "AutoModel"):
custom_module = get_class_from_dynamic_module(
model_config.hf_config.auto_map["AutoModel"],
model_config.model)
if model_config.model_impl == ModelImpl.TRANSFORMERS:
if not is_transformers_impl_compatible(arch, custom_module):
raise ValueError(
f"The Transformers implementation of {arch} is not "
"compatible with vLLM.")
architectures[i] = "TransformersModel"
if (model_config.model_impl == ModelImpl.AUTO
and arch not in vllm_supported_archs):
if not is_transformers_impl_compatible(arch, custom_module):
raise ValueError(
f"{arch} has no vLLM implementation and the Transformers "
"implementation is not compatible with vLLM.")
logger.warning(
"%s has no vLLM implementation, falling back to Transformers "
"implementation. Some features may not be supported and "
"performance may not be optimal.", arch)
architectures[i] = "TransformersModel"

model_cls, arch = ModelRegistry.resolve_model_cls(architectures)
if model_config.task == "embed":
model_cls = as_embedding_model(model_cls)