[Model][Misc] Add e5-mistral-7b-instruct and Embedding API #3734
Merged
Changes from all commits (43 commits)
Disable KV Cache for Embedding serving and Add Embedding generation
CatherineSue 473449b
Make LlamaEmbeddingModel generate normalized embeddings
CatherineSue 07fc304
Rename BlockSpaceManagerV3 to EmbeddingModelBlockSpaceManager
CatherineSue f8fdd4f
Clean up LlamaEmbeddingModel
CatherineSue 8af04f2
Use ModelRegistry to enable ModelConfig.embedding_mode
CatherineSue e937412
Separating PoolerOutput, PoolingParams from SamplingXXX
CatherineSue e59c6a5
Separating LLM.encode() from LLM.generate()
CatherineSue 79aa971
Add tests for LlamaEmbeddingModel and OpenaiAPI server embedding
CatherineSue f002d3c
Fix errors caused by rebase
CatherineSue 97a493d
Update vllm/engine/async_llm_engine.py
robertgshaw2-neuralmagic 182ff09
Apply suggestions from code review
CatherineSue 29f888e
Resolve comments
CatherineSue a7dc484
Fix EntryPointsTest, ModelsTest and rebase
CatherineSue a744fd1
Revert `CompletionRequestOutput` to `RequestOutput`
CatherineSue 128dfdd
Update EmbeddingModelBlockSpaceManager interface
CatherineSue 25337de
Move sentence-transformers to requirements-common.txt
CatherineSue 80ed358
Fix Models Test and update interface for embedding_block_manager
CatherineSue 4936aa5
Rebase
CatherineSue 30785e6
Fix Models Test
CatherineSue 6cbd697
Merge branch 'main' into embedding
robertgshaw2-neuralmagic 1bf8531
format
robertgshaw2-neuralmagic f4c17a4
added test_embedding
robertgshaw2-neuralmagic 39b2973
added examples
robertgshaw2-neuralmagic 6bdb32e
cleanup
robertgshaw2-neuralmagic 9b7eccc
cleanup
robertgshaw2-neuralmagic 55a280e
cleanup
robertgshaw2-neuralmagic aa5c82a
cleanup
robertgshaw2-neuralmagic 0e9d79c
new line
robertgshaw2-neuralmagic cc3224f
reducing changes
robertgshaw2-neuralmagic af3ef42
simplify test changes
robertgshaw2-neuralmagic 45732b7
simplify test changes
robertgshaw2-neuralmagic 6e8243f
simplify test changes
robertgshaw2-neuralmagic acf210b
simplify test changes
robertgshaw2-neuralmagic 1801636
style for setting up embedding mode in model_config
robertgshaw2-neuralmagic d97b64d
nit on engineargs
robertgshaw2-neuralmagic 3655086
updated comment
robertgshaw2-neuralmagic 2c6ae80
cleanup
robertgshaw2-neuralmagic 9303a60
removed change from llama.py
robertgshaw2-neuralmagic 5adda0a
final review
robertgshaw2-neuralmagic 8747bf6
final review
robertgshaw2-neuralmagic 8475e5f
format
robertgshaw2-neuralmagic aba7e0c
Merge branch 'main' into embedding
robertgshaw2-neuralmagic 570b04a
Update conftest.py
robertgshaw2-neuralmagic
New file, 17 lines added (offline embedding example):

```python
from vllm import LLM

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create an LLM.
model = LLM(model="intfloat/e5-mistral-7b-instruct", enforce_eager=True)
# Generate embeddings. The output is a list of EmbeddingRequestOutputs.
outputs = model.encode(prompts)
# Print the outputs.
for output in outputs:
    print(output.outputs.embedding)  # list of 4096 floats
```
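The embedding vectors returned by `LLM.encode()` above can be compared with cosine similarity. A minimal dependency-free sketch (the `cosine_similarity` helper is illustrative, not part of the vLLM API):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 2-dim vectors standing in for two 4096-dim embeddings.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```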
New file, 23 lines added (OpenAI-compatible client example):

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

responses = client.embeddings.create(
    input=[
        "Hello my name is",
        "The best thing about vLLM is that it supports many different models",
    ],
    model=model,
)

for data in responses.data:
    print(data.embedding)  # list of floats of length 4096
```
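Per the commit "Make LlamaEmbeddingModel generate normalized embeddings" above, the returned vectors are L2-normalized, so the dot product of two embeddings is already their cosine similarity. A small sketch of that property (helper names are illustrative, not part of any library API):

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])
# A unit vector dotted with itself gives approximately 1.0, i.e. the
# cosine similarity, with no extra normalization step needed.
print(dot(a, a))
```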
New file, 44 lines added (embedding model test):

```python
"""Compare the outputs of HF and vLLM for Mistral models using greedy sampling.

Run `pytest tests/models/test_llama_embedding.py`.
"""
import pytest
import torch
import torch.nn.functional as F

MODELS = [
    "intfloat/e5-mistral-7b-instruct",
]


def compare_embeddings(embeddings1, embeddings2):
    similarities = [
        F.cosine_similarity(torch.tensor(e1), torch.tensor(e2), dim=0)
        for e1, e2 in zip(embeddings1, embeddings2)
    ]
    return similarities


@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("dtype", ["half"])
def test_models(
    hf_runner,
    vllm_runner,
    example_prompts,
    model: str,
    dtype: str,
) -> None:
    hf_model = hf_runner(model, dtype=dtype)
    hf_outputs = hf_model.encode(example_prompts)
    del hf_model

    vllm_model = vllm_runner(model, dtype=dtype)
    vllm_outputs = vllm_model.encode(example_prompts)
    del vllm_model

    similarities = compare_embeddings(hf_outputs, vllm_outputs)
    all_similarities = torch.stack(similarities)
    tolerance = 1e-2
    assert torch.all((all_similarities <= 1.0 + tolerance)
                     & (all_similarities >= 1.0 - tolerance)
                     ), f"Not all values are within {tolerance} of 1.0"
```
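The final assertion in the test can be expressed without torch; a dependency-free sketch of the same tolerance check (the `within_tolerance` name is illustrative, not from the PR):

```python
def within_tolerance(similarities, tolerance=1e-2):
    # Mirrors the torch assertion above: every HF-vs-vLLM cosine
    # similarity must lie within `tolerance` of 1.0.
    return all(1.0 - tolerance <= s <= 1.0 + tolerance for s in similarities)

print(within_tolerance([0.999, 1.0, 0.995]))  # True
print(within_tolerance([0.97, 1.0]))          # False (0.97 is outside 1e-2)
```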
@CatherineSue here is where you are running OOM.
I pushed a fix for this to your branch, but it looks like it got overridden.
The model is getting loaded at FP32 here, so it's consuming 7 × 4 = 28 GB of RAM on a machine with 24 GB of RAM.
Load the model at FP16 here and you will be good to go.
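The arithmetic behind that OOM, as a quick sketch (7B is an approximation of the model's true parameter count):

```python
params = 7_000_000_000          # ~7B parameters for e5-mistral-7b-instruct

gb_fp32 = params * 4 / 1e9      # 4 bytes per parameter at FP32
gb_fp16 = params * 2 / 1e9      # 2 bytes per parameter at FP16

print(gb_fp32)  # 28.0 -> exceeds the 24 GB available
print(gb_fp16)  # 14.0 -> fits comfortably
```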
thanks!! applied. Running CI now.
Seems the test is still failing.
I have changed the dtype to fp16.
Passed device=cpu in the SentenceTransformer init. The test now passes in Models Test.