
Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model #1186

Merged
2 commits merged into sgl-project:main from support_qwn2 on Aug 25, 2024

Conversation

zhaochenyang20
Collaborator

@zhaochenyang20 zhaochenyang20 commented Aug 22, 2024

Motivation

Currently, SGLang supports only the e5-mistral embedding model. This PR adds the Alibaba-NLP/gte-Qwen2-7B-instruct model.

Also, SGLang previously determined whether a model is an embedding model from its hf_config.architectures. However, the gte model shares its architecture with a CausalLM model, so I added a new parameter to server_args and changed the forward function of Qwen2ForCausalLM.
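
Below is a minimal sketch of the model-type decision this change affects (my own illustration, not the actual SGLang code; the architecture set and function name are made up): the architecture list alone can no longer distinguish the gte model from a generative Qwen2 model, so an explicit is_embedding server argument overrides the check.

def is_embedding_model(hf_architectures, is_embedding_flag=False):
    # Illustrative set of architectures that unambiguously imply embedding models.
    embedding_archs = {"MistralModel", "LlamaEmbeddingModel"}
    if is_embedding_flag:
        # New explicit override coming from server_args.
        return True
    return any(arch in embedding_archs for arch in hf_architectures)

# gte-Qwen2-7B-instruct reports "Qwen2ForCausalLM", so only the flag marks it as an embedding model:
print(is_embedding_model(["Qwen2ForCausalLM"]))                          # False
print(is_embedding_model(["Qwen2ForCausalLM"], is_embedding_flag=True))  # True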

Modifications

  1. Changed the forward function of Qwen2ForCausalLM.
  2. Added a new parameter, is_embedding, to server_args.
  3. Made some related changes.
  4. Added unit tests for the gte model (in both the generation and embedding tests; the generation tests use the ROUGE-L score).
  5. Updated the README.

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

Files with review threads:

  • README.md
  • python/sglang/srt/server_args.py
  • python/sglang/srt/managers/tokenizer_manager.py
  • python/sglang/srt/model_executor/model_runner.py
  • python/sglang/srt/models/qwen2.py
  • python/sglang/srt/server.py
  • test/srt/models/test_generation_models.py
@zhaochenyang20
Collaborator Author

@Ying1123 I added gte to the generation model test. Note that I changed the prefill tolerance accordingly and switched to the ROUGE-L metric instead of asserting that output_strs match exactly.
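
For reference, here is a hedged sketch (not the PR's actual test code; the function name and threshold are illustrative) of comparing generated outputs against a ROUGE-L threshold instead of exact string equality, using the rouge_score package.

from rouge_score import rouge_scorer

def rouge_l_close(reference: str, hypothesis: str, threshold: float = 0.9) -> bool:
    # Compute the ROUGE-L F-measure between the reference and the hypothesis.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    score = scorer.score(reference, hypothesis)["rougeL"].fmeasure
    return score >= threshold

# Minor wording differences still pass, unlike an exact-match assertion.
print(rouge_l_close("The capital of France is Paris.",
                    "The capital of France is Paris"))  # True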

@merrymercy merrymercy changed the title Support Alibaba-NLP/gte-Qwen2-7B-instruct Model Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model Aug 23, 2024
@Ying1123 Ying1123 enabled auto-merge (squash) August 25, 2024 07:33
@Ying1123 Ying1123 force-pushed the support_qwn2 branch 3 times, most recently from 930e83d to efb207b on August 25, 2024 07:45
@Ying1123 Ying1123 disabled auto-merge August 25, 2024 17:26
@Ying1123 Ying1123 merged commit 30b4f77 into sgl-project:main Aug 25, 2024
0 of 4 checks passed
@Ying1123 Ying1123 mentioned this pull request Aug 25, 2024
import multiprocessing as mp

try:
    mp.set_start_method("spawn")
except RuntimeError:
    # Assumed completion of the truncated snippet: the start method may already be set.
    pass
Review comment from a collaborator:

Why would this be needed?

@llmforever

llmforever commented Aug 28, 2024

@zhaochenyang20
Hello. Using the method from this PR, the embeddings differ greatly from those produced by the original transformers and sentence-transformers backends, and the quality is poor. Could you help take a look? I have tested both the 7B and 1.5B models.
Why are the results different from the original transformers backend?

prompt = "hello world"

sglang:

import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.embeddings.create(
    model="default",
    input=prompt,
)

transformer:

import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# last_token_pool is the last-token pooling helper from the gte-Qwen2 model card (definition omitted here).
tokenizer = AutoTokenizer.from_pretrained('Alibaba-NLP/gte-Qwen2-7B-instruct', trust_remote_code=True)
model = AutoModel.from_pretrained('Alibaba-NLP/gte-Qwen2-7B-instruct', trust_remote_code=True)

max_length = 8192

batch_dict = tokenizer(prompt, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)

@zhaochenyang20 zhaochenyang20 deleted the support_qwn2 branch September 1, 2024 12:42
@zhaochenyang20
Collaborator Author

@llmforever Hello. Sorry, I hadn't noticed this before. Do you still need this fixed? We actually have a unit test for this in test/srt/models/test_embedding_models.py, and the embeddings there are indeed close.

Also, I don't understand what you mean by "perform not so well". Could you provide your test script and your serving command for SGLang?

And does e5-mistral also have this problem, or only gte?

@thomZ1

thomZ1 commented Sep 2, 2024

I have the same problem: I tried the SGLang OpenAI API and SentenceTransformer with the same prompt, but the output embeddings were different.

@zhaochenyang20
Collaborator Author

Yeah, the embeddings could differ for many reasons. @llmforever

You can check this unit test: https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_embedding_models.py

We set a tolerance value for the embedding difference.

Also, please try the e5-mistral model and give us the embedding difference.

https://huggingface.co/intfloat/e5-mistral-7b-instruct

@Ying1123 Do you think the difference provided is tolerable?
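
As a reference for quantifying "the embedding difference", here is a minimal sketch (my own illustration, not the code in test_embedding_models.py; the function name and tolerance values are made up) that compares two embedding vectors after L2 normalization:

import torch
import torch.nn.functional as F

def embeddings_close(emb_a, emb_b, cos_tol=0.99, abs_tol=1e-2):
    # Normalize both vectors, then check cosine similarity and the max element-wise difference.
    a = F.normalize(torch.as_tensor(emb_a, dtype=torch.float32), p=2, dim=-1)
    b = F.normalize(torch.as_tensor(emb_b, dtype=torch.float32), p=2, dim=-1)
    cos_sim = torch.sum(a * b).item()
    max_abs_diff = (a - b).abs().max().item()
    print(f"cosine similarity: {cos_sim:.6f}, max abs diff: {max_abs_diff:.6f}")
    return cos_sim >= cos_tol and max_abs_diff <= abs_tol

# Example usage: pass the transformers embedding and the SGLang embedding from the snippets above.
# embeddings_close(hf_embedding, sglang_embedding)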

@llmforever

llmforever commented Sep 4, 2024


I tested about 10 cases, and accuracy dropped from 80% to less than 10% in each, so I think the difference is not tolerable; but the result of the e5-mistral-7b-instruct model is the same. Can you please help look into this? Here is the code I use to generate the embeddings:

for transformer:

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

# Last-token pooling helper, as given in the gte-Qwen2 / e5-mistral model cards.
def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

input_texts = ['hello']
tokenizer = AutoTokenizer.from_pretrained('Alibaba-NLP/gte-Qwen2-7B-instruct', trust_remote_code=True)
model = AutoModel.from_pretrained('Alibaba-NLP/gte-Qwen2-7B-instruct', trust_remote_code=True)

max_length = 8192

batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)

for sglang:

import openai
import torch

client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")

input_texts = ['hello']

response = client.embeddings.create(
    model="default",
    input=input_texts,
)
embeddings = torch.tensor(response.data[0].embedding)

@zhaochenyang20
Collaborator Author

@Ying1123 I think the difference he reports is intolerable, hmm? I'm going to look into it in the next few days.
