
[Core][Model] Add simple_model_runner and a new model XLMRobertaForSequenceClassification through multimodal interface #6260

Open
wants to merge 3 commits into main
Conversation

AllenDou
Contributor

@AllenDou AllenDou commented Jul 9, 2024

This PR:

  1. Add a new model, XLMRobertaForSequenceClassification (for RAG scenarios; see https://github.com/huggingface/transformers/blob/v4.42.3/src/transformers/models/xlm_roberta/modeling_xlm_roberta.py),
    which processes input data through the multimodal interface like the following:
outputs = llm.process([{
    "prompt": prompt,
    "multi_modal_data": {
        "xlmroberta": inputs,
    }
}])
  2. Introduce a ModelMode class with modes [DECODER, ENCODER, ENCODER_DECODER, EMBEDDING, SIMPLE].
  3. Add a new model runner (simple_model_runner.py). I didn't reuse model_runner.py or embedding_model_runner.py because they include procedures like logits(), sample(), and pooling() that we don't need. Maybe we could find a way to create a more general model runner that supports 'any' model (see the sketch after this list).
  4. Add SimpleSequenceGroupOutput, SimpleRequestOutput, and SimpleOutput for generic output.
  5. Tensor parallelism is enabled, but no performance benefit has been observed so far.
  6. Test case/example added.
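
As a rough illustration of items 2-4, the new mode and output types amount to something like the minimal Python sketch below. The names here are illustrative, not necessarily the exact identifiers in the diff:

from enum import Enum, auto

class ModelMode(Enum):
    # Modes listed in the PR description; SIMPLE bypasses sampling/pooling.
    DECODER = auto()
    ENCODER = auto()
    ENCODER_DECODER = auto()
    EMBEDDING = auto()
    SIMPLE = auto()

class SimpleOutput:
    """Thin container for raw model outputs, with no logits/sample/pooling step."""
    def __init__(self, data):
        self.data = data

A SimpleModelRunner in this spirit only runs the forward pass and wraps the result in a SimpleOutput, which is why the existing model_runner.py and embedding_model_runner.py were not reused.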

I have two goals:

  1. To find a way to enable vLLM to integrate 'any' model.
  2. To replace these models' layers with vllm/layers/* as well as TP/PP to achieve performance improvements.

Closes #6424
Closes #6789

@DarkLight1337 DarkLight1337 self-assigned this Jul 9, 2024
@DarkLight1337
Member

If I'm understanding this PR correctly, you are basically using the multi-modal interface to pass data directly to the model (in this case the input IDs and attention mask). Are you working towards making vLLM function out-of-the-box with generic HuggingFace models?

Member

@DarkLight1337 DarkLight1337 left a comment


Left some initial comments.

@AllenDou
Contributor Author

AllenDou commented Jul 9, 2024

If I'm understanding this PR correctly, you are basically using the multi-modal interface to pass data directly to the model (in this case the input IDs and attention mask). Are you working towards making vLLM function out-of-the-box with generic HuggingFace models?

Yes, you are right. In fact, I have two goals:

  1. To find a way to enable vllm to integrate 'any' model.
  2. To replace these models' layers with vllm/layers/* as well as TP/PP to achieve performance improvement.

After I replaced XLMRobertaForSequenceClassification's query/key/value linear with QKVParallelLinear, I saw a 15% performance improvement.
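
For context, that substitution follows vLLM's usual pattern of fusing the three projections into a single column-parallel layer. The sketch below is only an approximation of such a change under the assumption of a standard self-attention block with equal numbers of query and key/value heads; the exact constructor arguments of QKVParallelLinear can vary across vLLM versions, the layer requires vLLM's distributed state to be initialized, and this is not the exact code from this PR:

from torch import nn
from vllm.model_executor.layers.linear import QKVParallelLinear

class FusedQKVProjection(nn.Module):
    # Before: three separate nn.Linear layers for query/key/value.
    # After: one fused, tensor-parallel projection.
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.head_dim = hidden_size // num_heads
        self.qkv_proj = QKVParallelLinear(
            hidden_size,
            self.head_dim,
            num_heads,
            bias=True,
        )

    def forward(self, hidden_states):
        # vLLM linear layers return a (output, output_bias) tuple.
        qkv, _ = self.qkv_proj(hidden_states)
        # Split the fused projection back into q, k, v along the last dim
        # (valid because num_kv_heads == num_heads for XLM-Roberta).
        q, k, v = qkv.chunk(3, dim=-1)
        return q, k, v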

Member

@DarkLight1337 DarkLight1337 left a comment


Added some suggestions to improve the type annotations.

Member

@DarkLight1337 DarkLight1337 left a comment


The config and multi-modal parts LGTM. I assume the model is implemented correctly since the tests pass.

However, since I'm not involved with the internals of model executor, block manager and workers, I'll leave it to @robertgshaw2-neuralmagic to review those. He will also see how to integrate this into the existing work for encoder-decoder support.

@robertgshaw2-neuralmagic
Collaborator

I'm going to take a look at this over the weekend. Thanks, @AllenDou!

@AllenDou
Contributor Author

mark #6424

@AllenDou
Contributor Author

/ready

@github-actions github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Jul 16, 2024
@AllenDou
Contributor Author

Hello @robertgshaw2-neuralmagic, just a friendly reminder to review this PR when you get a chance.
We want to enable more models through this feature. Thanks a lot!

@robertgshaw2-neuralmagic
Collaborator

Thanks @AllenDou! This is on my list.

@AllenDou
Contributor Author

mark #6789

@AllenDou
Contributor Author

mark #1187 #205

@AllenDou AllenDou closed this Aug 7, 2024
@DarkLight1337
Member

I think @robertgshaw2-neuralmagic has been very busy lately... Are you still interested in implementing this? Let's split the PR into smaller parts that are more manageable so that @robertgshaw2-neuralmagic doesn't have to review everything at once.

(By the way, please take into account the changes from #4942.)

@AllenDou
Contributor Author

AllenDou commented Aug 7, 2024

I think @robertgshaw2-neuralmagic has been very busy lately... Are you still interested in implementing this? Let's split the PR into smaller parts that are more manageable so that @robertgshaw2-neuralmagic doesn't have to review everything at once.

(By the way, please take into account the changes from #4942.)

Of course. After reviewing #4942, I realized that the Neural Magic folks have their own approach to supporting 'any model', so I think I should close this PR. If this feature is needed in the future, I can reopen it at any time.

@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Aug 7, 2024

Hey @AllenDou, I'm really sorry for the delay here; I've had no bandwidth over the past month.

#4942 is only for encoder-decoder, which is separate from the feature here, so I do not think this work is a duplicate

I’m happy to pick this back up if you want to reopen the PR

@AllenDou
Contributor Author

AllenDou commented Aug 7, 2024

Hey @AllenDou, I'm really sorry for the delay here; I've had no bandwidth over the past month.

#4942 is only for encoder-decoder, which is separate from the feature here, so I do not think this work is a duplicate

I’m happy to pick this back up if you want to reopen the PR

Great, I'll resolve the conflicts and reopen the PR. I hope this won't take up too much of your time :)

@robertgshaw2-neuralmagic
Collaborator

Thanks so much, I really appreciate your patience. Resolving our performance issues has just become a huge priority.

@AllenDou
Contributor Author

AllenDou commented Aug 8, 2024

Hi @robertgshaw2-neuralmagic , I have rebased and reopened this PR. Please take a look at it. The offline_inference_xlmroberta_awq.py file is a quantized version I've been working on, but I'm encountering some technical difficulties that are proving challenging to overcome. So, this file could be removed if necessary. cc @DarkLight1337

@AllenDou
Contributor Author

AllenDou commented Aug 9, 2024

offline_inference_xlmroberta_awq.py has been deleted; after hacking AutoAWQ for the XLM-Roberta model, I saw no performance benefit under vLLM serving. I'm now trying FP8.

@AllenDou AllenDou force-pushed the bge_onemb2 branch 3 times, most recently from 39ffa68 to 0bc847f on August 14, 2024
[Core][Model] Add simple_model_runner and a new model XLMRobertaForSequenceClassification through multimodal interface
@prashil1996
Copy link

@AllenDou @robertgshaw2-neuralmagic @DarkLight1337
Is there any timeline on when this could be delivered?

@AllenDou
Contributor Author

AllenDou commented Sep 1, 2024

@AllenDou @robertgshaw2-neuralmagic @DarkLight1337 Is there any timeline on when this could be delivered?

Maybe we should wait until @robertgshaw2-neuralmagic gets a chance to review this PR.

@DarkLight1337
Member

A quick heads-up: the locations of the model tests were changed in #7820, so please merge from main.

@DarkLight1337
Member

Also, @robertgshaw2-neuralmagic, do you have any timeframe for when you will be available to review this PR?

@fan-niu

fan-niu commented Sep 25, 2024

@AllenDou Hello, thank you for your contribution. When I send concurrent requests to the service, I get a series of errors, including the following:

File "/share5/projects/llm/finetune-accelerate/inference/tensorrtllm-experiment/sglang/code/vllm_test/vllm_sqclassification/vllm/vllm/model_executor/models/xlmroberta.py", line 721, in forward
    batch_size, seq_length = input_shape
ValueError: not enough values to unpack (expected 2, got 1)

Can you please help make async requests work? Thanks.

Note:
I use http://127.0.0.1:8081/v1/completions with concurrency = 2 and the following payload:
{
    "prompt": "what time is it",
    "model": MODEL_NAME,
    "stop": "<\s>",
    "max_tokens": 128,
    "stream": False,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "temperature": 0.0,
    "top_p": 1,
}

@AllenDou
Contributor Author

@AllenDou Hello, thank you for your contribution. When I send concurrent requests to the service, I get a series of errors, including the following: File "/share5/projects/llm/finetune-accelerate/inference/tensorrtllm-experiment/sglang/code/vllm_test/vllm_sqclassification/vllm/vllm/model_executor/models/xlmroberta.py", line 721, in forward batch_size, seq_length = input_shape ValueError: not enough values to unpack (expected 2, got 1)

Can you please help make async requests work? Thanks.

Note: I use http://127.0.0.1:8081/v1/completions with concurrency = 2 and payload: { "prompt": "what time is it", "model": MODEL_NAME, "stop": "<\s>", "max_tokens": 128, "stream": False, "presence_penalty": 0, "frequency_penalty": 0, "temperature": 0.0, "top_p": 1 }

This PR does not yet support frontend access over HTTP. By the way, this XLM-Roberta model scores the similarity of two strings, so you need to pass a tuple containing two strings (string, string) as input. Please refer to examples/offline_inference_xlmroberta.py for more details.
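
For reference, here is a minimal sketch of feeding such a pair through the process() call shown in the PR description. The checkpoint name and preprocessing are assumptions for illustration (examples/offline_inference_xlmroberta.py is the authoritative version), and `llm` is assumed to be a vllm.LLM instance constructed as in that example script:

from transformers import AutoTokenizer

# Assumed checkpoint: BAAI/bge-reranker-large ships an
# XLMRobertaForSequenceClassification head for (query, passage) scoring.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-large")
query = "what is panda?"
passage = "The giant panda is a bear species endemic to China."
inputs = tokenizer(query, passage, return_tensors="pt")

# `llm` is a vllm.LLM built as in the example script.
outputs = llm.process([{
    "prompt": query,
    "multi_modal_data": {
        "xlmroberta": inputs,
    }
}])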

@fan-niu

fan-niu commented Sep 25, 2024

@AllenDou Thanks for your reply. Can you tell me how to add support for HTTP requests on your branch? Or, if it's convenient for you, could you add this functionality? My end goal is to serve llamaForSequenceClassification over HTTP requests; I have added llamaForSequenceClassification to llama.py following your branch and implemented the forward function, but I currently cannot make a correct request over HTTP.
