[Core][Model] Add simple_model_runner and a new model XLMRobertaForSequenceClassification through multimodal interface #6260
base: main
Conversation
If I'm understanding this PR correctly, you are basically using the multi-modal interface to pass data directly to the model (in this case the input IDs and attention mask). Are you working towards making vLLM function out-of-the-box with generic HuggingFace models?
Left some initial comments.
Yes, you are right. In fact, I have two goals:
After I replaced XLMRobertaForSequenceClassification's query/key/value linear layers with QKVParallelLinear, I saw a 15% performance improvement.
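The fusion the comment describes can be sketched on a single GPU as follows. vLLM's actual QKVParallelLinear additionally shards the fused weight across tensor-parallel ranks; the class below is an illustrative stand-in for the core idea (one matmul instead of three), not vLLM's implementation:

```python
import torch
import torch.nn as nn


class FusedQKV(nn.Module):
    """Single-GPU sketch of fusing separate Q/K/V projections into one
    linear layer, the idea behind vLLM's QKVParallelLinear."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # One (3h x h) weight replaces three (h x h) weights.
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)

    @classmethod
    def from_separate(cls, q: nn.Linear, k: nn.Linear, v: nn.Linear) -> "FusedQKV":
        """Build the fused layer by stacking the original Q/K/V weights."""
        fused = cls(q.in_features)
        with torch.no_grad():
            fused.qkv.weight.copy_(torch.cat([q.weight, k.weight, v.weight], dim=0))
            fused.qkv.bias.copy_(torch.cat([q.bias, k.bias, v.bias], dim=0))
        return fused

    def forward(self, x: torch.Tensor):
        # One matmul, then split the result back into q, k, v.
        return self.qkv(x).chunk(3, dim=-1)
```

The speedup comes from launching a single larger GEMM kernel instead of three smaller ones per attention layer.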
Added some suggestions to improve the type annotations.
The config and multi-modal parts LGTM. I assume the model is implemented correctly since the tests pass.
However, since I'm not involved with the internals of model executor, block manager and workers, I'll leave it to @robertgshaw2-neuralmagic to review those. He will also see how to integrate this into the existing work for encoder-decoder support.
I'm going to take a look at this over the weekend. Thanks @AllenDou!
mark #6424
/ready
Hello @robertgshaw2-neuralmagic, just a friendly reminder to review this PR when you get a chance.
Thanks @AllenDou! This is on my list.
mark #6789
I think @robertgshaw2-neuralmagic has been very busy lately... Are you still interested in implementing this? Let's split the PR into smaller, more manageable parts so that @robertgshaw2-neuralmagic doesn't have to review everything at once. (By the way, please take into account the changes from #4942.)
Of course. After reviewing #4942, I realized that the nm folks have their own approach to supporting 'anymodel', so I think I should close this PR. If this feature is needed in the future, I can reopen it at any time.
Great, I'll resolve the conflicts and reopen the PR. I hope this won't take up too much of your time :)
Thanks so much, I really appreciate your patience. Resolving our performance issues has just become a huge priority.
Hi @robertgshaw2-neuralmagic, I have rebased and reopened this PR. Please take a look. The offline_inference_xlmroberta_awq.py file is a quantized version I've been working on, but I'm encountering some technical difficulties that are proving challenging to overcome, so this file can be removed if necessary. cc @DarkLight1337
offline_inference_xlmroberta_awq.py has been deleted: after hacking AutoAWQ for the XLM-Roberta model, I saw no performance benefit under vLLM serving. Trying FP8 next.
Force-pushed from 39ffa68 to 0bc847f
@AllenDou @robertgshaw2-neuralmagic @DarkLight1337
Maybe we should wait until @robertgshaw2-neuralmagic gets a chance to review this PR. |
A quick heads-up that the locations of the model tests were changed in #7820, so please merge from main.
Also @robertgshaw2-neuralmagic do you have any timeframe on when you will be available to review this PR? |
@AllenDou Hello, thank you for your contribution. When I send concurrent requests to the service, I get a series of errors, including the following. Can you please help optimize async requests? Thanks. Note:
This PR currently does not support frontend access over HTTP. By the way, the XLM-Roberta model compares the similarity of two strings, so you need to pass a tuple of two strings (string, string) as input. Please refer to examples/offline_inference_xlmroberta.py for more details.
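For context on why the input is a string pair: a RoBERTa-family cross-encoder joins the two texts with separator tokens and classifies them jointly. The helper below is a hypothetical sketch of that pair layout, not code from the PR; token IDs 0 and 2 correspond to XLM-R's `<s>`/`</s>` by convention, but verify against the actual tokenizer:

```python
def build_pair_input(tokens_a: list[int], tokens_b: list[int],
                     cls_id: int = 0, sep_id: int = 2) -> list[int]:
    """Assemble XLM-R style pair input: <s> A </s></s> B </s>.

    Both texts end up in one sequence, so the model can attend across
    the pair and score their similarity in a single forward pass.
    """
    return [cls_id] + tokens_a + [sep_id, sep_id] + tokens_b + [sep_id]
```

This is why the example script takes a (string, string) tuple rather than two independent prompts.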
@AllenDou Thanks for your reply. Can you tell me how to add support for HTTP requests on your branch, or add this function yourself if it is convenient for you? My end goal is to serve LlamaForSequenceClassification over HTTP. Referring to your branch, I have added LlamaForSequenceClassification to llama.py and implemented the forward function, but currently I cannot make a correct request over HTTP.
This PR processes input data through a multimodal interface and introduces a model mode along the lines of:
class ModelMode: [DECODER, ENCODER, ENCODER_DECODER, EMBEDDING, SIMPLE]
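The mode list above could be expressed as a plain Python enum. This is a hypothetical sketch of the proposed ModelMode, not code taken from the PR:

```python
from enum import Enum, auto


class ModelMode(Enum):
    """Sketch of the model modes named in the PR description.

    SIMPLE is the new mode this PR adds for models like
    XLMRobertaForSequenceClassification that take input IDs and an
    attention mask directly, with no KV cache or decoding loop.
    """
    DECODER = auto()
    ENCODER = auto()
    ENCODER_DECODER = auto()
    EMBEDDING = auto()
    SIMPLE = auto()
```

A runner could then branch on the mode (e.g. skip block-manager setup when `mode is ModelMode.SIMPLE`).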
I have two goals:
- CLOSE #6424
- CLOSE #6789