RHOAIENG-10023: Adding procedure for Speculative Decoding and Multi-Modal Inferencing #406
Conversation
ifndef::upstream[]
* If you are using specialized {productname-short} groups, you are part of the admin group (for example, `oai-admin-group`) in OpenShift.
endif::[]
* To use the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.
* To use the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.
* If you are using the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.
- --use-v2-block-manager
----
+
Replace `[NUM_SPECULATIVE_TOKENS]` and `[NGRAM_PROMPT_LOOKUP_MAX]` with your own values.
Replace `[NUM_SPECULATIVE_TOKENS]` and `[NGRAM_PROMPT_LOOKUP_MAX]` with your own values.
Replace `<NUM_SPECULATIVE_TOKENS>` and `<NGRAM_PROMPT_LOOKUP_MAX>` with your own values.
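For context, the surrounding runtime arguments for ngram-based speculative decoding might look like the following sketch. This is an assumption, not a quote from the doc under review: the flag names are taken from vLLM v0.5.x (`--speculative-model "[ngram]"` enables ngram speculation), and the container name and model path are illustrative.

```yaml
# Hypothetical excerpt of a ServingRuntime container spec (names illustrative)
containers:
  - name: kserve-container
    args:
      - --model=/mnt/models
      - --speculative-model=[ngram]
      - --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
      - --ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX>
      - --use-v2-block-manager
```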
====
. Click Update.
+
The *Serving runtimes* page opens and shows the list of runtimes that are installed. Observe that the custom model-serving runtime you updated is shown.
The *Serving runtimes* page opens and shows the list of runtimes that are installed. Observe that the custom model-serving runtime you updated is shown.
The *Serving runtimes* page opens and shows the list of runtimes that are installed. Confirm that the custom model-serving runtime you updated is shown.
. For speculative decoding, you must additionally redeploy the `InferenceService` custom resource definition (CRD) for the vLLM model-serving runtime as follows:
.. Log in to the OpenShift CLI.
.. List the available inference services in your namespace.
.. List the available inference services in your namespace.
.. List the available inference services in your namespace:
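A minimal sketch of the listing step, assuming the `isvc` short name that the delete command in this procedure uses (namespace is a placeholder):

```shell
# List InferenceService resources in the target namespace (placeholder values)
oc get isvc -n <namespace>
```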
oc delete -n <namespace> isvc <inference-service-name>
----
Replace the placeholder values with your own.
.. Deploy the modified `InferenceService` CRD using the YAML file that you saved.
.. Deploy the modified `InferenceService` CRD using the YAML file that you saved.
.. Deploy the modified `InferenceService` CRD using the YAML file that you saved:
----
Replace placeholder values with your own.
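A sketch of the redeploy step, assuming the modified CRD was saved to a local YAML file (the file name and namespace are placeholders, not from the doc under review):

```shell
# Apply the saved InferenceService manifest (placeholder file and namespace)
oc apply -f <inference-service-file>.yaml -n <namespace>
```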
ifdef::upstream[]
. Deploy the model by using the custom runtime as described in {odhdocshome}{default-url}/serving_models/serving-large-models_serving-large-models#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Deploying models on the single-model serving platform].
. Deploy the model by using the custom runtime as described in {odhdocshome}{default-url}/serving_models/serving-large-models_serving-large-models#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Deploying models on the single-model serving platform].
+
[source]
----
curl -v https://<inference_endpoint_url>:443/v1/chat/completions
Could also add an example with the `/v1/completions` endpoint. In that case:
-d '{"model":"<model_name>", "prompt": "<text>"}' ...
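For illustration, the reviewer's suggested `/v1/completions` request might be sketched as follows. All values are placeholders, and the `Authorization` header is only relevant if the endpoint is protected; this is a sketch of the suggestion, not text from the doc:

```shell
# Placeholder /v1/completions request (endpoint, token, and payload are illustrative)
curl -v https://<inference_endpoint_url>:443/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{"model": "<model_name>", "prompt": "<text>"}'
```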
@dtrifiro AFAIK :443/v1/completions
does not handle image URLs or anything of the sort, so based on that I don't think we can add the completions endpoint.
Well, it's there and part of the OpenAI API spec, although it's being deprecated: https://platform.openai.com/docs/guides/completions
Since it should be equivalent to the chat API (albeit a bit simpler), I guess we can leave it out.
----
curl -v https://<inference_endpoint_url>:443/v1/chat/completions
-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"
Might be worth mentioning that the authorization header is only required if `--api-key` is added to the vLLM command-line arguments (`InferenceService` or `ServingRuntime`). From `vllm --help`:

--api-key API_KEY  If provided, the server will require this key to be presented in the header.

See https://docs.vllm.ai/en/v0.5.4/serving/openai_compatible_server.html and/or https://docs.vllm.ai/en/v0.5.4/serving/env_vars.html#environment-variables
Not really. This authorization is intended for the Authorino token, not for the vLLM API key. We don't document the API key for our vLLM server, so I don't think it needs to be included here; I'd keep the doc as it is.
Force-pushed from d7cd44d to 325edbb
----
+
Replace `<NUM_SPECULATIVE_TOKENS>` and `<NGRAM_PROMPT_LOOKUP_MAX>` with your own values.
. To configure the vLLM model-serving runtime for speculative decoding with a draft model, do the following:
. To configure the vLLM model-serving runtime for speculative decoding with a draft model, do the following:
. To configure the vLLM model-serving runtime for speculative decoding with a draft model:
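A sketch of the draft-model configuration, using the placeholder paths referenced later in this thread. This is an assumption about the shape of the arguments, not a quote from the diff: the flag names follow vLLM v0.5.x, and the `/mnt/models` mount point and token count are illustrative.

```yaml
# Hypothetical draft-model arguments (paths and values are placeholders)
args:
  - --model=/mnt/models/<path_to_original_model>
  - --speculative-model=/mnt/models/<path_to_speculative_model>
  - --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
  - --use-v2-block-manager
```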
|
You can configure the *vLLM ServingRuntime for KServe* runtime to use speculative decoding, a parallel processing technique to optimize inferencing time for Large Language Models (LLMs).

You can also configure the runtime to support inferencing for Vision-Language modals (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.
You can also configure the runtime to support inferencing for Vision-Language modals (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.
You can also configure the runtime to support inferencing for vision-language models (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.
----
+
Replace `<path_to_speculative_model>` and `<path_to_original_model>` with the paths to the speculative model and original model on your S3-compatible object storage. Replace all other placeholder values with your own.
. To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments as shown:
. To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments as shown:
. To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments:
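The multi-modal arguments themselves are not visible in this hunk. One plausible sketch follows; both flags exist in vLLM v0.5.x, but whether these are the exact arguments the doc adds is an assumption:

```yaml
# Hypothetical multi-modal arguments (whether the doc uses these exact flags is an assumption)
args:
  - --model=/mnt/models
  - --trust-remote-code            # required by some VLM architectures
  - --chat-template=<path_to_chat_template>  # e.g. a LLaVA Jinja template
```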
-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"
----
* If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the Vision-Language Model (VLM) that you have deployed:
* If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the Vision-Language Model (VLM) that you have deployed:
* If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the vision-language model (VLM) that you have deployed:
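For reference, a VLM request in the OpenAI-compatible chat format might look like the following sketch. All values are placeholders; the message shape follows the OpenAI vision-style `content` array that vLLM's chat endpoint accepts:

```shell
# Placeholder VLM request (endpoint, token, model, text, and image URL are illustrative)
curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
        "model": "<model_name>",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "<text>"},
            {"type": "image_url", "image_url": {"url": "<image_url>"}}
          ]
        }]
      }'
```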
Let's wait for @dtrifiro's feedback on the open comment.
Description
Adding procedure to configure speculative decoding and multi-modal inferencing for the vLLM runtime
How Has This Been Tested?
Local Build
Preview