RHOAIENG-10023: Adding procedure for Speculative Decoding and Multi-Modal Inferencing #406

Merged: 4 commits, Aug 19, 2024
6 changes: 6 additions & 0 deletions assemblies/serving-large-models.adoc
@@ -34,6 +34,12 @@ In the single-model serving platform, you can view performance metrics for a spe

include::modules/viewing-performance-metrics-for-deployed-model.adoc[leveloffset=+2]

== Optimizing model-serving runtimes

You can optionally enhance the preinstalled model-serving runtimes available in {productname-short} to take advantage of additional capabilities, such as optimized inferencing, reduced latency, and fine-tuned resource allocation.

include::modules/optimizing-the-vllm-runtime.adoc[leveloffset=+2]

== Performance tuning on the single-model serving platform
Certain performance issues might require you to tune the parameters of your inference service or model-serving runtime.

174 changes: 174 additions & 0 deletions modules/optimizing-the-vllm-runtime.adoc
@@ -0,0 +1,174 @@
:_module-type: PROCEDURE

[id="optimizing-the-vllm-runtime_{context}"]
= Optimizing the vLLM model-serving runtime

[role='_abstract']
You can configure the *vLLM ServingRuntime for KServe* runtime to use speculative decoding, a parallel processing technique that optimizes inferencing time for large language models (LLMs).

You can also configure the runtime to support inferencing for vision-language models (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.

To configure the *vLLM ServingRuntime for KServe* runtime for speculative decoding or multi-modal inferencing, you must add additional arguments to the vLLM model-serving runtime.

.Prerequisites

* You have logged in to {productname-long}.
ifdef::upstream[]
* If you are using specialized {productname-short} groups, you are part of the admin group (for example, `odh-admin-group`) in OpenShift.
endif::[]
ifndef::upstream[]
* If you are using specialized {productname-short} groups, you are part of the admin group (for example, `oai-admin-group`) in OpenShift.
endif::[]
* If you are using the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.
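+
For example, a bucket layout like the following satisfies this requirement (the bucket and folder names are illustrative only):
+
[source]
----
<s3_bucket>/models/
├── original-model/    # full-size model
└── draft-model/       # smaller speculative (draft) model
----
+
If your data connection points to the parent folder, both models are available under `/mnt/models/` in the runtime pod, which matches the paths used later in this procedure.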


.Procedure
. From the {productname-short} dashboard, click *Settings* > *Serving runtimes*.
+
The *Serving runtimes* page opens and shows the model-serving runtimes that are already installed and enabled.
. Based on the runtime that you used to deploy your model, perform one of the following actions:
+
ifdef::upstream[]
* If you used the pre-installed *vLLM ServingRuntime for KServe* runtime, duplicate the runtime to create a custom version and then follow the remainder of this procedure. For more information about duplicating the pre-installed vLLM runtime, see {odhdocshome}{default-format-url}/serving_models/serving-large-models_serving-large-models#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Adding a custom model-serving runtime for the single-model serving platform].
endif::[]
ifndef::upstream[]
* If you used the pre-installed *vLLM ServingRuntime for KServe* runtime, duplicate the runtime to create a custom version and then follow the remainder of this procedure. For more information about duplicating the pre-installed vLLM runtime, see {rhoaidocshome}{default-format-url}/serving_models/serving-large-models_serving-large-models#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Adding a custom model-serving runtime for the single-model serving platform].
endif::[]
* If you were already using a custom vLLM runtime, click the action menu (⋮) next to the runtime and select *Edit*.
+
The embedded YAML editor opens and shows the contents of the custom model-serving runtime.
. To configure the vLLM model-serving runtime for speculative decoding by matching n-grams in the prompt, add the following arguments:
+
[source]
----
containers:
- args:
- --speculative-model=[ngram]
- --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
- --ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX>
- --use-v2-block-manager
----
+
Replace `<NUM_SPECULATIVE_TOKENS>` and `<NGRAM_PROMPT_LOOKUP_MAX>` with your own values.
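+
For example, with illustrative values, the arguments might look like the following sketch. The values `5` and `4` are placeholders only; tune them for your model and workload:
+
[source]
----
containers:
- args:
  # Illustrative values; adjust for your model and workload.
  - --speculative-model=[ngram]
  - --num-speculative-tokens=5
  - --ngram-prompt-lookup-max=4
  - --use-v2-block-manager
----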
. To configure the vLLM model-serving runtime for speculative decoding with a draft model:
.. Remove the `--model` argument:
+
[source]
----
containers:
- args:
- --model=/mnt/models
----
.. Add the following arguments:
+
[source]
----
containers:
- args:
- --port=8080
- --served-model-name={{.Name}}
- --distributed-executor-backend=mp
- --model=/mnt/models/<path_to_original_model>
- --speculative-model=/mnt/models/<path_to_speculative_model>
- --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
- --use-v2-block-manager
----
+
Replace `<path_to_speculative_model>` and `<path_to_original_model>` with the paths to the speculative model and original model on your S3-compatible object storage. Replace all other placeholder values with your own.
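+
For example, assuming the illustrative folder names `original-model` and `draft-model` described in the prerequisites, the arguments might look like the following sketch:
+
[source]
----
containers:
- args:
  - --port=8080
  - --served-model-name={{.Name}}
  - --distributed-executor-backend=mp
  # The paths below assume the illustrative S3 folder layout; replace with your own.
  - --model=/mnt/models/original-model
  - --speculative-model=/mnt/models/draft-model
  - --num-speculative-tokens=5
  - --use-v2-block-manager
----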
. To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments:
+
[source]
----
containers:
- args:
- --trust-remote-code
----
+
[NOTE]
====
Only use the `--trust-remote-code` argument with models from trusted sources.
====
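+
For example, a custom runtime configured to serve a vision-language model from a trusted source might combine the argument with the existing runtime defaults, as in the following illustrative sketch:
+
[source]
----
containers:
- args:
  # Existing defaults from the vLLM runtime, shown here for context only.
  - --port=8080
  - --served-model-name={{.Name}}
  - --model=/mnt/models
  # Allows the runtime to load custom modeling code shipped with the model.
  - --trust-remote-code
----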
. Click *Update*.
+
The *Serving runtimes* page opens and shows the list of runtimes that are installed. Confirm that the custom model-serving runtime you updated is shown.
. For speculative decoding, you must also redeploy the `InferenceService` custom resource (CR) for the vLLM model-serving runtime as follows:
.. Log in to the OpenShift CLI.
.. List the available inference services in your namespace:
+
[source]
----
oc get -n <namespace> isvc
----
+
Note the name of the `InferenceService` that you need to redeploy.
.. Save the `InferenceService` manifest to a YAML file:
+
[source]
----
oc get -n <namespace> isvc <inference-service-name> -o yaml > inferenceservice.yml
----
+
Replace the placeholder values with your own.
.. Delete the existing `InferenceService` CR:
+
[source]
----
oc delete -n <namespace> isvc <inference-service-name>
----
+
Replace the placeholder values with your own.
.. Redeploy the `InferenceService` CR by using the YAML file that you saved:
+
[source]
----
oc apply -f inferenceservice.yml
----
.. Optional: Check the status of the `InferenceService` deployment as follows:
+
[source]
----
oc get pod,isvc -n <namespace>
----
+
Replace the placeholder values with your own.
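+
For example, the complete redeployment sequence with illustrative values (namespace `my-project`, inference service `my-model`) might look like the following:
+
[source]
----
# Illustrative namespace and inference service names; replace with your own.
oc get -n my-project isvc
oc get -n my-project isvc my-model -o yaml > inferenceservice.yml
oc delete -n my-project isvc my-model
oc apply -f inferenceservice.yml
oc get pod,isvc -n my-project
----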
ifdef::upstream[]
. Deploy the model by using the custom runtime as described in {odhdocshome}{default-url}/serving_models/serving-large-models_serving-large-models#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Deploying models on the single-model serving platform].
endif::[]
ifndef::upstream[]
. Deploy the model by using the custom runtime as described in {rhoaidocshome}{default-url}/serving_models/serving-large-models_serving-large-models#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Deploying models on the single-model serving platform].
endif::[]

.Verification

* If you have configured the vLLM model-serving runtime for speculative decoding, use the following example command to verify API requests to your deployed model:
+
[source]
----
curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>"
----

@dtrifiro commented on Aug 13, 2024:
It might be worth mentioning that the `Authorization` header is only required if `--api-key` is added to the vLLM command-line arguments (in the `InferenceService` or `ServingRuntime`). From `vllm --help`:

  --api-key API_KEY     If provided, the server will require this key to be
                        presented in the header.

See https://docs.vllm.ai/en/v0.5.4/serving/openai_compatible_server.html and/or https://docs.vllm.ai/en/v0.5.4/serving/env_vars.html#environment-variables

Reply: Not really. This Authorization header is intended for the Authorino token, not for the vLLM API key. We don't document the API key for our vLLM server, so I don't think it needs to be included here; I'd keep the doc as it is.
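The verification command shown above sends only headers; a complete test request to the OpenAI-compatible `/v1/chat/completions` endpoint also includes a JSON body. The following sketch is illustrative only; replace the model name and prompt with your own values:
+
[source]
----
# <model_name> is the name the model is served under; the prompt is an example.
curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-d '{"model": "<model_name>",
     "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
----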
* If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the vision-language model (VLM) that you have deployed:
+
[source]
----
curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-d '{"model":"<model_name>",
     "messages":
     [{"role":"<role>",
       "content":
       [{"type":"text", "text":"<text>"
        },
        {"type":"image_url", "image_url":"<image_url_link>"
        }
       ]
      }
     ]
    }'
----

Review comment: You could also add an example with the /v1/completions endpoint. In that case:

  -d '{"model":"<model_name>", "prompt": "<text>"}' ...

Reply: @dtrifiro AFAIK, :443/v1/completions does not handle image URLs or anything like that, so I don't think we can add the completions endpoint.

Review comment: Well, it's there and part of the OpenAI API spec, although it's being deprecated: https://platform.openai.com/docs/guides/completions
Since it should be equivalent to the chat API (albeit a bit simpler), I guess we can leave it out.

[role='_additional-resources']
.Additional resources

* link:https://docs.vllm.ai/en/latest/models/engine_args.html[vLLM Engine Arguments]
* link:https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html[OpenAI Compatible Server]