RHOAIENG-10023: Adding procedure for Speculative Decoding and Multi-Modal Inferencing #406
Conversation
ifndef::upstream[]
* If you are using specialized {productname-short} groups, you are part of the admin group (for example, `oai-admin-group`) in OpenShift.
endif::[]
* To use the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.
* To use the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.
* If you are using the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.
- --use-v2-block-manager
----
+
Replace `[NUM_SPECULATIVE_TOKENS]` and `[NGRAM_PROMPT_LOOKUP_MAX]` with your own values.
Replace `[NUM_SPECULATIVE_TOKENS]` and `[NGRAM_PROMPT_LOOKUP_MAX]` with your own values.
Replace `<NUM_SPECULATIVE_TOKENS>` and `<NGRAM_PROMPT_LOOKUP_MAX>` with your own values.
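For context, the surrounding runtime arguments for ngram-based speculative decoding might look like the following sketch. This is an assumption, not a quote from the doc under review: the flag names are taken from vLLM v0.5.x (`--speculative-model "[ngram]"` enables ngram speculation), and the container name and model path are illustrative.

```yaml
# Hypothetical excerpt of a ServingRuntime container spec (names illustrative)
containers:
  - name: kserve-container
    args:
      - --model=/mnt/models
      - --speculative-model=[ngram]
      - --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
      - --ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX>
      - --use-v2-block-manager
```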
====
. Click Update.
+
The *Serving runtimes* page opens and shows the list of runtimes that are installed. Observe that the custom model-serving runtime you updated is shown.
The *Serving runtimes* page opens and shows the list of runtimes that are installed. Observe that the custom model-serving runtime you updated is shown.
The *Serving runtimes* page opens and shows the list of runtimes that are installed. Confirm that the custom model-serving runtime you updated is shown.
. For speculative decoding, you must additionally redeploy the `InferenceService` custom resource definition (CRD) for the vLLM model-serving runtime as follows:
.. Log in to the OpenShift CLI.
.. List the available inference services in your namespace.
.. List the available inference services in your namespace.
.. List the available inference services in your namespace:
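A minimal sketch of the listing step, assuming the `isvc` short name that the delete command in this procedure uses (namespace is a placeholder):

```shell
# List InferenceService resources in the target namespace (placeholder values)
oc get isvc -n <namespace>
```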
oc delete -n <namespace> isvc <inference-service-name>
----
Replace the placeholder values with your own.
.. Deploy the modified `InferenceService` CRD using the YAML file that you saved.
.. Deploy the modified `InferenceService` CRD using the YAML file that you saved.
.. Deploy the modified `InferenceService` CRD using the YAML file that you saved:
----
Replace placeholder values with your own.
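A sketch of the redeploy step, assuming the modified CRD was saved to a local YAML file (the file name and namespace are placeholders, not from the doc under review):

```shell
# Apply the saved InferenceService manifest (placeholder file and namespace)
oc apply -f <inference-service-file>.yaml -n <namespace>
```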
ifdef::upstream[]
. Deploy the model by using the custom runtime as described in {odhdocshome}{default-url}/serving_models/serving-large-models_serving-large-models#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Deploying models on the single-model serving platform].
. Deploy the model by using the custom runtime as described in {odhdocshome}{default-url}/serving_models/serving-large-models_serving-large-models#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Deploying models on the single-model serving platform].
+
[source]
----
curl -v https://<inference_endpoint_url>:443/v1/chat/completions
Could also add an example with the `/v1/completions` endpoint. In that case:
-d '{"model":"<model_name>", "prompt": "<text>"}' ...
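For illustration, the reviewer's suggested `/v1/completions` request might be sketched as follows. All values are placeholders, and the `Authorization` header is only relevant if the endpoint is protected; this is a sketch of the suggestion, not text from the doc:

```shell
# Placeholder /v1/completions request (endpoint, token, and payload are illustrative)
curl -v https://<inference_endpoint_url>:443/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{"model": "<model_name>", "prompt": "<text>"}'
```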
@dtrifiro AFAIK :443/v1/completions
does not handle image URLs or anything of the sort, so based on that I don't think we can add the completions endpoint.
Well, it's there and part of the OpenAI API spec, although it's being deprecated: https://platform.openai.com/docs/guides/completions
Since it should be equivalent to the chat API (albeit a bit simpler), I guess we can leave it out.
----
curl -v https://<inference_endpoint_url>:443/v1/chat/completions
-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"
Might be worth mentioning that the authorization header is only required if `--api-key` is added to the vLLM command-line arguments (`InferenceService` or `ServingRuntime`). From `vllm --help`:

--api-key API_KEY  If provided, the server will require this key to be presented in the header.

See https://docs.vllm.ai/en/v0.5.4/serving/openai_compatible_server.html and/or https://docs.vllm.ai/en/v0.5.4/serving/env_vars.html#environment-variables
Not really. This authorization is intended for the Authorino token, not for the vLLM API key. We don't document the API key for our vLLM server, so I don't think it needs to be included here; I'd keep the doc as it is.
Force-pushed from d7cd44d to 325edbb
----
+
Replace `<NUM_SPECULATIVE_TOKENS>` and `<NGRAM_PROMPT_LOOKUP_MAX>` with your own values.
. To configure the vLLM model-serving runtime for speculative decoding with a draft model, do the following:
. To configure the vLLM model-serving runtime for speculative decoding with a draft model, do the following:
. To configure the vLLM model-serving runtime for speculative decoding with a draft model:
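A sketch of the draft-model configuration, using the placeholder paths referenced later in this thread. This is an assumption about the shape of the arguments, not a quote from the diff: the flag names follow vLLM v0.5.x, and the `/mnt/models` mount point and token count are illustrative.

```yaml
# Hypothetical draft-model arguments (paths and values are placeholders)
args:
  - --model=/mnt/models/<path_to_original_model>
  - --speculative-model=/mnt/models/<path_to_speculative_model>
  - --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
  - --use-v2-block-manager
```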
|
You can configure the *vLLM ServingRuntime for KServe* runtime to use speculative decoding, a parallel processing technique to optimize inferencing time for Large Language Models (LLMs).

You can also configure the runtime to support inferencing for Vision-Language modals (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.
You can also configure the runtime to support inferencing for Vision-Language modals (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.
You can also configure the runtime to support inferencing for vision-language models (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.
----
+
Replace `<path_to_speculative_model>` and `<path_to_original_model>` with the paths to the speculative model and original model on your S3-compatible object storage. Replace all other placeholder values with your own.
. To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments as shown:
. To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments as shown:
. To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments:
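The multi-modal arguments themselves are not visible in this hunk. One plausible sketch follows; both flags exist in vLLM v0.5.x, but whether these are the exact arguments the doc adds is an assumption:

```yaml
# Hypothetical multi-modal arguments (whether the doc uses these exact flags is an assumption)
args:
  - --model=/mnt/models
  - --trust-remote-code            # required by some VLM architectures
  - --chat-template=<path_to_chat_template>  # e.g. a LLaVA Jinja template
```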
-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"
----
* If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the Vision-Language Model (VLM) that you have deployed:
* If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the Vision-Language Model (VLM) that you have deployed:
* If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the vision-language model (VLM) that you have deployed:
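For reference, a VLM request in the OpenAI-compatible chat format might look like the following sketch. All values are placeholders; the message shape follows the OpenAI vision-style `content` array that vLLM's chat endpoint accepts:

```shell
# Placeholder VLM request (endpoint, token, model, text, and image URL are illustrative)
curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
        "model": "<model_name>",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "<text>"},
            {"type": "image_url", "image_url": {"url": "<image_url>"}}
          ]
        }]
      }'
```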
Let's wait for @dtrifiro's feedback on the open comment.
Description
Adding procedure to configure speculative decoding and multi-modal inferencing for the vLLM runtime
How Has This Been Tested?
Local Build
Preview