RHOAIENG-10023: Adding procedure for Speculative Decoding and Multi-Modal Inferencing #406

Merged
4 commits merged into opendatahub-io:main on Aug 19, 2024

Conversation

@syaseen-rh (Contributor) commented Aug 12, 2024

Description

Adds a procedure for configuring speculative decoding and multi-modal inferencing for the vLLM runtime.

How Has This Been Tested?

Local Build

Preview

[Four screenshots of the rendered documentation preview]

modules/optimizing-the-vllm-runtime.adoc (outdated, resolved)
ifndef::upstream[]
* If you are using specialized {productname-short} groups, you are part of the admin group (for example, `oai-admin-group`) in OpenShift.
endif::[]
* To use the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.

Suggested change
* To use the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.
* If you are using the vLLM model-serving runtime for speculative decoding with a draft model, you have stored the original model and the speculative model in the same folder within your S3-compatible object storage.

modules/optimizing-the-vllm-runtime.adoc (outdated, resolved)
- --use-v2-block-manager
----
+
Replace `[NUM_SPECULATIVE_TOKENS]` and `[NGRAM_PROMPT_LOOKUP_MAX]` with your own values.

Suggested change
Replace `[NUM_SPECULATIVE_TOKENS]` and `[NGRAM_PROMPT_LOOKUP_MAX]` with your own values.
Replace `<NUM_SPECULATIVE_TOKENS>` and `<NGRAM_PROMPT_LOOKUP_MAX>` with your own values.
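For context, here is a minimal sketch of how the n-gram speculative decoding arguments could look in the runtime's `containers.args` list with the angle-bracket placeholder style (the surrounding structure and exact flag set are assumptions based on the vLLM 0.5.x CLI, not copied from the module):

[source,yaml]
----
containers:
  - args:
      # serve the model from the mounted model storage path
      - --model=/mnt/models
      # enable prompt-lookup (n-gram) speculative decoding
      - --speculative-model=[ngram]
      - --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
      - --ngram-prompt-lookup-max=<NGRAM_PROMPT_LOOKUP_MAX>
      - --use-v2-block-manager
----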

modules/optimizing-the-vllm-runtime.adoc (outdated, resolved)
====
. Click Update.
+
The *Serving runtimes* page opens and shows the list of runtimes that are installed. Observe that the custom model-serving runtime you updated is shown.

Suggested change
The *Serving runtimes* page opens and shows the list of runtimes that are installed. Observe that the custom model-serving runtime you updated is shown.
The *Serving runtimes* page opens and shows the list of runtimes that are installed. Confirm that the custom model-serving runtime you updated is shown.

The *Serving runtimes* page opens and shows the list of runtimes that are installed. Observe that the custom model-serving runtime you updated is shown.
. For speculative decoding, you must additionally redeploy the `InferenceService` custom resource definition (CRD) for the vLLM model-serving runtime as follows:
.. Log in to the OpenShift CLI.
.. List the available inference services in your namespace.

Suggested change
.. List the available inference services in your namespace.
.. List the available inference services in your namespace:
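As an illustration of this step (the namespace is a placeholder), listing the inference services could look like:

[source,terminal]
----
# list the InferenceService (isvc) resources in your project
oc get isvc -n <namespace>
----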

oc delete -n <namespace> isvc <inference-service-name>
----
Replace the placeholder values with your own.
.. Deploy the modified `InferenceService` CRD using the YAML file that you saved.

Suggested change
.. Deploy the modified `InferenceService` CRD using the YAML file that you saved.
.. Deploy the modified `InferenceService` CRD using the YAML file that you saved:
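For illustration only (the file name is a placeholder), redeploying from the saved manifest would typically be:

[source,terminal]
----
# re-create the InferenceService from the edited YAML file
oc apply -f <inference-service-file>.yaml -n <namespace>
----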

----
Replace placeholder values with your own.
ifdef::upstream[]
. Deploy the model by using the custom runtime as described in {odhdocshome}{default-url}/serving_models/serving-large-models_serving-large-models#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Deploying models on the single-model serving platform].

Suggested change
. Deploy the model by using the custom runtime as described in {odhdocshome}{default-url}/serving_models/serving-large-models_serving-large-models#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Deploying models on the single-model serving platform].
. Deploy the model by using the custom runtime as described in {odhdocshome}{default-url}/serving_models/serving-large-models_serving-large-models#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Deploying models on the single-model serving platform].

assemblies/serving-large-models.adoc (resolved)
+
[source]
----
curl -v https://<inference_endpoint_url>:443/v1/chat/completions


Could also add an example with the /v1/completions endpoint. In that case

-d '{"model":"<model_name>", "prompt": "<text>"}' ...


@dtrifiro AFAIK :443/v1/completions does not handle image URLs or anything of that sort, so based on that I don't think we can add the completions endpoint.


Well it's there and part of the OpenAI API spec, although it's being deprecated: https://platform.openai.com/docs/guides/completions

Since it should be equivalent to the chat API (albeit a bit simpler), I guess we can leave it out.

----
curl -v https://<inference_endpoint_url>:443/v1/chat/completions
-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"
@dtrifiro commented Aug 13, 2024

It might be worth mentioning that the authorization header is only required if you add `--api-key` to the vLLM command-line arguments (in the InferenceService or ServingRuntime):

From `vllm --help` (or see the links below):

--api-key API_KEY     If provided, the server will require this key to be
                        presented in the header.

See https://docs.vllm.ai/en/v0.5.4/serving/openai_compatible_server.html and/or https://docs.vllm.ai/en/v0.5.4/serving/env_vars.html#environment-variables
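A sketch of what that could look like if it were documented (the `--api-key` flag is from the vLLM CLI; whether to cover it here is the open question in the reply below):

[source,terminal]
----
# only relevant if the runtime was started with --api-key <API_KEY>
# (added to the vLLM arguments in the ServingRuntime or InferenceService)
curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <API_KEY>"
----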


Not really. This authorization header is intended for the Authorino token, not for the vLLM API key. We don't provide documentation for the API key on our vLLM server, so I don't think it needs to be included here; let's keep the doc as it is.

----
+
Replace `<NUM_SPECULATIVE_TOKENS>` and `<NGRAM_PROMPT_LOOKUP_MAX>` with your own values.
. To configure the vLLM model-serving runtime for speculative decoding with a draft model, do the following:

Suggested change
. To configure the vLLM model-serving runtime for speculative decoding with a draft model, do the following:
. To configure the vLLM model-serving runtime for speculative decoding with a draft model:
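For reference, a minimal sketch of the draft-model variant of the arguments (flag names are from the vLLM 0.5.x CLI; the paths are placeholders and the exact structure in the module may differ):

[source,yaml]
----
containers:
  - args:
      # original (target) model
      - --model=<path_to_original_model>
      # smaller draft model that proposes speculative tokens
      - --speculative-model=<path_to_speculative_model>
      - --num-speculative-tokens=<NUM_SPECULATIVE_TOKENS>
      - --use-v2-block-manager
----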


You can configure the *vLLM ServingRuntime for KServe* runtime to use speculative decoding, a parallel processing technique to optimize inferencing time for Large Language Models (LLMs).

You can also configure the runtime to support inferencing for Vision-Language modals (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.

Suggested change
You can also configure the runtime to support inferencing for Vision-Language modals (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.
You can also configure the runtime to support inferencing for vision-language models (VLMs). VLMs are a subset of multi-modal models that integrate both visual and textual data.

----
+
Replace `<path_to_speculative_model>` and `<path_to_original_model>` with the paths to the speculative model and original model on your S3-compatible object storage. Replace all other placeholder values with your own.
. To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments as shown:

Suggested change
. To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments as shown:
. To configure the vLLM model-serving runtime for multi-modal inferencing, add the following arguments:

-H "Content-Type: application/json"
-H "Authorization: Bearer <token>"
----
* If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the Vision-Language Model (VLM) that you have deployed:

Suggested change
* If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the Vision-Language Model (VLM) that you have deployed:
* If you have configured the vLLM model-serving runtime for multi-modal inferencing, use the following example command to verify API requests to the vision-language model (VLM) that you have deployed:
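For context, a hedged sketch of such a verification request, using the OpenAI-style chat completions payload with an image URL (all values are placeholders):

[source,terminal]
----
curl -v https://<inference_endpoint_url>:443/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
        "model": "<model_name>",
        "messages": [
          {
            "role": "user",
            "content": [
              {"type": "text", "text": "Describe this image."},
              {"type": "image_url", "image_url": {"url": "<image_url>"}}
            ]
          }
        ]
      }'
----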

@tarukumar left a comment:

Let's wait for @dtrifiro's feedback on the open comment.

@syaseen-rh merged commit ffcfb03 into opendatahub-io:main on Aug 19, 2024.