Support for Caikit Standalone ServingRuntime #343

Merged: 13 commits, Jul 2, 2024
5 changes: 5 additions & 0 deletions modules/about-the-single-model-serving-platform.adoc
@@ -30,6 +30,7 @@ When you have installed KServe, you can use the {productname-short} dashboard to

* *TGIS Standalone ServingRuntime for KServe*: A runtime for serving TGI-enabled models
* *Caikit-TGIS ServingRuntime for KServe*: A composite runtime for serving models in the Caikit format
* *Caikit Standalone ServingRuntime for KServe*: A runtime for serving models in the Caikit embeddings format for embeddings tasks
* *OpenVINO Model Server*: A scalable, high-performance runtime for serving models that are optimized for Intel architectures
* *vLLM ServingRuntime for KServe*: A high-throughput and memory-efficient inference and serving runtime for large language models

@@ -39,6 +40,8 @@ ifdef::upstream[]
* link:https://github.com/IBM/text-generation-inference[Text Generation Inference Server (TGIS)^] is based on an early fork of link:https://github.com/huggingface/text-generation-inference[Hugging Face TGI^]. Red Hat will continue to develop the standalone TGIS runtime to support TGI models. If a model does not work in the current version of {productname-short}, support might be added in a future version. In the meantime, you can also add your own, custom runtime to support a TGI model. For more information, see link:{odhdocshome}/serving-models/#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Adding a custom model-serving runtime for the single-model serving platform].

* The composite Caikit-TGIS runtime is based on link:https://github.com/opendatahub-io/caikit[Caikit^] and link:https://github.com/IBM/text-generation-inference[Text Generation Inference Server (TGIS)^]. To use this runtime, you must convert your models to Caikit format. For an example, see link:https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/built-tip.md#bootstrap-process[Converting Hugging Face Hub models to Caikit format^] in the link:https://github.com/opendatahub-io/caikit-tgis-serving/tree/main[caikit-tgis-serving^] repository.

* The Caikit Standalone runtime is based on link:https://github.com/caikit/caikit-nlp/tree/main[Caikit NLP^]. To use this runtime, you must convert your models to the Caikit embeddings format. For an example, see link:https://github.com/markstur/caikit-embeddings/blob/df9c9bc93187c0a17cb66b86d609f2cd102be97d/demo/server/bootstrap_model.py[Bootstrap Model^].
====
endif::[]

@@ -48,6 +51,8 @@ ifndef::upstream[]
* link:https://github.com/IBM/text-generation-inference[Text Generation Inference Server (TGIS)^] is based on an early fork of link:https://github.com/huggingface/text-generation-inference[Hugging Face TGI^]. Red Hat will continue to develop the standalone TGIS runtime to support TGI models. If a model does not work in the current version of {productname-short}, support might be added in a future version. In the meantime, you can also add your own, custom runtime to support a TGI model. For more information, see link:{rhoaidocshome}{default-format-url}/serving_models/serving-large-models_serving-large-models#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Adding a custom model-serving runtime for the single-model serving platform].

* The composite Caikit-TGIS runtime is based on link:https://github.com/opendatahub-io/caikit[Caikit^] and link:https://github.com/IBM/text-generation-inference[Text Generation Inference Server (TGIS)^]. To use this runtime, you must convert your models to Caikit format. For an example, see link:https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/built-tip.md#bootstrap-process[Converting Hugging Face Hub models to Caikit format^] in the link:https://github.com/opendatahub-io/caikit-tgis-serving/tree/main[caikit-tgis-serving^] repository.

* The Caikit Standalone runtime is based on link:https://github.com/caikit/caikit-nlp/tree/main[Caikit NLP^]. To use this runtime, you must convert your models to the Caikit embeddings format. For an example, see link:https://github.com/markstur/caikit-embeddings/blob/df9c9bc93187c0a17cb66b86d609f2cd102be97d/demo/server/bootstrap_model.py[Bootstrap Model^]; a minimal command-line sketch follows this note.
====
endif::[]
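
The conversion step mentioned in the note above can be illustrated with a short command-line sketch. This is only a sketch, assuming the demo repository's bootstrap script runs with default arguments; follow the linked example for the exact invocation and dependencies.

[source]
----
# Hypothetical sketch: save a Hugging Face sentence-transformers model in the
# Caikit embeddings format by running the demo bootstrap script. The install
# step and script arguments are assumptions; see the repository's instructions.
git clone https://github.com/markstur/caikit-embeddings.git
cd caikit-embeddings
pip install -r requirements.txt
python demo/server/bootstrap_model.py
----

The resulting Caikit-formatted model directory can then be uploaded to your object storage for deployment.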

@@ -31,6 +31,36 @@ The inference endpoint for the model is shown in the *Inference endpoint* field.
// * `:443/api/v1/task/text-classification`
// * `:443/api/v1/task/token-classification`

*Caikit Standalone ServingRuntime for KServe*

.REST endpoints
* `/api/v1/task/embedding`
* `/api/v1/task/embedding-tasks`
* `/api/v1/task/sentence-similarity`
* `/api/v1/task/sentence-similarity-tasks`
* `/api/v1/task/rerank`
* `/api/v1/task/rerank-tasks`

.gRPC endpoints
* `:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict`
* `:443 caikit.runtime.Nlp.NlpService/EmbeddingTasksPredict`
* `:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTaskPredict`
* `:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTasksPredict`
* `:443 caikit.runtime.Nlp.NlpService/RerankTaskPredict`
* `:443 caikit.runtime.Nlp.NlpService/RerankTasksPredict`
+
ifdef::upstream[]
NOTE: By default, the Caikit Standalone Runtime exposes REST endpoints for use. To use gRPC protocol, manually deploy a custom Caikit Standalone ServingRuntime. For more information, see link:{odhdocshome}/serving-models/#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Adding a custom model-serving runtime for the single-model serving platform].

An example manifest is available in the link:https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/custom-manifests/caikit/caikit-standalone/caikit-standalone-servingruntime-grpc.yaml[caikit-tgis-serving GitHub repository^].
endif::[]

ifndef::upstream[]
NOTE: By default, the Caikit Standalone Runtime exposes REST endpoints for use. To use gRPC protocol, manually deploy a custom Caikit Standalone ServingRuntime. For more information, see link:{rhoaidocshome}{default-format-url}/serving_models/serving-large-models_serving-large-models#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Adding a custom model-serving runtime for the single-model serving platform].

An example manifest is available in the link:https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/custom-manifests/caikit/caikit-standalone/caikit-standalone-servingruntime-grpc.yaml[caikit-tgis-serving GitHub repository^].
endif::[]
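
As a minimal sketch of that manual deployment (assuming the `oc` CLI, permission to create serving runtimes in your project, and that the raw URL derived from the linked manifest is reachable), you could fetch and apply the example manifest from the command line:

[source]
----
# Sketch only: download the example gRPC ServingRuntime manifest and create it
# in your data science project. <project_namespace> is a placeholder.
curl -LO https://raw.githubusercontent.com/opendatahub-io/caikit-tgis-serving/main/demo/kserve/custom-manifests/caikit/caikit-standalone/caikit-standalone-servingruntime-grpc.yaml
oc apply -n <project_namespace> -f caikit-standalone-servingruntime-grpc.yaml
----

The linked procedure describes adding the runtime through the dashboard; the commands above are only an illustration of the manifest-based alternative.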

*TGIS Standalone ServingRuntime for KServe*

* `:443 fmaas.GenerationService/Generate`
@@ -51,15 +81,17 @@ NOTE: To query the endpoint for the TGIS standalone runtime, you must also downl
* `:443/v1/completions`
* `:443/v1/embeddings`
+
NOTE: The vLLM runtime is compatible with the OpenAI REST API. For a list of models supported by the vLLM runtime, see link:https://docs.vllm.ai/en/latest/models/supported_models.html[Supported models].
NOTE: The vLLM runtime is compatible with the OpenAI REST API. For a list of models that the vLLM runtime supports, see link:https://docs.vllm.ai/en/latest/models/supported_models.html[Supported models].
+
NOTE: To use the embeddings inference endpoint in vLLM, you must use an embeddings model that is supported by vLLM. You cannot use the embeddings endpoint with generative models. For more information, see link:https://github.com/vllm-project/vllm/pull/3734[Supported embeddings models in vLLM].
NOTE: To use the embeddings inference endpoint in vLLM, you must use an embeddings model that vLLM supports. You cannot use the embeddings endpoint with generative models. For more information, see link:https://github.com/vllm-project/vllm/pull/3734[Supported embeddings models in vLLM].
+

As indicated by the paths shown, the single-model serving platform uses the HTTPS port of your OpenShift router (usually port 443) to serve external API requests.
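
If you prefer the command line, one way to retrieve the endpoint (a sketch, assuming the `oc` CLI and that your deployment is exposed as a KServe `InferenceService` named after the model) is to read the `status.url` field:

[source]
----
# Sketch: print the external URL of the deployed model. <model_name> and
# <project_name> are placeholders for your deployment and project.
oc get inferenceservice <model_name> -n <project_name> -o jsonpath='{.status.url}'
----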
--

. Use the endpoint to make API requests to your deployed model, as shown in the following example commands.
. Use the endpoint to make API requests to your deployed model, as shown in the following example commands:
+
NOTE: If you enabled token authorization when deploying the model, add the `Authorization` header and specify a token value.
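+
For illustration, you can store the token in a shell variable once and reuse it in the example commands that follow. This is only a sketch; the placeholder `<token_value>` stands for the token shown for your deployment.
+
[source]
----
# Hypothetical: keep the token in an environment variable and pass it as
# -H "Authorization: Bearer ${TOKEN}" in the commands below.
export TOKEN=<token_value>
----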

ifdef::upstream[]
+
@@ -69,29 +101,40 @@ ifdef::upstream[]
----
curl --json '{"model_id": "<model_name>", "inputs": "<text>"}' \
https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation \
-H 'Authorization: Bearer <token>' <1>
-H 'Authorization: Bearer <token>'
----

*Caikit Standalone ServingRuntime for KServe*

.REST
[source]
----
curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'
----

.gRPC
[source]
----
grpcurl -insecure -d '{"text": "<text>"}' -H "mm-model-id: <model_id>" <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -H 'Authorization: Bearer <token>'
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.

*TGIS Standalone ServingRuntime for KServe*
[source]
----
grpcurl -proto text-generation-inference/proto/generation.proto -d \
'{"requests": [{"text":"<text>"}]}' \
-insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate \
-H 'Authorization: Bearer <token>' <1>
-H 'Authorization: Bearer <token>'
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.
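
The `-proto` flag in this example expects `generation.proto` to be available locally. A minimal way to obtain it (a sketch, assuming `git` is installed) is to clone the TGIS repository referenced in this document:

[source]
----
# Fetch the TGIS sources so that text-generation-inference/proto/generation.proto
# exists relative to your working directory.
git clone https://github.com/IBM/text-generation-inference.git
----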

*OpenVINO Model Server*
[source]
----
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d \
'{ "model_name": "<model_name>", \
"inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' \
-H 'Authorization: Bearer <token>' <1>
-H 'Authorization: Bearer <token>'
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.

*vLLM ServingRuntime for KServe*
[source]
@@ -101,9 +144,8 @@ curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H \
"messages": [{ \
"role": "<role>", \
"content": "<content>" \
}] -H 'Authorization: Bearer <token>' <1>
}]}' -H 'Authorization: Bearer <token>'
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.
--
endif::[]
ifdef::self-managed,cloud-service[]
@@ -113,36 +155,46 @@ ifdef::self-managed,cloud-service[]
*Caikit TGIS ServingRuntime for KServe*
[source]
----
curl --json '{"model_id": "<model_name>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>' <1>
curl --json '{"model_id": "<model_name__>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>'
----

*Caikit Standalone ServingRuntime for KServe*

.REST
[source]
----
curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'
----

.gRPC
[source]
----
grpcurl -insecure -d '{"text": "<text>"}' -H "mm-model-id: <model_id>" <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -H 'Authorization: Bearer <token>'
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.

*TGIS Standalone ServingRuntime for KServe*
[source]
----
grpcurl -proto text-generation-inference/proto/generation.proto -d '{"requests": [{"text":"<text>"}]}' -H 'Authorization: Bearer <token>' -insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate <1>
grpcurl -proto text-generation-inference/proto/generation.proto -d '{"requests": [{"text":"<text>"}]}' -H 'Authorization: Bearer <token>' -insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.

*OpenVINO Model Server*
[source]
----
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>' <1>
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.

*vLLM ServingRuntime for KServe*
[source]
----
curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }] -H 'Authorization: Bearer <token>' <1>
curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }]}' -H 'Authorization: Bearer <token>'
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.
--
endif::[]

[role='_additional-resources']
.Additional resources
* link:https://github.com/IBM/text-generation-inference[Text Generation Inference Server (TGIS)^]
* link:https://caikit.readthedocs.io/en/latest/autoapi/caikit/index.html[Caikit API documentation^]
* link:https://github.com/markstur/caikit-embeddings[Caikit Text Embedding GitHub project^]
* link:https://docs.openvino.ai/2023.3/ovms_docs_rest_api_kfs.html[OpenVINO KServe-compatible REST API documentation^]
* link:https://platform.openai.com/docs/api-reference/introduction[OpenAI API documentation]
1 change: 1 addition & 0 deletions modules/enabling-the-single-model-serving-platform.adoc
@@ -30,6 +30,7 @@ endif::[]
The *Serving runtimes* page shows any custom runtimes that you have added, as well as the following pre-installed runtimes:
+
** *Caikit TGIS ServingRuntime for KServe*
** *Caikit Standalone ServingRuntime for KServe*
** *OpenVINO Model Server*
** *TGIS Standalone ServingRuntime for KServe*
** *vLLM ServingRuntime for KServe*
@@ -42,6 +42,13 @@ sum(increase(vllm:request_success_total{namespace='${namespace}',model_name='${m
sum(increase(tgi_request_success{namespace=${namespace}, pod=~'${model_name}-predictor-.*'}[${rate_interval}]))
----

.. The following query displays the number of successful inference requests over a period of time for a model deployed with the Caikit Standalone runtime:
+
[source,subs="+quotes"]
----
sum(increase(predict_rpc_count_total{namespace='${namespace}',code='OK',model_id='${model_name}'}[${rate_interval}]))
----
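+
To try this query outside the dashboard, one option (a sketch, assuming the default OpenShift monitoring stack and that your user is allowed to query the Thanos querier) is to send it to the Prometheus-compatible HTTP API:
+
[source]
----
# Sketch: run the Caikit Standalone success-count query against the cluster's
# Thanos querier. The 5m window and placeholders are assumptions; adjust them.
THANOS_HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
TOKEN=$(oc whoami -t)
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode "query=sum(increase(predict_rpc_count_total{namespace='<project_name>',code='OK',model_id='<model_name>'}[5m]))" \
  "https://${THANOS_HOST}/api/v1/query"
----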

.. The following query displays the number of successful inference requests over a period of time for a model deployed with the OpenVINO Model Server runtime:
+
[source,subs="+quotes"]
@@ -67,7 +74,14 @@ sum(increase(vllm:request_success_total{namespace='${namespace}',model_name='${m
sum(increase(tgi_request_success{namespace=${namespace}, pod=~'${model_name}-predictor-.*'}[${rate_interval}]))
----

.. The following query displays the number of successful inference requests over a periof of time for a model deployed with the OpenVINO Model Server runtime:
.. The following query displays the number of successful inference requests over a period of time for a model deployed with the Caikit Standalone runtime:
+
[source,subs="+quotes"]
----
sum(increase(predict_rpc_count_total{namespace='${namespace}',code='OK',model_id='${model_name}'}[${rate_interval}]))
----

.. The following query displays the number of successful inference requests over a period of time for a model deployed with the OpenVINO Model Server runtime:
+
[source,subs="+quotes"]
----