Support for Caikit Standalone ServingRuntime #343

Merged: 13 commits, Jul 2, 2024
5 changes: 5 additions & 0 deletions modules/about-the-single-model-serving-platform.adoc
@@ -30,6 +30,7 @@ When you have installed KServe, you can use the {productname-short} dashboard to

* *TGIS Standalone ServingRuntime for KServe*: A runtime for serving TGI-enabled models
* *Caikit-TGIS ServingRuntime for KServe*: A composite runtime for serving models in the Caikit format
* *Caikit Standalone ServingRuntime for KServe*: A runtime for serving models in the Caikit embeddings format for embeddings tasks
* *OpenVINO Model Server*: A scalable, high-performance runtime for serving models that are optimized for Intel architectures
* *vLLM ServingRuntime for KServe*: A high-throughput and memory-efficient inference and serving runtime for large language models

@@ -39,6 +40,8 @@ ifdef::upstream[]
* link:https://github.com/IBM/text-generation-inference[Text Generation Inference Server (TGIS)^] is based on an early fork of link:https://github.com/huggingface/text-generation-inference[Hugging Face TGI^]. Red Hat will continue to develop the standalone TGIS runtime to support TGI models. If a model does not work in the current version of {productname-short}, support might be added in a future version. In the meantime, you can also add your own, custom runtime to support a TGI model. For more information, see link:{odhdocshome}/serving-models/#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Adding a custom model-serving runtime for the single-model serving platform].

* The composite Caikit-TGIS runtime is based on link:https://github.com/opendatahub-io/caikit[Caikit^] and link:https://github.com/IBM/text-generation-inference[Text Generation Inference Server (TGIS)^]. To use this runtime, you must convert your models to Caikit format. For an example, see link:https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/built-tip.md#bootstrap-process[Converting Hugging Face Hub models to Caikit format^] in the link:https://github.com/opendatahub-io/caikit-tgis-serving/tree/main[caikit-tgis-serving^] repository.

* The Caikit Standalone runtime is based on link:https://github.com/caikit/caikit-nlp/tree/main[Caikit NLP^]. To use this runtime, you must convert your models to the Caikit embeddings format. For an example, see link:https://github.com/markstur/caikit-embeddings/blob/df9c9bc93187c0a17cb66b86d609f2cd102be97d/demo/server/bootstrap_model.py[Bootstrap Model^].
====
endif::[]

@@ -48,6 +51,8 @@ ifndef::upstream[]
* link:https://github.com/IBM/text-generation-inference[Text Generation Inference Server (TGIS)^] is based on an early fork of link:https://github.com/huggingface/text-generation-inference[Hugging Face TGI^]. Red Hat will continue to develop the standalone TGIS runtime to support TGI models. If a model does not work in the current version of {productname-short}, support might be added in a future version. In the meantime, you can also add your own, custom runtime to support a TGI model. For more information, see link:{rhoaidocshome}{default-format-url}/serving_models/serving-large-models_serving-large-models#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Adding a custom model-serving runtime for the single-model serving platform].

* The composite Caikit-TGIS runtime is based on link:https://github.com/opendatahub-io/caikit[Caikit^] and link:https://github.com/IBM/text-generation-inference[Text Generation Inference Server (TGIS)^]. To use this runtime, you must convert your models to Caikit format. For an example, see link:https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/built-tip.md#bootstrap-process[Converting Hugging Face Hub models to Caikit format^] in the link:https://github.com/opendatahub-io/caikit-tgis-serving/tree/main[caikit-tgis-serving^] repository.

* The Caikit Standalone runtime is based on link:https://github.com/caikit/caikit-nlp/tree/main[Caikit NLP^]. To use this runtime, you must convert your models to the Caikit embeddings format. For an example, see link:https://github.com/markstur/caikit-embeddings/blob/df9c9bc93187c0a17cb66b86d609f2cd102be97d/demo/server/bootstrap_model.py[Bootstrap Model^]; a minimal command-line sketch follows this note.
====
endif::[]
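
The conversion step mentioned in the note above can be illustrated with a short command-line sketch. This is only a sketch, assuming the demo repository's bootstrap script runs with default arguments; follow the linked example for the exact invocation and dependencies.

[source]
----
# Hypothetical sketch: save a Hugging Face sentence-transformers model in the
# Caikit embeddings format by running the demo bootstrap script. The install
# step and script arguments are assumptions; see the repository's instructions.
git clone https://github.com/markstur/caikit-embeddings.git
cd caikit-embeddings
pip install -r requirements.txt
python demo/server/bootstrap_model.py
----

The resulting Caikit-formatted model directory can then be uploaded to your object storage for deployment.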

@@ -31,6 +31,36 @@ The inference endpoint for the model is shown in the *Inference endpoint* field.
// * `:443/api/v1/task/text-classification`
// * `:443/api/v1/task/token-classification`

*Caikit Standalone ServingRuntime for KServe*

.REST endpoints
* `/api/v1/task/embedding`
* `/api/v1/task/embedding-tasks`
* `/api/v1/task/sentence-similarity`
* `/api/v1/task/sentence-similarity-tasks`
* `/api/v1/task/rerank`
* `/api/v1/task/rerank-tasks`

.gRPC endpoints
* `:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict`
* `:443 caikit.runtime.Nlp.NlpService/EmbeddingTasksPredict`
* `:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTaskPredict`
* `:443 caikit.runtime.Nlp.NlpService/SentenceSimilarityTasksPredict`
* `:443 caikit.runtime.Nlp.NlpService/RerankTaskPredict`
* `:443 caikit.runtime.Nlp.NlpService/RerankTasksPredict`
+
ifdef::upstream[]
NOTE: By default, the Caikit Standalone Runtime exposes REST endpoints for use. To use gRPC protocol, manually deploy a custom Caikit Standalone ServingRuntime. For more information, see link:{odhdocshome}/serving-models/#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Adding a custom model-serving runtime for the single-model serving platform].

An example manifest is available in the link:https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/custom-manifests/caikit/caikit-standalone/caikit-standalone-servingruntime-grpc.yaml[caikit-tgis-serving GitHub repository^].
endif::[]

ifndef::upstream[]
NOTE: By default, the Caikit Standalone Runtime exposes REST endpoints for use. To use gRPC protocol, manually deploy a custom Caikit Standalone ServingRuntime. For more information, see link:{rhoaidocshome}{default-format-url}/serving_models/serving-large-models_serving-large-models#adding-a-custom-model-serving-runtime-for-the-single-model-serving-platform_serving-large-models[Adding a custom model-serving runtime for the single-model serving platform].

An example manifest is available in the link:https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/custom-manifests/caikit/caikit-standalone/caikit-standalone-servingruntime-grpc.yaml[caikit-tgis-serving GitHub repository^].
endif::[]
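
As a minimal sketch of that manual deployment (assuming the `oc` CLI, permission to create serving runtimes in your project, and that the raw URL derived from the linked manifest is reachable), you could fetch and apply the example manifest from the command line:

[source]
----
# Sketch only: download the example gRPC ServingRuntime manifest and create it
# in your data science project. <project_namespace> is a placeholder.
curl -LO https://raw.githubusercontent.com/opendatahub-io/caikit-tgis-serving/main/demo/kserve/custom-manifests/caikit/caikit-standalone/caikit-standalone-servingruntime-grpc.yaml
oc apply -n <project_namespace> -f caikit-standalone-servingruntime-grpc.yaml
----

The linked procedure describes adding the runtime through the dashboard; the commands above are only an illustration of the manifest-based alternative.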

*TGIS Standalone ServingRuntime for KServe*

* `:443 fmaas.GenerationService/Generate`
@@ -51,15 +81,17 @@ NOTE: To query the endpoint for the TGIS standalone runtime, you must also downl
* `:443/v1/completions`
* `:443/v1/embeddings`
+
NOTE: The vLLM runtime is compatible with the OpenAI REST API. For a list of models supported by the vLLM runtime, see link:https://docs.vllm.ai/en/latest/models/supported_models.html[Supported models].
NOTE: The vLLM runtime is compatible with the OpenAI REST API. For a list of models that the vLLM runtime supports, see link:https://docs.vllm.ai/en/latest/models/supported_models.html[Supported models].
+
NOTE: To use the embeddings inference endpoint in vLLM, you must use an embeddings model that is supported by vLLM. You cannot use the embeddings endpoint with generative models. For more information, see link:https://github.com/vllm-project/vllm/pull/3734[Supported embeddings models in vLLM].
NOTE: To use the embeddings inference endpoint in vLLM, you must use an embeddings model that vLLM supports. You cannot use the embeddings endpoint with generative models. For more information, see link:https://github.com/vllm-project/vllm/pull/3734[Supported embeddings models in vLLM].
+

As indicated by the paths shown, the single-model serving platform uses the HTTPS port of your OpenShift router (usually port 443) to serve external API requests.
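
If you prefer the command line, one way to retrieve the endpoint (a sketch, assuming the `oc` CLI and that your deployment is exposed as a KServe `InferenceService` named after the model) is to read the `status.url` field:

[source]
----
# Sketch: print the external URL of the deployed model. <model_name> and
# <project_name> are placeholders for your deployment and project.
oc get inferenceservice <model_name> -n <project_name> -o jsonpath='{.status.url}'
----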
--

. Use the endpoint to make API requests to your deployed model, as shown in the following example commands.
. Use the endpoint to make API requests to your deployed model, as shown in the following example commands:
+
NOTE: If you enabled token authorization when deploying the model, add the `Authorization` header and specify a token value.
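+
For illustration, you can store the token in a shell variable once and reuse it in the example commands that follow. This is only a sketch; the placeholder `<token_value>` stands for the token shown for your deployment.
+
[source]
----
# Hypothetical: keep the token in an environment variable and pass it as
# -H "Authorization: Bearer ${TOKEN}" in the commands below.
export TOKEN=<token_value>
----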

ifdef::upstream[]
+
@@ -69,29 +101,40 @@ ifdef::upstream[]
----
curl --json '{"model_id": "<model_name>", "inputs": "<text>"}' \
https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation \
-H 'Authorization: Bearer <token>' <1>
-H 'Authorization: Bearer <token>'
----

*Caikit Standalone ServingRuntime for KServe*

.REST
[source]
----
curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'
----

.gRPC
[source]
----
grpcurl -insecure -d '{"text": "<text>"}' -H "mm-model-id: <model_id>" <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -H 'Authorization: Bearer <token>'
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.

*TGIS Standalone ServingRuntime for KServe*
[source]
----
grpcurl -proto text-generation-inference/proto/generation.proto -d \
'{"requests": [{"text":"<text>"}]}' \
-insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate \
-H 'Authorization: Bearer <token>' <1>
-H 'Authorization: Bearer <token>'
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.
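
The `-proto` flag in this example expects `generation.proto` to be available locally. A minimal way to obtain it (a sketch, assuming `git` is installed) is to clone the TGIS repository referenced in this document:

[source]
----
# Fetch the TGIS sources so that text-generation-inference/proto/generation.proto
# exists relative to your working directory.
git clone https://github.com/IBM/text-generation-inference.git
----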

*OpenVINO Model Server*
[source]
----
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d \
'{ "model_name": "<model_name>", \
"inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' \
-H 'Authorization: Bearer <token>' <1>
-H 'Authorization: Bearer <token>'
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.

*vLLM ServingRuntime for KServe*
[source]
@@ -101,9 +144,8 @@ curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H \
"messages": [{ \
"role": "<role>", \
"content": "<content>" \
}] -H 'Authorization: Bearer <token>' <1>
}]}' -H 'Authorization: Bearer <token>'
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.
--
endif::[]
ifdef::self-managed,cloud-service[]
@@ -113,36 +155,46 @@ ifdef::self-managed,cloud-service[]
*Caikit TGIS ServingRuntime for KServe*
[source]
----
curl --json '{"model_id": "<model_name>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>' <1>
curl --json '{"model_id": "<model_name__>", "inputs": "<text>"}' https://<inference_endpoint_url>:443/api/v1/task/server-streaming-text-generation -H 'Authorization: Bearer <token>'
----

*Caikit Standalone ServingRuntime for KServe*

.REST
[source]
----
curl -H 'Content-Type: application/json' -d '{"inputs": "<text>", "model_id": "<model_id>"}' <inference_endpoint_url>/api/v1/task/embedding -H 'Authorization: Bearer <token>'
----

.gRPC
[source]
----
grpcurl -insecure -d '{"text": "<text>"}' -H "mm-model-id: <model_id>" <inference_endpoint_url>:443 caikit.runtime.Nlp.NlpService/EmbeddingTaskPredict -H 'Authorization: Bearer <token>'
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.

*TGIS Standalone ServingRuntime for KServe*
[source]
----
grpcurl -proto text-generation-inference/proto/generation.proto -d '{"requests": [{"text":"<text>"}]}' -H 'Authorization: Bearer <token>' -insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate <1>
grpcurl -proto text-generation-inference/proto/generation.proto -d '{"requests": [{"text":"<text>"}]}' -H 'Authorization: Bearer <token>' -insecure <inference_endpoint_url>:443 fmaas.GenerationService/Generate
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.

*OpenVINO Model Server*
[source]
----
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>' <1>
curl -ks <inference_endpoint_url>/v2/models/<model_name>/infer -d '{ "model_name": "<model_name>", "inputs": [{ "name": "<name_of_model_input>", "shape": [<shape>], "datatype": "<data_type>", "data": [<data>] }]}' -H 'Authorization: Bearer <token>'
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.

*vLLM ServingRuntime for KServe*
[source]
----
curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }] -H 'Authorization: Bearer <token>' <1>
curl -v https://<inference_endpoint_url>:443/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "<role>", "content": "<content>" }]}' -H 'Authorization: Bearer <token>'
----
<1> You must add the `Authorization` header and specify a token value _only_ if you enabled token authorization when deploying the model.
--
endif::[]

[role='_additional-resources']
.Additional resources
* link:https://github.com/IBM/text-generation-inference[Text Generation Inference Server (TGIS)^]
* link:https://caikit.readthedocs.io/en/latest/autoapi/caikit/index.html[Caikit API documentation^]
* link:https://github.com/markstur/caikit-embeddings[Caikit Text Embedding GitHub project^]
* link:https://docs.openvino.ai/2023.3/ovms_docs_rest_api_kfs.html[OpenVINO KServe-compatible REST API documentation^]
* link:https://platform.openai.com/docs/api-reference/introduction[OpenAI API documentation]
1 change: 1 addition & 0 deletions modules/enabling-the-single-model-serving-platform.adoc
@@ -30,6 +30,7 @@ endif::[]
The *Serving runtimes* page shows any custom runtimes that you have added, as well as the following pre-installed runtimes:
+
** *Caikit TGIS ServingRuntime for KServe*
** *Caikit Standalone ServingRuntime for KServe*
** *OpenVINO Model Server*
** *TGIS Standalone ServingRuntime for KServe*
** *vLLM ServingRuntime for KServe*
@@ -42,6 +42,13 @@ sum(increase(vllm:request_success_total{namespace='${namespace}',model_name='${m
sum(increase(tgi_request_success{namespace=${namespace}, pod=~'${model_name}-predictor-.*'}[${rate_interval}]))
----

.. The following query displays the number of successful inference requests over a period of time for a model deployed with the Caikit Standalone runtime:
+
[source,subs="+quotes"]
----
sum(increase(predict_rpc_count_total{namespace='${namespace}',code='OK',model_id='${model_name}'}[${rate_interval}]))
----
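+
To try this query outside the dashboard, one option (a sketch, assuming the default OpenShift monitoring stack and that your user is allowed to query the Thanos querier) is to send it to the Prometheus-compatible HTTP API:
+
[source]
----
# Sketch: run the Caikit Standalone success-count query against the cluster's
# Thanos querier. The 5m window and placeholders are assumptions; adjust them.
THANOS_HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
TOKEN=$(oc whoami -t)
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode "query=sum(increase(predict_rpc_count_total{namespace='<project_name>',code='OK',model_id='<model_name>'}[5m]))" \
  "https://${THANOS_HOST}/api/v1/query"
----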

.. The following query displays the number of successful inference requests over a period of time for a model deployed with the OpenVINO Model Server runtime:
+
[source,subs="+quotes"]
@@ -67,7 +74,14 @@ sum(increase(vllm:request_success_total{namespace='${namespace}',model_name='${m
sum(increase(tgi_request_success{namespace=${namespace}, pod=~'${model_name}-predictor-.*'}[${rate_interval}]))
----

.. The following query displays the number of successful inference requests over a periof of time for a model deployed with the OpenVINO Model Server runtime:
.. The following query displays the number of successful inference requests over a period of time for a model deployed with the Caikit Standalone runtime:
+
[source,subs="+quotes"]
----
sum(increase(predict_rpc_count_total{namespace='${namespace}',code='OK',model_id='${model_name}'}[${rate_interval}]))
----

.. The following query displays the number of successful inference requests over a period of time for a model deployed with the OpenVINO Model Server runtime:
+
[source,subs="+quotes"]
----