# Feature Serving and Model Inference

Production machine learning systems can choose from four approaches to serving machine learning predictions (the output
of model inference):
1. Online model inference with online features
2. Precomputed (batch) model predictions without online features
3. Online model inference with online features and cached predictions
4. Online model inference without features

*Note: online features can be sourced from batch, streaming, or request data sources.*

These four approaches have different tradeoffs but, in general, differ significantly in implementation.

## 1. Online Model Inference with Online Features
Online model inference with online features is a powerful approach to serving data-driven machine learning applications.
This approach requires a feature store to serve online features and a model server to serve model predictions (e.g., KServe).
It is particularly useful for applications where request-time data is required to run inference.
```python
features = store.get_online_features(
    features=[
        "user_data:click_through_rate",
        "user_data:number_of_clicks",
        "user_data:average_page_duration",
    ],
    entity_rows=[{"user_id": 1}],
)
model_predictions = model_server.predict(features)
```
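
In practice, the `OnlineResponse` returned by `get_online_features` usually has to be converted into whatever input
format the model server expects. A minimal sketch, assuming a hypothetical `model_server` that accepts one row of
feature values per entity and the default (unprefixed) feature names:
```python
# Convert the Feast response into a plain dict of feature name -> list of values (one per entity row).
feature_dict = features.to_dict()

# Assemble a single input row; the feature ordering here is an illustrative assumption.
input_row = [
    feature_dict["click_through_rate"][0],
    feature_dict["number_of_clicks"][0],
    feature_dict["average_page_duration"][0],
]
model_predictions = model_server.predict([input_row])
```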

## 2. Precomputed (Batch) Model Predictions without Online Features
Machine learning teams typically find serving precomputed model predictions to be the most straightforward approach to
implement. It simply treats the model predictions as a feature and serves them from the feature store using the
standard Feast SDK.
```python
model_predictions = store.get_online_features(
    features=[
        "user_data:model_predictions",
    ],
    entity_rows=[{"user_id": 1}],
)
```
Notice that the model server is not involved in this approach. Instead, the model predictions are precomputed and
materialized to the online store.
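
How those predictions get precomputed is left to the team; a minimal sketch of the upstream batch job, assuming a
hypothetical `batch_model`, an input DataFrame `batch_scoring_df`, and the `user_data` feature view used above with a
`model_predictions` field:
```python
import pandas as pd

# Hypothetical nightly batch job: score every known user offline.
# `batch_scoring_df` is assumed to hold one row per user with the model's input columns.
predictions_df = pd.DataFrame({
    "user_id": batch_scoring_df["user_id"],
    "model_predictions": batch_model.predict(batch_scoring_df),
    "event_timestamp": pd.Timestamp.now(tz="UTC"),
})

# Push the precomputed predictions into the online store so they can be served
# with get_online_features, exactly as shown above.
store.write_to_online_store(feature_view_name="user_data", df=predictions_df)
```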

While this approach can deliver quick impact for a variety of business use cases, it suffers from stale data and can
only serve users/entities that were available at the time of the batch computation. In some cases, this tradeoff may be
tolerable.

## 3. Online Model Inference with Online Features and Cached Predictions
This is the most sophisticated approach: inference is optimized for low latency by caching predictions and running
model inference when data producers write features to the online store. It is particularly useful for applications
where features come from multiple data sources, the model is computationally expensive to run, or latency is a
significant constraint.

```python
import pandas as pd

# Client Reads
features = store.get_online_features(
    features=[
        "user_data:click_through_rate",
        "user_data:number_of_clicks",
        "user_data:average_page_duration",
        "user_data:model_predictions",
    ],
    entity_rows=[{"user_id": 1}],
)
# If no cached prediction exists for this entity, run inference and cache the result.
if features.to_dict()["model_predictions"][0] is None:
    model_predictions = model_server.predict(features)
    store.write_to_online_store(feature_view_name="user_data", df=pd.DataFrame(model_predictions))
```
Note that in this case a separate call to `write_to_online_store` is required whenever the underlying data changes and
the predictions change along with it.

```python
import pandas as pd

# Client Writes from the Data Producer
user_data = request.POST.get('user_data')
model_predictions = model_server.predict(user_data)  # assume this includes `user_data` in the DataFrame
store.write_to_online_store(feature_view_name="user_data", df=pd.DataFrame(model_predictions))
```
While this approach requires an additional write from every data producer, it results in the lowest latency for model
inference.

## 4. Online Model Inference without Features
This approach does not require Feast. The model server can directly serve predictions without any features. It is
common with Large Language Models (LLMs) and other models that do not require features to make predictions.

Note that generative models using Retrieval Augmented Generation (RAG) do require features, where the
[document embeddings](../../reference/alpha-vector-database.md) are treated as features, which Feast supports
(this would fall under "Online Model Inference with Online Features").
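
For the RAG case, a hedged sketch of retrieving document embeddings through Feast's alpha vector database support
(linked above). The feature view name, embedding model, and `build_prompt` helper are illustrative assumptions, and the
`retrieve_online_documents` API is alpha and may change between Feast versions:
```python
question = "How do I serve features online?"
query_embedding = embedding_model.encode(question)  # hypothetical embedding model

# Retrieve the closest document embeddings from the online (vector) store.
documents = store.retrieve_online_documents(
    feature="document_embeddings:embedding",
    query=list(query_embedding),
    top_k=3,
).to_dict()

# The retrieved documents are then injected into the LLM prompt as additional context.
response = llm.generate(build_prompt(question=question, context=documents))  # hypothetical helpers
```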