# Feature Serving and Model Inference

Production machine learning systems can choose from four approaches to serving machine learning predictions (the output
of model inference):
1. Online model inference with online features
2. Precomputed (batch) model predictions without online features
3. Online model inference with online features and cached predictions
4. Online model inference without features

*Note: online features can be sourced from batch, streaming, or request data sources.*

These four approaches have different tradeoffs but, in general, differ significantly in implementation.

## 1. Online Model Inference with Online Features
Online model inference with online features is a powerful approach to serving data-driven machine learning applications.
This approach requires a feature store to serve online features and a model server to serve model predictions (e.g., KServe).
It is particularly useful for applications where request-time data is required to run inference.
```python
features = store.get_online_features(
    features=[
        "user_data:click_through_rate",
        "user_data:number_of_clicks",
        "user_data:average_page_duration",
    ],
    entity_rows=[{"user_id": 1}],
)
model_predictions = model_server.predict(features)
```
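
In practice, the `OnlineResponse` returned by `get_online_features` usually has to be converted into whatever input
format the model server expects. A minimal sketch, assuming a hypothetical `model_server` that accepts one row of
feature values per entity and the default (unprefixed) feature names:
```python
# Convert the Feast response into a plain dict of feature name -> list of values (one per entity row).
feature_dict = features.to_dict()

# Assemble a single input row; the feature ordering here is an illustrative assumption.
input_row = [
    feature_dict["click_through_rate"][0],
    feature_dict["number_of_clicks"][0],
    feature_dict["average_page_duration"][0],
]
model_predictions = model_server.predict([input_row])
```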

## 2. Precomputed (Batch) Model Predictions without Online Features
Machine learning teams typically find serving precomputed model predictions to be the most straightforward approach to
implement. It simply treats the model predictions as a feature and serves them from the feature store using the
standard Feast SDK.
```python
model_predictions = store.get_online_features(
    features=[
        "user_data:model_predictions",
    ],
    entity_rows=[{"user_id": 1}],
)
```
Notice that the model server is not involved in this approach. Instead, the model predictions are precomputed and
materialized to the online store.
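
How those predictions get precomputed is left to the team; a minimal sketch of the upstream batch job, assuming a
hypothetical `batch_model`, an input DataFrame `batch_scoring_df`, and the `user_data` feature view used above with a
`model_predictions` field:
```python
import pandas as pd

# Hypothetical nightly batch job: score every known user offline.
# `batch_scoring_df` is assumed to hold one row per user with the model's input columns.
predictions_df = pd.DataFrame({
    "user_id": batch_scoring_df["user_id"],
    "model_predictions": batch_model.predict(batch_scoring_df),
    "event_timestamp": pd.Timestamp.now(tz="UTC"),
})

# Push the precomputed predictions into the online store so they can be served
# with get_online_features, exactly as shown above.
store.write_to_online_store(feature_view_name="user_data", df=predictions_df)
```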

While this approach can deliver quick impact for a variety of business use cases, it suffers from stale data and can
only serve users/entities that were available at the time of the batch computation. In some cases, this tradeoff may be
tolerable.

## 3. Online Model Inference with Online Features and Cached Predictions
This is the most sophisticated approach: inference is optimized for low latency by caching predictions and running
model inference when data producers write features to the online store. It is particularly useful for applications
where features come from multiple data sources, the model is computationally expensive to run, or latency is a
significant constraint.

```python
import pandas as pd

# Client Reads
features = store.get_online_features(
    features=[
        "user_data:click_through_rate",
        "user_data:number_of_clicks",
        "user_data:average_page_duration",
        "user_data:model_predictions",
    ],
    entity_rows=[{"user_id": 1}],
)
# If no cached prediction exists for this entity, run inference and cache the result.
if features.to_dict()["model_predictions"][0] is None:
    model_predictions = model_server.predict(features)
    store.write_to_online_store(feature_view_name="user_data", df=pd.DataFrame(model_predictions))
```
Note that in this case a separate call to `write_to_online_store` is required whenever the underlying data changes and
the predictions change along with it.

```python
import pandas as pd

# Client Writes from the Data Producer
user_data = request.POST.get('user_data')
model_predictions = model_server.predict(user_data)  # assume this includes `user_data` in the DataFrame
store.write_to_online_store(feature_view_name="user_data", df=pd.DataFrame(model_predictions))
```
While this approach requires an additional write from every data producer, it results in the lowest latency for model
inference.

## 4. Online Model Inference without Features
This approach does not require Feast. The model server can directly serve predictions without any features. It is
common with Large Language Models (LLMs) and other models that do not require features to make predictions.

Note that generative models using Retrieval Augmented Generation (RAG) do require features, where the
[document embeddings](../../reference/alpha-vector-database.md) are treated as features, which Feast supports
(this would fall under "Online Model Inference with Online Features").
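
For the RAG case, a hedged sketch of retrieving document embeddings through Feast's alpha vector database support
(linked above). The feature view name, embedding model, and `build_prompt` helper are illustrative assumptions, and the
`retrieve_online_documents` API is alpha and may change between Feast versions:
```python
question = "How do I serve features online?"
query_embedding = embedding_model.encode(question)  # hypothetical embedding model

# Retrieve the closest document embeddings from the online (vector) store.
documents = store.retrieve_online_documents(
    feature="document_embeddings:embedding",
    query=list(query_embedding),
    top_k=3,
).to_dict()

# The retrieved documents are then injected into the LLM prompt as additional context.
response = llm.generate(build_prompt(question=question, context=documents))  # hypothetical helpers
```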