Author: Andriy Burkov
- Machine Learning Engineering
- Ch1. Introduction
- Ch2. Before the Project Starts
- Ch3. Data Collection and Preparation
- Ch4. Feature Engineering
- Ch5. Supervised Model Training (Part 1)
- Ch6. Supervised Model Training (Part 2)
- Ch7. Model Evaluation
- Ch8. Model Deployment
- Ch9. Model Serving, Monitoring, and Maintenance
- Ch10. Conclusion
When we deploy a model in production, we usually deploy an entire pipeline
Machine learning engineering (MLE):
- encompasses data collection, model training, making the model available for use
- includes any activity that lets ML algorithms be implemented as a part of an effective production system
ML Engineer:
- concerned with sourcing the data (from multiple locations), preprocessing it, programming features, training an effective model that will coexist in production with other processes
- keeps the model stable, maintainable, and easily accessible
- ML systems "fail silently" -> must be capable of preventing such failures, or know how to detect and handle them
When to use ML -> your problem:
- is too complex for coding
- is constantly changing
- is perceptive (image, text, etc.)
- is an unstudied phenomenon
- has a simple objective
- is cost-effective to solve with ML
When not to use ML:
- explainability is needed
- errors are intolerable
- traditional SWE is a less expensive option
- all inputs and outputs can be enumerated and saved in a DB
- data is hard to get or too expensive
Impact of ML is high when:
- ML can replace a complex part of your engineering project
- there's great benefit in getting inexpensive (but probably imperfect) predictions
Cost of an ML project is driven by three factors:
- difficulty of the problem
- cost of data
- need for accuracy
Progress in ML is nonlinear. Prediction error decreases fast in the beginning, but then gradually slows down
- Make sure the PO (or client) understands the constraints and risks
- Log every activity and track the time it took (helps with reporting and estimations of complexity in the future)
Why ML projects fail:
- lack of experienced talent
- lack of support by the leadership
- missing data infrastructure
- data labeling challenge
- siloed organizations and lack of collaboration
- technically infeasible projects
- lack of alignment between technical and business teams
Properties of a good data split:
- Data was randomized before the split
- Split was applied to raw data
- Validation and test sets follow the same distribution
- Leakage was avoided (see the split sketch below)
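A minimal sketch (my illustration, not the book's code) of a leakage-free split: shuffle, split the raw data first, then fit any preprocessing only on the training portion.

```python
# Sketch of a leakage-free split: shuffle, split the RAW data first,
# then fit preprocessing on the training portion only (illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 70/15/15 split on raw data, shuffled and stratified by the label
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, shuffle=True, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, shuffle=True, stratify=y_rest, random_state=42)

# Avoid leakage: the scaler sees only the training set and is merely
# applied to the validation and test sets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)
```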
Data sampling strategies:
- random sampling
- systematic sampling
- stratified sampling
- cluster sampling (all four sketched below)
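A rough sketch of the four sampling strategies on a toy index array (my own illustration; the strata/cluster ids are made up):

```python
# Illustrative sketch of the four sampling strategies on a toy
# population of row indices.
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(1000)                 # row indices of a dataset
groups = population % 5                      # hypothetical strata / cluster ids
n = 100

# random sampling: every row has the same chance of being picked
random_sample = rng.choice(population, size=n, replace=False)

# systematic sampling: every k-th element starting from a random offset
k = len(population) // n
start = rng.integers(k)
systematic_sample = population[start::k][:n]

# stratified sampling: sample from each stratum proportionally to its size
stratified_sample = np.concatenate([
    rng.choice(population[groups == g],
               size=int(n * (groups == g).mean()), replace=False)
    for g in np.unique(groups)])

# cluster sampling: pick whole clusters at random, keep all of their members
chosen_clusters = rng.choice(np.unique(groups), size=2, replace=False)
cluster_sample = population[np.isin(groups, chosen_clusters)]
```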
Data versioning is a critical element in supervised learning when the labeling is done by multiple labelers
Document for each dataset:
- what the data means
- how it was collected
- methods used to create it
- details of train-validation-test splits
- details of all pre-processing steps
- explanation of any data that were excluded
- format used to store it
- types of attributes/features
- number of examples
- possible values for labels / allowable range for a numerical target
Properties of good features:
- high predictive power
- can be computed fast
- reliable
- uncorrelated
The distribution of feature values in the training set should be similar to the distribution of values the production model will receive
Feature selection techniques:
- Cutting the long tail
- Boruta
- L1 regularization (see the sketch below)
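A minimal sketch of L1-regularization-based feature selection, assuming scikit-learn and a synthetic dataset (my example, not the book's):

```python
# Sketch of L1-based feature selection: the L1 penalty drives the
# weights of uninformative features to zero.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=0)

l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)

X_selected = selector.transform(X)
print("kept features:", np.flatnonzero(selector.get_support()))
```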
Feature engineering best practices:
- scale features
- store and document them in schema files or feature stores
- keep code, model and training data in sync (see the pipeline sketch below)
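One hedged way to follow these practices: bundle scaling and the model in a single scikit-learn Pipeline so the same feature code runs in training and serving, then persist it with a data-version tag (file name and version tag are hypothetical):

```python
# Sketch: scaling travels with the model inside one Pipeline, and the
# whole artifact is persisted together with its data version.
import numpy as np
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("scaler", StandardScaler()),        # feature scaling stays in sync with the model
    ("clf", LogisticRegression()),
])

# toy training data for illustration
X_train = np.random.rand(100, 4)
y_train = np.random.randint(0, 2, 100)
pipeline.fit(X_train, y_train)

# Persist the fitted pipeline plus the data version it was trained on
joblib.dump({"pipeline": pipeline, "data_version": "v1.3"}, "model_v1.3.joblib")
```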
"Feature extraction code is one of the most important parts of a machine learning system. It must be extensively and systematically tested"
Baseline: a model or algorithm that provides a reference point for comparison. Establish a baseline performance on your problem before starting work on a predictive model.
- simple learning algorithm or
- rule-based or heuristic algorithm (simple statistic)
- random prediction
- zero rule algorithm (e.g., always predict the most common class in the training set, or the average for regression) - see the sketch below
- human baseline: Amazon Mechanical Turk (MTurk) service -> web platform where people solve simple tasks for a reward
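A small sketch of a zero-rule baseline using scikit-learn's DummyClassifier on a toy dataset (my illustration of the idea above):

```python
# Zero-rule baseline: always predict the most frequent class, and
# require any candidate model to beat it.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
print("baseline accuracy:", accuracy_score(y_te, baseline.predict(X_te)))
# Any candidate model must beat this number to justify its complexity.
```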
If the dataset can't be fully loaded into RAM -> use incremental learning algorithms, which can improve the model by reading data gradually (e.g., Naive Bayes, neural networks); see the sketch below
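A minimal sketch of incremental learning with scikit-learn's partial_fit; the chunk generator is a stand-in for reading successive chunks from disk or a database:

```python
# Incremental (out-of-core) learning: the model is updated chunk by
# chunk instead of loading all data into RAM.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

classes = np.array([0, 1])               # must be declared up front
model = MultinomialNB()

def data_chunks(n_chunks=10, chunk_size=1000, n_features=20):
    # stand-in for reading successive chunks from disk or a database
    rng = np.random.default_rng(0)
    for _ in range(n_chunks):
        X = rng.integers(0, 5, size=(chunk_size, n_features))
        y = rng.choice(classes, size=chunk_size)
        yield X, y

for X_chunk, y_chunk in data_chunks():
    model.partial_fit(X_chunk, y_chunk, classes=classes)
```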
- Precision: ratio of true positive predictions to the overall number of positive PREDICTIONS
- Recall: ratio of true positive predictions to the overall number of positive EXAMPLES
- F-beta score combines them: F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall), for a positive real beta
- beta = 2 -> weighs recall twice as high as precision
- beta = 0.5 -> weighs recall half as much as precision
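A quick sketch computing these metrics with scikit-learn on made-up predictions:

```python
# Precision, recall, and F-beta on a toy set of predictions.
from sklearn.metrics import precision_score, recall_score, fbeta_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F2   (recall-heavy):   ", fbeta_score(y_true, y_pred, beta=2.0))
print("F0.5 (precision-heavy):", fbeta_score(y_true, y_pred, beta=0.5))
```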
By varying the complexity of the model, we can reach the so-called "zone of solutions", a situation in which both bias and variance of the model are relatively low. The solution that optimizes the performance metric is usually found inside that zone
Goals of model evaluation:
- estimate legal risks of putting the model in production
- understand the distribution of the data used to train the model
- evaluate the performance of the model prior to deployment
- monitor the performance of the deployed model
A/B testing:
- A: served the old model
- B: served the new model
- apply a statistical significance test to decide whether the new model is statistically different from the old model (see the sketch below)
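A minimal sketch of the significance-test step, assuming the tracked metric is a success rate (e.g., clicks). I use a two-proportion z-test here; the book does not prescribe this specific test, and the counts are hypothetical:

```python
# Two-proportion z-test comparing the success rates of groups A and B.
from math import sqrt
from scipy.stats import norm

successes_a, n_a = 530, 10_000     # group A: old model
successes_b, n_b = 590, 10_000     # group B: new model

p_a, p_b = successes_a / n_a, successes_b / n_b
p_pool = (successes_a + successes_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-sided test
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
# Only switch to the new model if the difference is statistically significant.
```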
Multi-armed bandits (MAB) for online model comparison:
- start by randomly exposing all models to the users
- gradually reduce the exposure of the least-performing models until only one (the best performing) gets served most of the time
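A toy epsilon-greedy bandit sketch (one possible MAB strategy, my choice; reward rates are simulated stand-ins for real user feedback):

```python
# Epsilon-greedy bandit: mostly route traffic to the best model so far,
# but keep exploring the others with probability epsilon.
import random

models = ["model_A", "model_B", "model_C"]
counts = {m: 0 for m in models}
rewards = {m: 0.0 for m in models}
true_rates = {"model_A": 0.05, "model_B": 0.07, "model_C": 0.04}  # hidden ground truth
epsilon = 0.1

def choose_model():
    # explore with probability epsilon, otherwise exploit the best model so far
    if random.random() < epsilon or all(c == 0 for c in counts.values()):
        return random.choice(models)
    return max(models, key=lambda m: rewards[m] / max(counts[m], 1))

for _ in range(10_000):
    m = choose_model()
    reward = 1.0 if random.random() < true_rates[m] else 0.0  # simulated feedback
    counts[m] += 1
    rewards[m] += reward

print({m: round(counts[m] / 10_000, 3) for m in models})  # share of traffic per model
```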
Bootstrapping:
- technique (statistical procedure) to compute a statistical interval for any metric
- consists of building N samples of a dataset
- then training a model
- and computing some statistic using each of those N samples
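A minimal bootstrapping sketch; for brevity the statistic is computed directly on each resample rather than retraining a model, and the scores are simulated:

```python
# Bootstrap a 95% confidence interval for a metric (here, the mean of
# toy scores). N and the metric are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=0.82, scale=0.05, size=200)   # e.g., per-run accuracies

N = 1000
boot_stats = []
for _ in range(N):
    sample = rng.choice(scores, size=len(scores), replace=True)  # sample with replacement
    boot_stats.append(sample.mean())                             # statistic of interest

low, high = np.percentile(boot_stats, [2.5, 97.5])
print(f"95% bootstrap interval for the mean: [{low:.3f}, {high:.3f}]")
```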
Where to deploy the model:
- statically:
  - installable binary of the entire software
  - positive: fast execution time for the user; no need to upload user data to a server (user privacy); can be called when the user is offline; keeping the model operational is the user's responsibility
  - negative: hard to upgrade the model without upgrading the whole app; may have messy computational requirements; difficult to monitor the model's performance
- dynamically on the user's device:
  - similar to static (user runs part of the system on their device), but the model is not part of the binary code of the app
  - positive: better separation of concerns (easier to update); fast for the user (cheaper for the org's servers)
  - negative: varies depending on the strategy; difficult to monitor the model's performance
- dynamically on a server:
  - place the model on servers and make it available as a REST API or gRPC service
- model streaming
Deployment strategies:
- single: simplest -> serialize the new model to a file, replace the old one
- silent: new and old versions run in parallel during the switch
- canary: pushes the new version to a small fraction of users, while keeping the old one running for most
- multi-armed bandit (MAB): a way to compare one or more versions of the model in the production environment and select the best-performing one
"The model must be applied to the end-to-end and confidence test data by simulating a regular call from the outside"
Algorithmic efficiency:
- an important consideration in model deployment
- you should only write your own code when it's absolutely necessary
- caching speeds up the application when it contains resource-consuming functions frequently called with the same parameter values (see the sketch below)
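A tiny caching sketch with functools.lru_cache; the expensive function is a stand-in for a real feature computation:

```python
# Cache an expensive, frequently repeated call.
from functools import lru_cache
import time

@lru_cache(maxsize=1024)
def embed_user(user_id: int) -> tuple:
    # stand-in for a resource-consuming feature computation
    time.sleep(0.5)
    return (user_id * 0.1, user_id * 0.2)

embed_user(42)   # slow: computed and cached
embed_user(42)   # fast: served from the cache
```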
Good model deployment:
- secure
- correct
- ensures ease of deployment and recovery
- provides guarantees of model validity
- avoids training/serving skew and hidden feedback loops
Model serving modes:
- batch: when applied to big data and some latency is tolerable
- on-demand: wrapped into a REST API (see the sketch below)
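A minimal sketch of on-demand serving behind a REST API, assuming Flask and the pipeline persisted earlier (file name and payload format are hypothetical):

```python
# Wrap a persisted model in a small REST API.
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model_v1.3.joblib")["pipeline"]   # pipeline persisted earlier

@app.route("/predict", methods=["POST"])
def predict():
    # expected payload, e.g. {"features": [[0.1, 0.2, 0.3, 0.4]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```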
Why a deployed model's performance can degrade:
- more training data made the model worse
- properties of the production data changed
- updated feature extraction code
- resource needed for feature changed/unavailable
- model is abused or under an adversarial attack
What to monitor:
- automatically compute the performance metrics -> send an alert if they change significantly
- distribution shift (see the sketch below)
- numerical instability
- decreasing computational performance
- logs
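One possible sketch of distribution-shift monitoring: compare a feature's training-time distribution against a recent production window with a two-sample Kolmogorov-Smirnov test (data is simulated; the book doesn't mandate this particular test):

```python
# Flag a possible distribution shift for one feature using a KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)   # feature values at training time
prod_feature = rng.normal(loc=0.3, scale=1.0, size=5000)    # recent production values (shifted)

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"ALERT: possible distribution shift (KS={stat:.3f}, p={p_value:.1e})")
```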
"Most ML models must be regularly or occasionally updated"
How often to retrain? Depends on:
- the error rate and how critical errors are
- whether the model is only useful if fresh
- how fast new training data becomes available
- the time it takes to retrain
- the cost to train / deploy the model
- the importance of the update for improving the metrics