This project attempts to measure the effect of a hitter's sprint speed on Isolated Power (ISO). The repository includes notebooks covering the cleaning and exploratory analysis of Statcast data imported with pybaseball, the application of unsupervised machine learning techniques, light feature engineering, and several supervised machine learning algorithms that perform multi-class classification on an imbalanced dataset of singles, doubles, and triples. These methods include probability-calibrated Naive Bayes and Support Vector Classifiers, a TensorFlow neural network, and Weighted/Balanced Random Forests. All models were evaluated on several classification metrics, including precision, recall, ROC-AUC, and per-class recall. The functions for cleaning data, plotting model performance, and applying a model to individual predictions are in the 'Scripts' folder. All exploratory analyses are in the 'Analysis Notebooks' folder, and the notebooks that preprocess data and train/test the supervised classifiers are in the folders containing models. A more in-depth explanation of the code can be found on my Medium page.
Much of this project was inspired by Corbin Carroll, outfielder for the Arizona Diamondbacks. His ability to stretch hits that would be singles for nearly everyone else into doubles, and doubles into triples, made me wonder how his sprint speed affects his ISO, given how often he takes an extra base where other hitters could not.
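To make the link between extra bases and ISO concrete, here is a minimal sketch using the standard definition of ISO (slugging percentage minus batting average, which reduces to extra bases per at-bat). The function name and the season totals are hypothetical and are not taken from this repository's code or data.

```python
def isolated_power(doubles: int, triples: int, home_runs: int, at_bats: int) -> float:
    """ISO = SLG - AVG = (2B + 2*3B + 3*HR) / AB."""
    if at_bats <= 0:
        raise ValueError("at_bats must be positive")
    return (doubles + 2 * triples + 3 * home_runs) / at_bats


# Hypothetical season line: 30 doubles, 5 triples, 20 home runs in 550 at-bats.
baseline = isolated_power(doubles=30, triples=5, home_runs=20, at_bats=550)

# Same hitter, but five singles are stretched into doubles.
# Singles contribute nothing to ISO, so only the doubles count changes.
stretched = isolated_power(doubles=35, triples=5, home_runs=20, at_bats=550)

print(f"baseline ISO:  {baseline:.3f}")   # 0.182
print(f"stretched ISO: {stretched:.3f}")  # 0.191
```

Because singles do not appear in the numerator, every single that becomes a double adds exactly one extra base to ISO without changing at-bats.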
- Python
- Python Packages: pandas, numpy, datetime, sklearn, imblearn, TensorFlow, matplotlib, seaborn, pybaseball
Component | Description |
---|---|
Analysis Notebooks | Jupyter Notebooks containing exploratory and unsupervised ML analysis of the cleaned data. The unsupervised ML methods included are Principal Component Analysis (PCA) and Gaussian Mixture Models (GMM). |
First Pass Models | Jupyter Notebooks containing the first set of instantiated models, trained on the first set of features. Models Used: (Probability Calibrated) Naive Bayes, SVC, and AdaBoost, as well as a Keras Sequential Network and Bagged Decision Trees |
Scripts | Python scripts containing functions and code, imported into and run inside the Jupyter Notebooks to reduce clutter |
Second Pass Models | Jupyter Notebooks containing the second set of instantiated models, built after unsupervised ML was performed on the cleaned data with the first set of features. The models are the same as in the first pass, but all are trained on data with PCA and undersampling of non-minority classes applied. |
Third Pass Model | Contains Jupyter Notebooks with the new and final set of models trained on preprocessed data, plus predictions for a set of individual players, as well as a CSV of Statcast sprint speeds from 2015-2023. Models Used: (Class) Balanced Random Forest, Weighted Random Forest (Class Weight = Balanced) |
Visualizations | Includes relevant visuals from background research on probability calibration, generated figures from EDA, unsupervised ML, and visuals for model performance and predictions |
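As a rough illustration of two of the model families named in the table, the sketch below fits a probability-calibrated Naive Bayes classifier and a class-weighted Random Forest with scikit-learn. It is not the code from the notebooks: the data is synthetic (generated with `make_classification`), and the class proportions, feature counts, and hyperparameters are assumptions chosen only to mimic an imbalanced three-class problem.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for an imbalanced singles/doubles/triples dataset
# (assumed proportions, not the real Statcast class balance).
X, y = make_classification(
    n_samples=3000,
    n_features=8,
    n_informative=5,
    n_classes=3,
    weights=[0.78, 0.20, 0.02],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Naive Bayes wrapped in sigmoid (Platt) calibration, fit with 5-fold CV.
calibrated_nb = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated_nb.fit(X_train, y_train)
proba = calibrated_nb.predict_proba(X_test)

# Random Forest that reweights classes inversely to their frequency.
weighted_rf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
weighted_rf.fit(X_train, y_train)

# Recall reported separately for each class, as in the evaluation metrics.
per_class_recall = recall_score(y_test, weighted_rf.predict(X_test), average=None)
print("calibrated NB probabilities (first row):", proba[0])
print("weighted RF per-class recall:", per_class_recall)
```

Per-class recall is worth reporting alongside aggregate metrics here because a model can score well overall while rarely identifying the rarest class.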
- First Pass Performance: Visualizations of First Set of Models
  - conf_matx.png- Confusion Matrices for each model
  - metrics.png- Metrics DataFrame
  - proba_dist.png- Probability distribution of all classes for each model
  - roc_auc.png- ROC-AUC and Precision-Recall curves for each model
- Second Pass Performance: Visualizations of Second Set of Models
  - conf_matx.png- Confusion Matrices for each model
  - metrics.png- Metrics DataFrame
  - proba_dist.png- Probability distribution of all classes for each model
  - roc_auc.png- ROC-AUC and Precision-Recall curves for each model
- Third Pass Performance: Visualizations of Third Set of Models
  - Final Model
  - error.png- Scatterplot of 2022 players' true ISO vs. predicted ISO
  - residuals.png- Residual plot of predicted ISO values vs. true ISO minus predicted ISO
  - residuals_qualified.png- For all qualified 2022 players, plot of predicted ISO with sprint speed +/- 1.5 ft/sec minus predicted ISO at true sprint speed (maximum and minimum at +0.020 and roughly -0.035 points of predicted ISO)
  - Individual Residuals 2022: Residuals of predicted ISO with sprint speed +/- 1.5 ft/sec minus predicted ISO with no change to sprint speed
    - Players: Ronald Acuña Jr., Mookie Betts, Alex Bregman, Corbin Carroll, Starling Marte, Kyle Schwarber, Giancarlo Stanton, Bobby Witt Jr.
  - conf_matx.png- Confusion Matrices for each model
  - metrics.png- Metrics DataFrame
  - proba_dist.png- Probability distribution of all classes for each model
  - roc_auc.png- ROC-AUC and Precision-Recall curves for each model
  - speed_v_change.png- Scatterplot of 2022 qualified players: 2022 sprint speed vs. 2022 sprint speed minus the maximum of the player's individual sprint speeds from 2015-2023
  - Final Model
- Hit_Type_Bar.png- Bar plot of hit type distributions for singles, doubles, triples
- PCA_Imbalanced.png- 2D and 3D plots of cleaned data with PCA applied (2 and 3 components)
- PCA_Undersampled.png- 2D and 3D plots of cleaned data with PCA and undersampling of non-minority classes applied (2 and 3 components)
- calc_distance_error.png- Scatterplot of Statcast hit_distance_sc variable vs. personally calculated distance metric (see Exploratory notebook in Analysis Notebooks)
- fig8_Niculescu-Mizil_Caruana.png- Figure 8 from Niculescu-Mizil and Caruana
- reflection.gif- GIF reflecting Corbin Carroll's 2022 hits from Statcast hit coordinates so that the field is right-side up
- translation.gif- GIF translating Corbin Carroll's 2022 hits from Statcast hit coordinates (after reflection) to the origin
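
The reflection and translation shown in the two GIFs can be sketched as a pair of small coordinate transforms. This is an illustrative sketch, not the function from the 'Scripts' folder: the function names are hypothetical, and the home plate location of roughly (125.42, 198.27) in Statcast `hc_x`/`hc_y` units is a commonly used approximation that this repository may or may not use.

```python
# Assumed approximate home plate position in Statcast hit coordinates.
HOME_X = 125.42
HOME_Y = 198.27


def reflect(hc_x: float, hc_y: float) -> tuple[float, float]:
    """Flip the vertical axis so the outfield points up instead of down."""
    return hc_x, -hc_y


def translate(x: float, y: float) -> tuple[float, float]:
    """Shift the (already reflected) coordinates so home plate sits at the origin."""
    return x - HOME_X, y + HOME_Y


def to_field_coords(hc_x: float, hc_y: float) -> tuple[float, float]:
    """Apply the reflection first, then the translation."""
    return translate(*reflect(hc_x, hc_y))


# Home plate maps to the origin; a ball hit up the middle lands at positive y.
print(to_field_coords(HOME_X, HOME_Y))
print(to_field_coords(125.42, 100.0))
```

After both steps, batted balls in fair territory have positive y values, which makes it straightforward to compute distances and spray angles from the origin.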