DOC: update readme
loganthomas committed Jan 21, 2025
1 parent 2758568 commit 69b79e1
Showing 2 changed files with 34 additions and 32 deletions.
66 changes: 34 additions & 32 deletions README.md
@@ -1,29 +1,32 @@
# mlb-pitcher-xK

## Problem
The provided `k.csv` file in `data/` contains only eight columns:
## Overview
`mlb-pitcher-xK` is a data science and machine learning project
focused on predicting MLB pitchers' strikeout percentages (`K%`)
for the 2024 season. Using historical pitching data, feature engineering,
and statistical modeling, this project aims to derive insights into
pitcher performance trends while emphasizing reproducibility
and employing best practices in data science.

## Problem Statement
The provided dataset in `data/k.csv` contains only eight columns:
1. `MLBAMID`: player's MLB ID
1. `PlayerId`: player's FanGraphs ID
1. `Name`: player's name
1. `Team`: player's team name (**NOTE**: `" - - -"` if the player played on multiple teams in a season)
1. `Team`: player's team name (_NOTE_: `" - - -"` indicates the player played for multiple teams in a season)
1. `Age`: player's age in the 2024 season
1. `Season`: season year
1. `TBF`: Total batters faced for this player-season
1. `K%`: Strikeout percentage for this player-season

**Instructions:**
Predict each player's strikeout percentage in the 2024 season
given his total batters faced and strikeout percentage in prior seasons.
You are definitely not required to do so, but if you wish, you are welcome to research
and incorporate more data than what is provided in the `k.csv` file for your predictions.
However, **you may not include any data from Opening Day for the 2024 regular season onward**.
For example, you may include each player's fastball velocity in prior seasons if you can find that data
and think it will help your prediction, but you cannot include each player's 2024 fastball velocity.
Data in `k.csv` came from fangraphs.com. Your analysis should be completely reproducible.
1. `TBF`: Total batters faced for the player-season
1. `K%`: Strikeout percentage for the player-season

**Objective**: Predict each player’s `K%` for the 2024 season using historical `K%` and other derived features. The analysis excludes any data from Opening Day 2024 onward.

## Results
A linear model was fit using the provided `k.csv` data and supplemental data found at [Baseball Reference Pitcher Data](https://www.baseball-reference.com/leagues/majors/2024-pitches-pitching.shtml). The model used `7` features:
A linear regression model was developed using:
- Provided data (`k.csv`): historical `K%` and `TBF` values
- Supplemental data: statistics scraped from [Baseball Reference Pitcher Data](https://www.baseball-reference.com/leagues/majors/2024-pitches-pitching.shtml), including advanced metrics such as strike percentages and contact rates

### Key Features Used by the Model
- `I/Str`: ball in play percentage (balls put into play including hr / total strikes)
- `Pit/PA`: pitches per plate appearance
- `Con`: contact percentage ((foul + inplay strikes) / (inplay + foul + swinging strikes))
@@ -32,7 +35,6 @@ A linear model was fit using the provided `k.csv` data and supplemental data fou
- `F/Str`: foul ball strike percentage (pitches fouled off / total strikes seen)
- `Str%`: strike percentage (strikes / total pitches; intentional balls included)
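
The rate definitions listed above can be sketched in code. This is a minimal illustration, not the code in `bullpen`: the `PitchCounts` container, its field names, and the `rate_features` helper are hypothetical, and the sketch assumes raw season counts are available and that total strikes are supplied directly. It returns fractions rather than percentages and covers only the features whose formulas appear in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PitchCounts:
    """Raw season pitch counts for one pitcher (hypothetical container)."""

    pitches: int            # total pitches thrown
    plate_appearances: int  # plate appearances against
    strikes: int            # total strikes of every kind
    swinging: int           # swinging strikes (no contact)
    fouls: int              # pitches fouled off
    in_play: int            # balls put into play, including home runs


def rate_features(c: PitchCounts) -> dict:
    """Compute the rate features from raw counts, as fractions in [0, 1]
    (except Pit/PA, which is a plain ratio)."""
    contact = c.fouls + c.in_play  # swings that made contact
    return {
        "I/Str": c.in_play / c.strikes,
        "Pit/PA": c.pitches / c.plate_appearances,
        "Con": contact / (contact + c.swinging),
        "F/Str": c.fouls / c.strikes,
        "Str%": c.strikes / c.pitches,
    }


# Made-up counts, used only to exercise the formulas.
example = PitchCounts(
    pitches=3000,
    plate_appearances=800,
    strikes=1920,
    swinging=330,
    fouls=540,
    in_play=510,
)
features = rate_features(example)
```

Keeping the counts in one immutable record makes it harder to mix denominators: `I/Str` and `F/Str` divide by total strikes, `Con` divides by swings only, and `Str%` divides by total pitches.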

The coefficients (weights) of the model are below:
| feature | coef |
|:----------|-------------:|
| I/Str | -0.0528688 |
@@ -43,11 +45,14 @@ The coefficients (weights) of the model are below:
| F/Str | -0.00169988 |
| Str% | -0.000350969 |

The model performed well when comparing the actual strike out percentage (`K%`) to the expected strike out percentage (`xK%`):
### Model Performance
The model effectively predicts `xK%` (expected strikeout percentage), as demonstrated by the correlation between predicted and actual `K%`:

![image](assets/images/linear-pred-vs-target.png)

### For an interactive plot, see [assets/images/linear-pred-vs-target.html](assets/images/linear-pred-vs-target.html)
### For an interactive visualization, see [assets/images/linear-pred-vs-target.html](assets/images/linear-pred-vs-target.html):

![image](assets/images/linear-pred-vs-target-html-static.png)

A few cool plots based on the predictions:

@@ -63,15 +68,16 @@ A few cool plots based on the predictions:
- [Baseball Reference Pitcher Data](https://www.baseball-reference.com/leagues/majors/2014-pitches-pitching.shtml)

## Development Process
Development was performed in Jupyter notebooks (see the [notebooks/](./notebooks) directory).
An accompanying package, `bullpen`, was created to take the final state of code from the notebooks
and convert it to source code (see the [src/bullpen/](./src/bullpen/) directory).
All analysis and modeling were conducted in Jupyter notebooks (see the [notebooks/](./notebooks) directory). The final code was refactored into a Python package, `bullpen`, for modularity and reproducibility (see [src/bullpen/](./src/bullpen/)). Key development steps include:
1. **Data Scraping**: Extracted supplemental data from Baseball Reference using `bullpen.data_utils`.
2. **Data Cleaning & Integration**: Processed and merged supplemental data with `k.csv`.
3. **Feature Engineering**: Created data processing pipelines for scaling and one-hot encoding features using `bullpen.model_utils`.
4. **Modeling**: Trained and validated models using:
   - Classic cross-validation (utilizing `sklearn.model_selection.GridSearchCV`)
   - Time-series cross-validation (implemented as a custom time-split training loop)
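
The time-series cross-validation idea can be sketched as an expanding-window split by season. This is an assumption about how such a loop might look, not the implementation in the notebooks: `season_splits` and its `min_train` parameter are hypothetical names. The key property is that every validation season comes strictly after all of its training seasons, so no future data leaks into training.

```python
from typing import Iterator, List, Sequence, Tuple


def season_splits(
    seasons: Sequence[int], min_train: int = 2
) -> Iterator[Tuple[List[int], int]]:
    """Yield (training seasons, validation season) pairs in chronological order.

    Each fold trains on all seasons before the validation season, starting
    once at least `min_train` seasons are available for training.
    """
    ordered = sorted(set(seasons))  # unique seasons, oldest first
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


# Season labels as they might appear in a `Season` column (order and
# duplicates do not matter).
folds = list(season_splits([2021, 2022, 2023, 2021, 2020]))
for train_seasons, valid_season in folds:
    print(f"train on {train_seasons}, validate on {valid_season}")
```

In a training loop, each pair would be used to select the rows whose `Season` falls in the training list, fit the model on them, and score it on the rows from the validation season.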

> [!IMPORTANT]
> These are likely the files you want to look at to familiarize yourself with the analysis.
> Sometimes the interactive plots don't render on GitHub.
> If that is the case, copy and paste the notebook url into [nbviewer](https://nbviewer.org/) to see the full notebook in all its glory.
- [00-data-scrape-example.ipynb](./notebooks/00-data-scrape-example.ipynb)
- [01a-data-processing-fixing-names.ipynb](./notebooks/01a-data-processing-fixing-names.ipynb)
- [01b-data-processing-merging.ipynb](./notebooks/01b-data-processing-merging.ipynb)
@@ -81,6 +87,9 @@ and convert it to source code (see the [src/bullpen/](./src/bullpen/) directory)
- [04b-modeling-time-series-cv.ipynb](./notebooks/04b-modeling-time-series-cv.ipynb)
- [05-final-predictions.ipynb](./notebooks/05-final-predictions.ipynb)

> [!TIP]
> Sometimes the interactive plots don't render on GitHub.
> If that is the case, use [nbviewer](https://nbviewer.org/) for an enhanced notebook viewing experience.
## Scraping Supplementary Pitching Data from Baseball Reference

@@ -276,13 +285,6 @@ $ tree
│   └── The Definitive Pitcher Expected K% Formula _ RotoGraphs Fantasy Baseball.pdf
├── assets
│   └── images
│   ├── gray-pred.png
│   ├── grid_search_cross_validation.png
│   ├── linear-pred-vs-target.html
│   ├── linear-pred-vs-target.png
│   ├── musgrove-pred.png
│   ├── time-series-cv.png
│   └── wainwright-pred.png
├── data
│   ├── k.csv
│   ├── player_ids.json