# bgg_models

`bgg_models` is a project for training predictive models on data from BoardGameGeek (BGG). These models predict how the BGG community as a whole will evaluate games: average rating, number of user ratings, and complexity. Models are trained on game and collection data pulled via BoardGameGeek's API and stored in Google Cloud Storage.
```mermaid
graph LR
    %% Data Sources and Storage
    BGG[BoardGameGeek API] -->|Data Collection| GCS[(Google Cloud Storage)]
    GCS -->|Data Loading| PrepData[Preprocessed Games Data]

    %% Data Splitting
    PrepData -->|Time-based Split| DataSets[Training/Validation/Testing Sets]

    %% Main Model Components
    DataSets -->|Training| Models[Model Training]

    %% Model Types
    Models --> HurdleModel[Hurdle Model<br>Classification]
    Models --> WeightModel[Complexity Model<br>Regression]
    Models --> RatingModels[Rating Models<br>Regression]

    %% Evaluation and Deployment
    HurdleModel --> Evaluation[Model Evaluation]
    WeightModel --> Evaluation
    RatingModels --> Evaluation
    Evaluation --> Tracking[Performance Tracking]
    Evaluation --> Deployment[Model Deployment]

    %% Prediction
    Deployment --> VetiverModels[(Deployed Models<br>in GCS)]
    NewGames[New Games] --> Predictions[Predictions]
    VetiverModels --> Predictions

    %% Styling
    classDef storage fill:#f9f,stroke:#333,stroke-width:2px;
    classDef model fill:#bbf,stroke:#333,stroke-width:1px;
    classDef data fill:#dfd,stroke:#333,stroke-width:1px;
    class GCS,VetiverModels storage;
    class HurdleModel,WeightModel,RatingModels,Models model;
    class PrepData,DataSets,NewGames,Predictions data;
```
- **Data Collection and Storage**:
  - Game data is pulled from BoardGameGeek's API using the `bggUtils` package
  - Data is stored in Google Cloud Storage buckets configured in `config.yml`
  - Authentication to GCS is handled via the `authenticate_to_gcs()` function (a sketch follows this list)
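For orientation, here is a minimal sketch of what the configuration and authentication step could look like, assuming the `{config}` and `{googleCloudStorageR}` packages and a `config.yml` holding a bucket name and a service-account key path; the project's actual `authenticate_to_gcs()` may differ.

```r
library(googleCloudStorageR)

authenticate_to_gcs <- function(config_file = "config.yml") {
  cfg <- config::get(file = config_file)
  # assumed config.yml fields: service_account_key, bucket
  gcs_auth(json_file = cfg$service_account_key)  # service-account JSON key
  gcs_global_bucket(cfg$bucket)                  # default bucket for reads/writes
  invisible(cfg)
}
```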
- **Data Preprocessing**:
  - Raw game data is processed using the `prepare_games()` function
  - Processing includes filtering publishers, handling families, and removing games with missing publication years
  - The `add_hurdle()` function creates a binary target variable for classification (sketched after this list)
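The hurdle target is a simple thresholding step. A hypothetical `add_hurdle()`, assuming the raw data carries a `usersrated` column:

```r
library(dplyr)

# Flag games that cleared the minimum-ratings threshold (25 in this pipeline)
add_hurdle <- function(games, threshold = 25) {
  games |>
    mutate(
      hurdle = factor(usersrated >= threshold,
                      levels = c(TRUE, FALSE),
                      labels = c("yes", "no"))
    )
}
```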
- **Data Splitting Strategy**:
  - Time-based split using the `create_year_split()` function (sketched after this list)
  - Training: games published up to `end_train_year` (set to 2021 in the pipeline)
  - Validation: games published in the next `valid_years` (set to 2) after `end_train_year`
  - Testing: games published after `end_train_year + valid_years`
  - This approach simulates the real-world scenario of predicting ratings for future games
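A sketch of the time-based split, assuming a `yearpublished` column; the real `create_year_split()` may return an `rsample`-style split object rather than plain data frames.

```r
library(dplyr)

create_year_split <- function(games, end_train_year = 2021, valid_years = 2) {
  list(
    train = filter(games, yearpublished <= end_train_year),
    valid = filter(games, yearpublished > end_train_year,
                          yearpublished <= end_train_year + valid_years),
    test  = filter(games, yearpublished > end_train_year + valid_years)
  )
}
```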
- **Multi-Stage Modeling Approach** (illustrative model specifications follow this list):
  - Hurdle Model: a classification model using LightGBM that predicts whether a game will reach a minimum threshold of user ratings (25 ratings)
  - Complexity Model (averageweight): a regression model using LightGBM that predicts game complexity on a scale of 1-5
  - Average Rating Model: a regression model using glmnet (elastic net) that predicts the average user rating
  - Users Rated Model: a regression model using glmnet that predicts the number of user ratings a game will receive
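These model types map naturally onto `{parsnip}` specifications, with the LightGBM engine supplied by `{bonsai}`. The specs below are illustrative, not the project's exact definitions.

```r
library(parsnip)
library(bonsai)  # registers the "lightgbm" engine for boost_tree()
library(tune)    # provides tune() placeholders

# Hurdle model: LightGBM classifier
hurdle_spec <- boost_tree(tree_depth = tune(), learn_rate = tune(), min_n = tune()) |>
  set_engine("lightgbm") |>
  set_mode("classification")

# Complexity (averageweight) model: LightGBM regression
weight_spec <- boost_tree(tree_depth = tune(), learn_rate = tune(), min_n = tune()) |>
  set_engine("lightgbm") |>
  set_mode("regression")

# Average rating / usersrated models: glmnet (elastic net) regression
rating_spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")
```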
- **Model Dependencies and Flow**:
  - The averageweight model is trained first and used to impute missing complexity values via `impute_averageweight()`
  - The imputed data is then used to train the average and usersrated models
  - The hurdle model acts as a filter, predicting which games will receive enough ratings
  - The average and usersrated models are combined to calculate the bayesaverage (BGG's weighted rating) via `calculate_bayesaverage()` (sketched after this list)
  - This multi-model approach handles the complex relationships between game attributes and community reception
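A Bayesian average shrinks a game's predicted rating toward a prior mean, weighted by the predicted number of ratings. BGG does not publish its exact prior, so the values below are illustrative assumptions, not the project's actual `calculate_bayesaverage()`.

```r
# prior_mean and prior_n are assumed values for illustration only
calculate_bayesaverage <- function(average, usersrated,
                                   prior_mean = 5.5, prior_n = 1000) {
  (prior_n * prior_mean + usersrated * average) / (prior_n + usersrated)
}

calculate_bayesaverage(average = 7.4, usersrated = 250)
#> [1] 5.88   (with the assumed prior)
```

With few ratings the result stays near the prior mean; as `usersrated` grows, it converges to the game's own average.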
- **Feature Engineering**:
  - Recipes for the different model types are defined in `recipe_linear()`, `recipe_trees()`, and `recipe_hurdle()` (a sketch follows this list)
  - Features include game attributes (players, playtime, age), categories, mechanics, publishers, designers, etc.
  - Text features from game descriptions are processed (word count, publisher-description detection)
  - Spline transformations are applied to capture non-linear relationships
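As an example of the recipe style, here is a hedged sketch of what `recipe_linear()` might contain; the outcome and column names (`minplaytime`, `maxplaytime`, `minage`) are stand-ins for the project's real features.

```r
library(recipes)

recipe_linear <- function(data) {
  recipe(average ~ ., data = data) |>
    step_dummy(all_nominal_predictors()) |>                    # categories, mechanics, ...
    step_ns(minplaytime, maxplaytime, minage, deg_free = 5) |> # spline transformations
    step_zv(all_predictors()) |>                               # drop zero-variance columns
    step_normalize(all_numeric_predictors())                   # needed for glmnet
}
```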
- **Model Tuning and Hyperparameters** (an illustrative tuning setup follows this list):
  - Models are tuned using grid search with cross-validation
  - LightGBM models tune tree depth, learning rate, and minimum node size
  - glmnet models tune the regularization penalty and mixture (elastic net ratio)
  - Best models are selected based on performance metrics (RMSE, ROC AUC, etc.)
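An illustrative grid search with `{tune}`; `train_data`, `rating_recipe`, and `rating_spec` are stand-ins for the project's actual objects.

```r
library(tidymodels)

folds <- vfold_cv(train_data, v = 5)  # cross-validation resamples

tuned <- workflow() |>
  add_recipe(rating_recipe) |>
  add_model(rating_spec) |>
  tune_grid(resamples = folds, grid = 20,
            metrics = metric_set(rmse, mae))

best_params <- select_best(tuned, metric = "rmse")
```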
- **Model Assessment**:
  - Models are evaluated on validation data using functions in `assess.R`
  - Classification metrics include F1, F2, precision, recall, and balanced accuracy (see the metric sets below)
  - Regression metrics include RMSE, MAE, MAPE, R², and the concordance correlation coefficient
  - Results are visualized using functions in `visualizations/models.R`
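The metric sets described above can be expressed directly with `{yardstick}`; F2 needs `metric_tweak()` because `f_meas()` defaults to F1.

```r
library(yardstick)

f2 <- metric_tweak("f2", f_meas, beta = 2)  # F-measure weighted toward recall

class_metrics <- metric_set(f_meas, f2, precision, recall, bal_accuracy)
reg_metrics   <- metric_set(rmse, mae, mape, rsq, ccc)
```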
- **Performance Tracking**:
  - Model performance is tracked over time, with metrics stored in CSV files
  - The `write_tracking()` function in `tracking.R` saves metrics with timestamps and user information (sketched after this list)
  - This enables monitoring of model drift and performance improvements
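A minimal sketch of CSV-based tracking; the real `write_tracking()` in `tracking.R` may record additional fields.

```r
library(dplyr)
library(readr)

write_tracking <- function(metrics, path = "tracking/metrics.csv") {
  metrics |>
    mutate(timestamp = Sys.time(),
           user = Sys.info()[["user"]]) |>
    write_csv(path, append = file.exists(path))  # append after the first run
}
```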
- **Model Deployment** (a deployment sketch follows this list):
  - Final models are trained on the combined training and validation data
  - Models are deployed using the `vetiver` package via `pin_outcome_model()`
  - Models are stored in Google Cloud Storage with metadata, including performance metrics
  - The `targets` package manages the entire pipeline, ensuring reproducibility
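Deployment with `{vetiver}` and `{pins}` could look roughly like this; the bucket name, `final_fit`, and `validation_metrics` are assumptions, and `pin_outcome_model()` presumably wraps similar calls.

```r
library(vetiver)
library(pins)

board <- board_gcs(bucket = "bgg-models")  # bucket name is an assumption

v <- vetiver_model(
  final_fit,                    # a fitted workflow
  model_name = "bgg_average",   # hypothetical pin name
  metadata = list(metrics = validation_metrics)
)
vetiver_pin_write(board, v)
```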
- **Making Predictions**:
  - New games can be processed through the same preprocessing pipeline
  - The `predict_outcomes()` function applies all models in sequence (sketched after this list):
    1. Predict complexity with the averageweight model
    2. Apply the hurdle model to determine whether the game will get enough ratings
    3. Predict the average rating and number of user ratings
    4. Calculate the expected bayesaverage (weighted rating)
  - This provides a comprehensive prediction of how the BGG community will receive a new game
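Putting the stages together, the prediction sequence might read like the sketch below; `add_hurdle_prediction()` and `add_rating_predictions()` are hypothetical helpers used to keep the example short.

```r
library(dplyr)

predict_outcomes <- function(new_games) {
  new_games |>
    impute_averageweight(weight_model) |>        # 1. predict complexity
    add_hurdle_prediction(hurdle_model) |>       # 2. enough ratings to clear the hurdle?
    add_rating_predictions(average_model,        # 3. average rating + usersrated
                           usersrated_model) |>
    mutate(bayesaverage = calculate_bayesaverage(  # 4. expected weighted rating
      average = .pred_average, usersrated = .pred_usersrated
    ))
}
```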
- **Visualization and Reporting**:
  - Results are visualized using custom plotting functions
  - Quarto is used for generating reports and documentation
  - The project includes functions for creating interactive tables of predictions