This repository implements various machine learning algorithms for recommender systems with a unified pipeline. Any PRs are warmly welcomed!
Ensure that the latest Poetry version is installed.

```
$ poetry --version
Poetry (version 1.8.5)
```
Python 3.11 or higher is required.

```
$ python3 --version
Python 3.11.11
```
Create a virtual environment using Poetry.

```
$ poetry shell
```
Install the required packages from `poetry.lock`.

```
$ poetry install
```
You can select the model that you want to train and set an appropriate loss function.

Let's run singular value decomposition (SVD) on the MovieLens 1M dataset using our pipeline. The dataset will be downloaded to `recommender/.data/movielens`.

```
$ python3 scripts/download/movielens.py --package ml-1m
```
There are two main training scripts, depending on the model's input data type:

- torch based model: `recommender/train.py`
- csr based model: `recommender/train_csr.py` (a rough sketch of the csr input follows below)
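Torch-based models consume a PyTorch dataset, while csr-based models consume a sparse user-item matrix. A minimal sketch of building such a matrix with scipy is shown below; the column names and this exact construction are illustrative assumptions, not the repository's actual preprocessing code.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical ratings frame; the real loading/preprocessing lives under
# recommender/load_data and recommender/prepare_model_data.
ratings = pd.DataFrame({
    "user_id": [0, 0, 1, 2],
    "item_id": [10, 3, 10, 7],
    "rating": [4.0, 5.0, 3.0, 2.0],
})

# Map raw ids to contiguous indices, then build the sparse user-item matrix.
user_index = ratings["user_id"].astype("category").cat.codes
item_index = ratings["item_id"].astype("category").cat.codes
interactions = csr_matrix(
    (ratings["rating"], (user_index, item_index)),
    shape=(user_index.max() + 1, item_index.max() + 1),
)
print(interactions.shape, interactions.nnz)
```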
Because SVD is a torch-based model, run `recommender/train.py`.

```
$ python3 recommender/train.py \
    --dataset movielens \
    --model svd \
    --loss mse \
    --epochs 30 \
    --num_factors 16 \
    --train_ratio 0.8 \
    --random_state 42 \
    --result_path "./result/svd"
```
The results will be saved in `./result/svd`, where you can check figures, logs, and model weights.

```
$ ls
log.log    loss.png    map.png            metric.pkl           model.pt
ndcg.png   recall.png  training_loss.pkl  validation_loss.pkl  weight.pt
```
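The `.pkl` files can be inspected directly with Python. Here is a minimal sketch, assuming they are plain pickle files; the structure of the stored objects is not documented here, so inspect them yourself.

```python
import pickle

# Load the pickled training artifacts produced by the run above.
# The layout of the stored objects is an assumption; print them to inspect.
with open("./result/svd/training_loss.pkl", "rb") as f:
    training_loss = pickle.load(f)
with open("./result/svd/metric.pkl", "rb") as f:
    metrics = pickle.load(f)

print(type(training_loss), type(metrics))
```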
- Recommender systems span many algorithms, from item-based recommendation and matrix factorization to deep-learning based recommendation.
- To help understand them, this repository provides implementations of various recommender algorithms using PyTorch or custom learning methods.
- We offer not only implementation code but also pipelines for training recommender models on various implicit / explicit datasets (movielens, yelp, pinterest, etc.).
- By comparing metrics (nDCG, mAP, etc.) across datasets and algorithms, you can figure out which algorithm is suitable for a specific situation.
```mermaid
flowchart LR
    A[Download data] --> B[Load data]
    B --> C[Preprocess data]
    C --> D[Prepare model data]
    D --> E[Training]
    E --> F[Summarize results]
```
| Step | Code | Description |
|---|---|---|
| Download data | `scripts/download` | Download the selected dataset from a public URL |
| Load data | `recommender/load_data` | Load the downloaded dataset as pandas data types |
| Preprocess data | `recommender/preprocess` | Preprocess the dataset |
| Prepare model data | `recommender/prepare_model_data` | Convert the dataset into the form that will be fed to models |
| Training | `recommender/model` | Train various recommender algorithms |
| Summarize results | `recommender/libs/plot` | Make metric plots and loss curves |
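For orientation, the stages above chain together roughly as sketched below. The function names and signatures are illustrative assumptions only; the real implementations live in the directories listed in the table, and `recommender/train.py` is the real entry point.

```python
# Illustrative skeleton of the pipeline stages; names and signatures are assumed,
# not taken from the repository.

def load_data(dataset: str) -> dict:
    # recommender/load_data: read the downloaded files into pandas-friendly structures
    return {"dataset": dataset, "interactions": []}

def preprocess(raw: dict) -> dict:
    # recommender/preprocess: clean and filter interactions
    return raw

def prepare_model_data(processed: dict, model_name: str) -> dict:
    # recommender/prepare_model_data: build a torch dataset or csr matrix for the model
    return {"model_name": model_name, "model_input": processed}

def train(model_data: dict, loss: str, epochs: int) -> dict:
    # recommender/model: fit the selected algorithm with the selected loss
    return {"loss": loss, "epochs": epochs, "weights": None}

def summarize_results(trained: dict, result_path: str) -> None:
    # recommender/libs/plot: write metric plots and loss curves
    print(f"writing plots and logs to {result_path}")

raw = load_data("movielens")
model_data = prepare_model_data(preprocess(raw), model_name="svd")
summarize_results(train(model_data, loss="mse", epochs=30), result_path="./result/svd")
```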
| Models | Input data type | Path | Possible loss function |
|---|---|---|---|
| User based CF | csr matrix | `recommender/model/neighborhood/user_based.py` | NA |
| SVD | pytorch dataset | `recommender/model/mf/svd.py` | `MSE`, `BCE`, `BPR` |
| SVD with bias | pytorch dataset | `recommender/model/mf/svd_bias.py` | `MSE`, `BCE`, `BPR` |
| ALS | csr matrix | `recommender/model/mf/als.py` | `ALS` |
| GMF | pytorch dataset | `recommender/model/deep_learning/gmf.py` | `MSE`, `BCE`, `BPR` |
| MLP | pytorch dataset | `recommender/model/deep_learning/mlp.py` | `MSE`, `BCE`, `BPR` |
| TWO-TOWER | pytorch dataset | `recommender/model/deep_learning/two_tower.py` | `MSE`, `BCE`, `BPR` |
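As a concrete illustration of one of the torch-based models above, here is a minimal, generic GMF-style scoring module: the element-wise product of user and item embeddings followed by a linear output layer. This is a from-scratch sketch of the standard GMF idea, not the code in `recommender/model/deep_learning/gmf.py`.

```python
import torch
import torch.nn as nn

class GMFSketch(nn.Module):
    """Generic GMF: score(u, i) = w^T (e_u * e_i). Illustrative only."""

    def __init__(self, num_users: int, num_items: int, num_factors: int = 16):
        super().__init__()
        self.user_embedding = nn.Embedding(num_users, num_factors)
        self.item_embedding = nn.Embedding(num_items, num_factors)
        self.output = nn.Linear(num_factors, 1)

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        # Element-wise product of the embeddings, then a linear layer to a scalar score.
        interaction = self.user_embedding(user_ids) * self.item_embedding(item_ids)
        return self.output(interaction).squeeze(-1)

# Score a small batch of (user, item) pairs.
model = GMFSketch(num_users=100, num_items=200, num_factors=16)
scores = model(torch.tensor([0, 1, 2]), torch.tensor([5, 6, 7]))
print(scores.shape)  # torch.Size([3])
```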
Refer to the following parameter descriptions when running `recommender/train.py` or `recommender/train_csr.py`. You can check the parameters below in the training scripts.
Parameter explanations
| Parameter name | Explanation | Default |
|---|---|---|
| `dataset` | integrated dataset name | required |
| `model` | implemented model name | required |
| `loss` | implemented loss name | required |
| `implicit` | whether the dataset type is implicit or not | False |
| `num_neg` | number of negative samples | None |
| `neg_sample_strategy` | negative sampling strategy | None |
| `device` | device information, either cpu or cuda | cuda |
| `batch_size` | number of data points in one batch | 128 |
| `lr` | learning rate controlling the speed of gradient descent | 1e-2 |
| `regularization` | hyperparameter controlling the balance between the original loss and the penalty | 1e-4 |
| `epochs` | number of training epochs | 10 |
| `num_factors` | dimension of the user and item embeddings | 128 |
| `train_ratio` | ratio of the training dataset | 0.8 |
| `random_state` | random seed for reproducibility | 42 |
| `patience` | tolerance count when validation loss does not drop | 5 |
| `result_path` | absolute directory to store training results | required |
| `num_sim_user_top_N` | number of most similar users in the top N (used in user_based CF) | 45 |
| `is_test` | when set to true, use part of the dataset during training for a quick pytest | False |
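For instance, a BPR run with negative sampling might look roughly like the following. Note that the loss name `bpr`, the `--num_neg` value, and whether additional flags such as `--implicit` or `--neg_sample_strategy` are required are assumptions here; check the argparse definitions in `recommender/train.py` for the exact flag syntax.

```
$ python3 recommender/train.py \
    --dataset movielens \
    --model mlp \
    --loss bpr \
    --num_neg 4 \
    --epochs 10 \
    --batch_size 256 \
    --lr 1e-3 \
    --result_path "./result/mlp"
```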
For more details about how to download datasets locally, please refer to `scripts/download/README.md`.
| Name | Description | Related link |
|---|---|---|
| movielens 1m | movie rating dataset | url |
| movielens 10m | movie rating dataset | url |
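To download the 10M variant with the same script, the package flag can presumably be changed; `ml-10m` as the package name is an assumption based on the `ml-1m` example above, so verify it against `scripts/download/README.md`.

```
$ python3 scripts/download/movielens.py --package ml-10m
```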
You can choose which loss to use when training models. Here is a list of the losses that we currently support.
| Name | Type | Possible models |
|---|---|---|
| Mean Squared Loss | regression / explicit | `svd`, `svd_bias` |
| Binary Cross Entropy Loss | classification / implicit | `svd`, `svd_bias`, `gmf`, `mlp`, `two_tower` |
| Triplet Loss (BPR) | triplet / implicit | `svd`, `svd_bias`, `gmf`, `mlp`, `two_tower` |
| ALS Loss | implicit | `als` |
Be careful about which models each loss supports when choosing a loss function. For example, you cannot use the `MSE` loss function with the `gmf` model.

The loss function can affect model performance. Figure out which loss is best suited for your dataset!
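For reference, the triplet (BPR) loss in the table above is commonly implemented as the negative log-sigmoid of the score difference between a positive and a sampled negative item. A minimal generic PyTorch sketch (not the repository's exact implementation):

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    # BPR: -log(sigmoid(positive score - negative score)), averaged over the batch.
    return -F.logsigmoid(pos_scores - neg_scores).mean()

# Positive items should be scored higher than their sampled negatives.
pos = torch.tensor([2.0, 1.5, 0.3])
neg = torch.tensor([0.5, 1.0, 0.8])
print(bpr_loss(pos, neg))
```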
We use ruff lint for project code consistency. Run the following command to check whether the ruff lint check passes.

```
$ make lint
```
You should update the code according to ruff's guidance, otherwise the CI test won't pass.

Run the following command.

```
$ make test
```
Although any kind of PR is warmly welcomed, please follow these rules.
- After opening a PR, all the integration tests should pass.
- Basic tests should be added in the `tests/` directory.
- Depending on the input data type (pytorch dataset or csr matrix), you should integrate your implementation into `recommender/train.py` or `recommender/train_csr.py`.
- When adding new models, please include the following in the PR:
  - Experiment results, including the metric plot, metric values, and loss plot, after running `recommender/train.py` or `recommender/train_csr.py` with your arguments.
  - An example command to reproduce the model training result.
  - Full logs from executing the model training Python script.
- Although you can select which device (cpu or cuda) to use when training a model, please note that the code is not optimized for the cuda case, so be careful when training with cuda.