Added LightGBM model and documentation of the investigated TabNet architecture #220

Merged
merged 7 commits on Jan 29, 2024
Changes from 6 commits
68 changes: 42 additions & 26 deletions Documentation/Classifier-Comparison.md
@@ -44,7 +44,7 @@ Fully Connected Neural Networks (FCNN) achieved overall lower performance than t
### Fully Connected Neural Networks Regression Model

In the scientific paper "Inter-species cell detection -
datasets on pulmonary hemosiderophages in equine, human and feline specimens", Marzahl et al. proposed using a regression model for a classification task. The idea is to train the regression model on the class values; the model then predicts a continuous value and learns the ordinal relation between the classes. The output is mapped to the classes XS, S, M, L and XL via the thresholds 0-0.49, 0.5-1.49, 1.5-2.49, 2.5-3.49 and 3.5-4.5 respectively. This yielded better performance than the FCNN classifier but was still worse than that of the Random Forest.
datasets on pulmonary hemosiderophages in equine, human and feline specimens", Marzahl et al. (https://www.nature.com/articles/s41597-022-01389-0) proposed using a regression model for a classification task. The idea is to train the regression model on the class values; the model then predicts a continuous value and learns the ordinal relation between the classes. The output is mapped to the classes XS, S, M, L and XL via the thresholds 0-0.49, 0.5-1.49, 1.5-2.49, 2.5-3.49 and 3.5-4.5 respectively. This yielded better performance than the FCNN classifier but was still worse than that of the Random Forest.
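
A minimal sketch of the thresholding step described above, assuming regression outputs in the 0-4.5 range; the function and variable names are illustrative, not from the PR:

```python
import numpy as np

# Class boundaries from the text: 0-0.49 -> XS, 0.5-1.49 -> S, 1.5-2.49 -> M,
# 2.5-3.49 -> L, 3.5-4.5 -> XL
BINS = [0.5, 1.5, 2.5, 3.5]  # upper edges between neighboring classes
LABELS = ["XS", "S", "M", "L", "XL"]


def regression_to_class(y_pred: np.ndarray) -> list[str]:
    """Map continuous regression outputs onto the ordinal size classes."""
    indices = np.digitize(y_pred, BINS)  # 0..4 depending on the interval y falls into
    return [LABELS[i] for i in indices]


# example: predictions 0.3, 1.7 and 4.2 map to XS, M and XL
print(regression_to_class(np.array([0.3, 1.7, 4.2])))
```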

### QDA & Ridge Classifier

@@ -54,26 +54,41 @@ classes had F1-scores of ~0.00-0.15. For this reason we are not considering these
in future experiments. This resulted in an overall F1-score of ~0.11, which is significantly
outperformed by the other tested models.

### TabNet Architecture

TabNet, short for "Tabular Neural Network", is a neural network architecture specifically designed for tabular data, as commonly encountered in structured sources such as databases and CSV files. It was introduced in the paper "TabNet: Attentive Interpretable Tabular Learning" by Arik et al. (https://arxiv.org/abs/1908.07442). TabNet uses sequential attention to choose which features to reason from at each decision step, enabling interpretability and more efficient learning, since the learning capacity is focused on the most salient features. Unfortunately, similarly to our proposed 4-layer network, TabNet only learned the features of the XS class, reaching an XS F1-score of 0.84 while the F1-scores of all other classes were zero. The underlying data does not seem to respond well to neural-network-based approaches.
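
The PR only documents the TabNet investigation; the training code is not part of this diff. The snippet below is a minimal sketch of how such an experiment could look with the pytorch-tabnet package, using synthetic stand-in data as an assumption rather than the actual experiment:

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier

# synthetic stand-in for the 132-dimensional lead features and the 5 size classes
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 132)).astype(np.float32)
y = rng.integers(0, 5, size=1000)
X_train, X_valid, y_train, y_valid = X[:800], X[800:], y[:800], y[800:]

# sequential-attention model; hyperparameters here are illustrative defaults
clf = TabNetClassifier(n_d=8, n_a=8, n_steps=3, seed=42)
clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric=["accuracy"],
    max_epochs=20,
    patience=5,
    batch_size=256,
)
print(clf.predict(X_valid)[:10])
```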

## Well performing models

### Random Forest Classifier
In this sub-section we discuss the results of the well-performing models, which are XGBoost, LightGBM, K-Nearest Neighbors (KNN), Random Forest, AdaBoost and Naive Bayes.

### Feature subsets

Random Forest Classifier with 100 estimators has been able to achieve an overall F1-score of 0.62 and scores of 0.81, 0.13, 0.09, 0.08 and 0.15 for classes XS, S, M, L and XL respectively.
We have collected many features (~54 data points) for the leads; additionally, one-hot encoding the categorical variables
results in a high-dimensional feature space (132 features). Not all features might be equally relevant for our classification task,
so we want to try different subsets (a selection sketch follows the list below).

The following subsets are available:

1. `google_places_rating`, `google_places_user_ratings_total`, `google_places_confidence`, `regional_atlas_regional_score`
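
The PR implements this in `EstimatedValuePredictor` by indexing the feature DataFrame with the selected column names; below is a minimal standalone sketch of that selection step (the DataFrame name `leads_df` is illustrative, not from the PR):

```python
import pandas as pd

# Subset 1 from the list above
SUBSET_1 = [
    "google_places_rating",
    "google_places_user_ratings_total",
    "google_places_confidence",
    "regional_atlas_regional_score",
]


def select_features(df: pd.DataFrame, selected: list | None) -> pd.DataFrame:
    """Return all feature columns if `selected` is None, otherwise only the chosen subset."""
    return df if selected is None else df[selected]


# usage with an illustrative DataFrame `leads_df` holding the one-hot encoded features:
# X = select_features(leads_df.drop("MerchantSizeByDPV", axis=1), SUBSET_1).to_numpy()
```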

### Overall Results

Note:
The Random Forest Classifier used 100 estimators.
The KNN classifier used a distance based weighting for the evaluated neighbors and considered 10 neighbors in the 5-class split and 19 neighbors for the 3-class split.
The XGBoost was trained for 10000 rounds.
**_Notes:_**

- The Random Forest Classifier used 100 estimators.
- The AdaBoost Classifier used 100 DecisionTree classifiers.
- The KNN classifier used distance-based weighting for the evaluated neighbors and considered 10 neighbors in the 5-class split and 19 neighbors in the 3-class split.
- The XGBoost model was trained for 10,000 rounds.
- The LightGBM model was trained with 2,000 leaves (see the configuration sketch below).
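
For reference, a minimal sketch of how these configurations could be instantiated. The Random Forest, AdaBoost, KNN and LightGBM settings mirror the notes above; the XGBoost line uses the sklearn wrapper as a rough stand-in for "10,000 rounds", which is an assumption since the XGBoost training code is not part of this diff:

```python
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Forest: 100 estimators
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# AdaBoost: 100 decision-tree base estimators
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42
)

# KNN: distance-based weighting, 10 neighbors (5-class split); use 19 for the 3-class split
knn = KNeighborsClassifier(n_neighbors=10, weights="distance")

# LightGBM: 2,000 leaves, multiclass objective (mirroring params_lgb in predictors.py)
lgbm = lgb.LGBMClassifier(num_leaves=2000, objective="multiclass", learning_rate=0.05)

# XGBoost: roughly 10,000 boosting rounds via the sklearn wrapper (assumed equivalent)
xgb_clf = xgb.XGBClassifier(n_estimators=10000)
```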

In the following table we can see the model's overall weighted F1-score on the 3-class and
5-class data set split.
5-class data set split. The best-performing classifier per row is marked in **bold**.

| | KNN | Naive Bayes | Random Forest | XGBoost |
| ------- | ------ | ----------- | ------------- | ------- |
| 5-Class | 0.6314 | 0.6073 | 0.6150 | 0.6442 |
| 3-Class | 0.6725 | 0.6655 | 0.6642 | 0.6967 |
| | KNN | Naive Bayes | Random Forest | XGBoost | AdaBoost | AdaBoost(subset=1) | LightGBM |
| ------- | ------ | ----------- | ------------- | ---------- | -------- | ------------------ | -------- |
| 5-Class | 0.6314 | 0.6073 | 0.6150 | **0.6442** | 0.6098 | 0.6090 | 0.6405 |
| 3-Class | 0.6725 | 0.6655 | 0.6642 | **0.6967** | 0.6523 | 0.6591 | 0.6956 |

We can see that all classifiers perform better on the 3-class data set split and that the XGBoost classifier is the best performing for both data set splits.

@@ -83,24 +98,25 @@ We can see that all classifiers perform better on the 3-class data set split and

In the following table we can see the F1-score of each model for each class in the 5-class split:

| Class | KNN | Naive Bayes | Random Forest | XGBoost |
| ----- | ---- | ----------- | ------------- | ------- |
| XS | 0.82 | 0.83 | 0.81 | 0.84 |
| S | 0.15 | 0.02 | 0.13 | 0.13 |
| M | 0.08 | 0.02 | 0.09 | 0.08 |
| L | 0.06 | 0.00 | 0.08 | 0.06 |
| XL | 0.18 | 0.10 | 0.15 | 0.16 |
| Class | KNN | Naive Bayes | Random Forest | XGBoost | AdaBoost | AdaBoost(subset=1) | LightGBM |
| ----- | ---- | ----------- | ------------- | -------- | -------- | ------------------ | -------- |
| XS | 0.82 | 0.83 | 0.81 | **0.84** | 0.77 | 0.78 | 0.83 |
| S | 0.15 | 0.02 | 0.13 | 0.13 | **0.22** | 0.19 | 0.14 |
| M | 0.08 | 0.02 | 0.09 | 0.08 | **0.14** | 0.09 | 0.09 |
| L | 0.06 | 0.00 | **0.08** | 0.06 | 0.07 | 0.07 | 0.05 |
| XL | 0.18 | 0.10 | 0.15 | 0.16 | 0.17 | 0.14 | **0.21** |

For every model we can see that the predictions on the XS class are significantly better than every other class. TFor the KNN, Random Forest, and XGBoost all perform similar, having second best classes S and XL and worst classes M and L. The Naive Bayes classifier performs significantly worse on the S, M, and L classes and has second best class XL.
For every model we can see that the predictions on the XS class are significantly better than on every other class. The KNN, Random Forest, and XGBoost classifiers all perform similarly, with S and XL as their second-best classes and M and L as their worst. The Naive Bayes classifier performs significantly worse on the S, M, and L classes and has XL as its second-best class.

#### 3-class split

In the following table we can see the F1-score of each model for each class in the 3-class split:

| Class | KNN | Naive Bayes | Random Forest | XGBoost |
| ----- | ---- | ----------- | ------------- | ------- |
| XS | 0.83 | 0.82 | 0.81 | 0.84 |
| S,M,L | 0.27 | 0.28 | 0.30 | 0.33 |
| XL | 0.16 | 0.07 | 0.13 | 0.14 |
| Class | KNN | Naive Bayes | Random Forest | XGBoost | AdaBoost | AdaBoost(subset=1) | LightGBM |
| ----- | ---- | ----------- | ------------- | -------- | -------- | ------------------ | -------- |
| XS | 0.83 | 0.82 | 0.81 | **0.84** | 0.78 | 0.79 | 0.83 |
| S,M,L | 0.27 | 0.28 | 0.30 | 0.33 | **0.34** | 0.32 | **0.34** |
| XL | 0.16 | 0.07 | 0.13 | 0.14 | 0.12 | **0.20** | 0.19 |

For the 3-class split we observe similar performance for the XS and {S, M, L} classes for each model, while the XGBoost model slightly outperforms the other models. The KNN classifier is performing the best on the XL class while the Naive Bayes classifier performs worst. Interestingly, we can observe that the performance of the models on the XS class was barely affected by the merging of the s, M, and L classes while the performance on the XL class got worse for all of them. This needs to be considered, when evaluating the overall performance of the models on this data set split.
For the 3-class split we observe similar performance on the XS and {S, M, L} classes for each model, while the XGBoost model slightly outperforms the other models. Among the originally evaluated models the KNN classifier performs best on the XL class, while the Naive Bayes classifier performs worst. Interestingly, the performance of the models on the XS class was barely affected by the merging of the S, M, and L classes, while the performance on the XL class got worse for all of them. This needs to be considered when evaluating the overall performance of the models on this data set split.
The AdaBoost Classifier, trained on subset 1, performs best for the XL class.
20 changes: 18 additions & 2 deletions src/demo/demos.py
@@ -7,12 +7,11 @@
# SPDX-FileCopyrightText: 2023 Ahmed Sheta <ahmed.sheta@fau.de>


from sklearn.metrics import classification_report, mean_squared_error
from sklearn.metrics import classification_report

from bdc import DataCollector
from bdc.pipeline import Pipeline
from database import get_database
from database.parsers import LeadParser
from demo.console_utils import (
get_int_input,
get_multiple_choice,
@@ -76,11 +75,28 @@ def evp_demo():
):
limit_classes = True

feature_subsets = [
["Include all features"],
[
"google_places_rating",
"google_places_user_ratings_total",
"google_places_confidence",
"regional_atlas_regional_score",
],
]
print("Do you want to train on a subset of features?")

for i, p in enumerate(feature_subsets):
print(f"({i}) : {p}")
feature_choice = get_int_input("", range(0, len(feature_subsets)))
feature_choice = None if feature_choice == 0 else feature_subsets[feature_choice]

evp = EstimatedValuePredictor(
data=data,
model_type=model_type,
model_name=model_name,
limit_classes=limit_classes,
selected_features=feature_choice,
)

while True:
16 changes: 13 additions & 3 deletions src/evp/evp.py
@@ -1,17 +1,19 @@
# SPDX-License-Identifier: MIT
# SPDX-FileCopyrightText: 2023 Felix Zailskas <felixzailskas@gmail.com>

import lightgbm as lgb
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight, resample
from sklearn.utils import class_weight

from database.models import Lead
from evp.predictors import (
XGB,
AdaBoost,
Classifier,
KNNClassifier,
LightGBM,
MerchantSizeByDPV,
NaiveBayesClassifier,
Predictors,
@@ -37,11 +39,15 @@ def __init__(
model_type: Predictors = Predictors.RandomForest,
model_name: str = None,
limit_classes: bool = False,
selected_features: list = None,
**model_args,
) -> None:
self.df = data
self.num_classes = 5
features = self.df.drop("MerchantSizeByDPV", axis=1).to_numpy()
features = self.df.drop("MerchantSizeByDPV", axis=1)
if selected_features is not None:
features = features[selected_features]
features = features.to_numpy()
if limit_classes:
self.num_classes = 3
self.df["new_labels"] = np.where(
@@ -91,6 +97,10 @@ def __init__(
self.lead_classifier = KNNClassifier(
model_name=model_name, **model_args
)
case Predictors.AdaBoost:
self.lead_classifier = AdaBoost(model_name=model_name, **model_args)
case Predictors.LightGBM:
self.lead_classifier = LightGBM(model_name=model_name, **model_args)
case default:
log.error(
f"Error: EVP initialized with unsupported model type {model_type}!"
112 changes: 108 additions & 4 deletions src/evp/predictors.py
@@ -4,12 +4,13 @@
from abc import ABC, abstractmethod
from enum import Enum

import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from tqdm import tqdm
from sklearn.tree import DecisionTreeClassifier

from database import get_database
from logger import get_logger
@@ -19,9 +20,11 @@

class Predictors(Enum):
RandomForest = "Random Forest"
XGBoost = "XG Boost"
XGBoost = "XGBoost"
NaiveBayes = "Naive Bayes"
KNN = "KNN Classifier"
AdaBoost = "AdaBoost"
LightGBM = "LightGBM"


class MerchantSizeByDPV(Enum):
@@ -55,7 +58,7 @@ def predict(self, X) -> list[MerchantSizeByDPV]:
def train(
self, X_train, y_train, X_test, y_test, epochs=1, batch_size=None
) -> None:
log.info(f"Training {type(self).__name__}")
log.info(f"Training {type(self).__name__} for {epochs} epochs")

self.model.fit(X_train, y_train)

@@ -260,3 +263,104 @@ def train(
self.classification_report["epochs"] = epochs
self.epochs = epochs
self.f1_test = f1_test


class AdaBoost(Classifier):
def __init__(
self,
model_name: str = None,
n_estimators=100,
class_weight=None,
random_state=42,
) -> None:
super().__init__()
self.random_state = random_state
self.model = None
if model_name is not None:
self.load(model_name)
if self.model is None:
log.info(
f"Loading model '{model_name}' failed. Initializing new untrained model!"
)
self._init_new_model(
n_estimators=n_estimators, class_weight=class_weight
)
else:
self._init_new_model(n_estimators=n_estimators, class_weight=class_weight)

def _init_new_model(self, n_estimators=100, class_weight=None):
self.model = AdaBoostClassifier(
estimator=DecisionTreeClassifier(max_depth=None, class_weight=class_weight),
n_estimators=n_estimators,
random_state=self.random_state,
)

def predict(self, X) -> MerchantSizeByDPV:
return self.model.predict(X)

def train(
self, X_train, y_train, X_test, y_test, epochs=1, batch_size=None
) -> None:
super().train(
X_train, y_train, X_test, y_test, epochs=epochs, batch_size=batch_size
)


class LightGBM(Classifier):
def __init__(
self,
model_name: str = None,
num_leaves=2000,
random_state=42,
) -> None:
super().__init__()
self.random_state = random_state
self.model = None
self.num_leaves = num_leaves
if model_name is not None:
self.load(model_name)
if self.model is None:
log.info(
f"Loading model '{model_name}' failed. Initializing new untrained model!"
)
            self._init_new_model()
        else:
            self._init_new_model()

    def _init_new_model(self):
self.params_lgb = {
"boosting_type": "gbdt",
"objective": "multiclass",
"metric": "multi_logloss",
"num_class": 5,
"num_leaves": self.num_leaves,
"max_depth": -1,
"learning_rate": 0.05,
"feature_fraction": 0.9,
}
self.model = lgb.LGBMClassifier(**self.params_lgb)

def predict(self, X) -> MerchantSizeByDPV:
return self.model.predict(X)

def train(
self, X_train, y_train, X_test, y_test, epochs=1, batch_size=None
) -> None:
log.info("Training LightGBM")

self.model.fit(X_train, y_train)

# inference
y_pred = self.model.predict(X_test)
# metrics
accuracy = accuracy_score(y_test, y_pred)
f1_test = f1_score(y_test, y_pred, average="weighted")

log.info(f"F1 Score on Testing Set: {f1_test:.4f}")
log.info("Computing classification report")
self.classification_report = classification_report(
y_test, y_pred, output_dict=True
)
self.classification_report["epochs"] = epochs
self.epochs = epochs
self.f1_test = f1_test
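
For completeness, a hedged usage sketch of the new LightGBM predictor through `EstimatedValuePredictor`, matching the constructor extended in `src/evp/evp.py`. The DataFrame `leads_df` and the import paths are assumptions based on the repository layout; the actual call site is `src/demo/demos.py`:

```python
from evp.evp import EstimatedValuePredictor  # import path assumed from the repo layout
from evp.predictors import Predictors

# `leads_df` is an assumed, preprocessed pandas DataFrame with the one-hot encoded
# features and the "MerchantSizeByDPV" label column; building it is outside this diff.
evp = EstimatedValuePredictor(
    data=leads_df,
    model_type=Predictors.LightGBM,
    limit_classes=False,            # True merges S, M and L for the 3-class split
    selected_features=[             # subset 1 from the documentation, or None for all features
        "google_places_rating",
        "google_places_user_ratings_total",
        "google_places_confidence",
        "regional_atlas_regional_score",
    ],
)
```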