Added LightGBM model and documentation of the investigated TabNet architecture #220

Merged
merged 7 commits on Jan 29, 2024
Changes from 6 commits
68 changes: 42 additions & 26 deletions Documentation/Classifier-Comparison.md
@@ -44,7 +44,7 @@ Fully Connected Neural Networks (FCNN) achieved overall lower performance than t
### Fully Connected Neural Networks Regression Model

In the scientific paper "Inter-species cell detection -
datasets on pulmonary hemosiderophages in equine, human and feline specimens", Marzahl et al. proposed using a regression model for a classification task. The idea is to train the regression model on the class values; the model then predicts a continuous value and learns the ordinal relation between the classes. The output is mapped to the classes XS, S, M, L and XL via the thresholds 0-0.49, 0.5-1.49, 1.5-2.49, 2.5-3.49 and 3.5-4.5 respectively. This yielded better performance than the FCNN classifier but was still worse than that of the Random Forest.
datasets on pulmonary hemosiderophages in equine, human and feline specimens", Marzahl et al. (https://www.nature.com/articles/s41597-022-01389-0) proposed using a regression model for a classification task. The idea is to train the regression model on the class values; the model then predicts a continuous value and learns the ordinal relation between the classes. The output is mapped to the classes XS, S, M, L and XL via the thresholds 0-0.49, 0.5-1.49, 1.5-2.49, 2.5-3.49 and 3.5-4.5 respectively. This yielded better performance than the FCNN classifier but was still worse than that of the Random Forest.
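
A minimal sketch of the thresholding step described above, assuming regression outputs in the 0-4.5 range; the function and variable names are illustrative, not from the PR:

```python
import numpy as np

# Class boundaries from the text: 0-0.49 -> XS, 0.5-1.49 -> S, 1.5-2.49 -> M,
# 2.5-3.49 -> L, 3.5-4.5 -> XL
BINS = [0.5, 1.5, 2.5, 3.5]  # upper edges between neighboring classes
LABELS = ["XS", "S", "M", "L", "XL"]


def regression_to_class(y_pred: np.ndarray) -> list[str]:
    """Map continuous regression outputs onto the ordinal size classes."""
    indices = np.digitize(y_pred, BINS)  # 0..4 depending on the interval y falls into
    return [LABELS[i] for i in indices]


# example: predictions 0.3, 1.7 and 4.2 map to XS, M and XL
print(regression_to_class(np.array([0.3, 1.7, 4.2])))
```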

### QDA & Ridge Classifier

@@ -54,26 +54,41 @@ classes had F1-scores of ~0.00-0.15. For this reason we are not considering these
in future experiments. This resulted in an overall F1-score of ~0.11, which is significantly
outperformed by the other tested models.

### TabNet Architecture

TabNet, short for "Tabular Neural Network", is a neural network architecture specifically designed for tabular data, as commonly encountered in structured sources such as databases and CSV files. It was introduced in the paper "TabNet: Attentive Interpretable Tabular Learning" by Arik et al. (https://arxiv.org/abs/1908.07442). TabNet uses sequential attention to choose which features to reason from at each decision step, enabling interpretability and more efficient learning, since the learning capacity is focused on the most salient features. Unfortunately, similarly to our proposed 4-layer network, TabNet only learned the features of the XS class, reaching an XS F1-score of 0.84 while the F1-scores of all other classes were zero. The underlying data does not seem to respond well to neural-network-based approaches.
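
The PR only documents the TabNet investigation; the training code is not part of this diff. The snippet below is a minimal sketch of how such an experiment could look with the pytorch-tabnet package, using synthetic stand-in data as an assumption rather than the actual experiment:

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier

# synthetic stand-in for the 132-dimensional lead features and the 5 size classes
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 132)).astype(np.float32)
y = rng.integers(0, 5, size=1000)
X_train, X_valid, y_train, y_valid = X[:800], X[800:], y[:800], y[800:]

# sequential-attention model; hyperparameters here are illustrative defaults
clf = TabNetClassifier(n_d=8, n_a=8, n_steps=3, seed=42)
clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric=["accuracy"],
    max_epochs=20,
    patience=5,
    batch_size=256,
)
print(clf.predict(X_valid)[:10])
```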

## Well performing models

### Random Forest Classifier
In this sub-section we discuss the results of the well-performing models, which are XGBoost, LightGBM, K-Nearest Neighbors (KNN), Random Forest, AdaBoost and Naive Bayes.

### Feature subsets

Random Forest Classifier with 100 estimators has been able to achieve an overall F1-score of 0.62 and scores of 0.81, 0.13, 0.09, 0.08 and 0.15 for classes XS, S, M, L and XL respectively.
We have collected many features (~54 data points) for the leads; additionally, one-hot encoding the categorical variables
results in a high-dimensional feature space (132 features). Not all features might be equally relevant for our classification task,
so we want to try different subsets (a selection sketch follows the list below).

The following subsets are available:

1. `google_places_rating`, `google_places_user_ratings_total`, `google_places_confidence`, `regional_atlas_regional_score`
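
The PR implements this in `EstimatedValuePredictor` by indexing the feature DataFrame with the selected column names; below is a minimal standalone sketch of that selection step (the DataFrame name `leads_df` is illustrative, not from the PR):

```python
import pandas as pd

# Subset 1 from the list above
SUBSET_1 = [
    "google_places_rating",
    "google_places_user_ratings_total",
    "google_places_confidence",
    "regional_atlas_regional_score",
]


def select_features(df: pd.DataFrame, selected: list | None) -> pd.DataFrame:
    """Return all feature columns if `selected` is None, otherwise only the chosen subset."""
    return df if selected is None else df[selected]


# usage with an illustrative DataFrame `leads_df` holding the one-hot encoded features:
# X = select_features(leads_df.drop("MerchantSizeByDPV", axis=1), SUBSET_1).to_numpy()
```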

### Overall Results

Note:
The Random Forest Classifier used 100 estimators.
The KNN classifier used a distance based weighting for the evaluated neighbors and considered 10 neighbors in the 5-class split and 19 neighbors for the 3-class split.
The XGBoost was trained for 10000 rounds.
**_Notes:_**

- The Random Forest Classifier used 100 estimators.
- The AdaBoost Classifier used 100 DecisionTree classifiers.
- The KNN classifier used distance-based weighting for the evaluated neighbors and considered 10 neighbors in the 5-class split and 19 neighbors in the 3-class split.
- The XGBoost model was trained for 10,000 rounds.
- The LightGBM model was trained with 2,000 leaves (see the configuration sketch below).
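
For reference, a minimal sketch of how these configurations could be instantiated. The Random Forest, AdaBoost, KNN and LightGBM settings mirror the notes above; the XGBoost line uses the sklearn wrapper as a rough stand-in for "10,000 rounds", which is an assumption since the XGBoost training code is not part of this diff:

```python
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Forest: 100 estimators
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# AdaBoost: 100 decision-tree base estimators
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42
)

# KNN: distance-based weighting, 10 neighbors (5-class split); use 19 for the 3-class split
knn = KNeighborsClassifier(n_neighbors=10, weights="distance")

# LightGBM: 2,000 leaves, multiclass objective (mirroring params_lgb in predictors.py)
lgbm = lgb.LGBMClassifier(num_leaves=2000, objective="multiclass", learning_rate=0.05)

# XGBoost: roughly 10,000 boosting rounds via the sklearn wrapper (assumed equivalent)
xgb_clf = xgb.XGBClassifier(n_estimators=10000)
```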

In the following table we can see the model's overall weighted F1-score on the 3-class and
5-class data set split.
5-class data set split. The best-performing classifier per row is marked in **bold**.

| | KNN | Naive Bayes | Random Forest | XGBoost |
| ------- | ------ | ----------- | ------------- | ------- |
| 5-Class | 0.6314 | 0.6073 | 0.6150 | 0.6442 |
| 3-Class | 0.6725 | 0.6655 | 0.6642 | 0.6967 |
| | KNN | Naive Bayes | Random Forest | XGBoost | AdaBoost | AdaBoost(subset=1) | LightGBM |
| ------- | ------ | ----------- | ------------- | ---------- | -------- | ------------------ | -------- |
| 5-Class | 0.6314 | 0.6073 | 0.6150 | **0.6442** | 0.6098 | 0.6090 | 0.6405 |
| 3-Class | 0.6725 | 0.6655 | 0.6642 | **0.6967** | 0.6523 | 0.6591 | 0.6956 |

We can see that all classifiers perform better on the 3-class data set split and that the XGBoost classifier is the best performing for both data set splits.

@@ -83,24 +98,25 @@ We can see that all classifiers perform better on the 3-class data set split and

In the following table we can see the F1-score of each model for each class in the 5-class split:

| Class | KNN | Naive Bayes | Random Forest | XGBoost |
| ----- | ---- | ----------- | ------------- | ------- |
| XS | 0.82 | 0.83 | 0.81 | 0.84 |
| S | 0.15 | 0.02 | 0.13 | 0.13 |
| M | 0.08 | 0.02 | 0.09 | 0.08 |
| L | 0.06 | 0.00 | 0.08 | 0.06 |
| XL | 0.18 | 0.10 | 0.15 | 0.16 |
| Class | KNN | Naive Bayes | Random Forest | XGBoost | AdaBoost | AdaBoost(subset=1) | LightGBM |
| ----- | ---- | ----------- | ------------- | -------- | -------- | ------------------ | -------- |
| XS | 0.82 | 0.83 | 0.81 | **0.84** | 0.77 | 0.78 | 0.83 |
| S | 0.15 | 0.02 | 0.13 | 0.13 | **0.22** | 0.19 | 0.14 |
| M | 0.08 | 0.02 | 0.09 | 0.08 | **0.14** | 0.09 | 0.09 |
| L | 0.06 | 0.00 | **0.08** | 0.06 | 0.07 | 0.07 | 0.05 |
| XL | 0.18 | 0.10 | 0.15 | 0.16 | 0.17 | 0.14 | **0.21** |

For every model we can see that the predictions on the XS class are significantly better than every other class. TFor the KNN, Random Forest, and XGBoost all perform similar, having second best classes S and XL and worst classes M and L. The Naive Bayes classifier performs significantly worse on the S, M, and L classes and has second best class XL.
For every model we can see that the predictions on the XS class are significantly better than on every other class. The KNN, Random Forest, and XGBoost classifiers all perform similarly, with S and XL as their second-best classes and M and L as their worst. The Naive Bayes classifier performs significantly worse on the S, M, and L classes and has XL as its second-best class.

#### 3-class split

In the following table we can see the F1-score of each model for each class in the 3-class split:

| Class | KNN | Naive Bayes | Random Forest | XGBoost |
| ----- | ---- | ----------- | ------------- | ------- |
| XS | 0.83 | 0.82 | 0.81 | 0.84 |
| S,M,L | 0.27 | 0.28 | 0.30 | 0.33 |
| XL | 0.16 | 0.07 | 0.13 | 0.14 |
| Class | KNN | Naive Bayes | Random Forest | XGBoost | AdaBoost | AdaBoost(subset=1) | LightGBM |
| ----- | ---- | ----------- | ------------- | -------- | -------- | ------------------ | -------- |
| XS | 0.83 | 0.82 | 0.81 | **0.84** | 0.78 | 0.79 | 0.83 |
| S,M,L | 0.27 | 0.28 | 0.30 | 0.33 | **0.34** | 0.32 | **0.34** |
| XL | 0.16 | 0.07 | 0.13 | 0.14 | 0.12 | **0.20** | 0.19 |

For the 3-class split we observe similar performance for the XS and {S, M, L} classes for each model, while the XGBoost model slightly outperforms the other models. The KNN classifier is performing the best on the XL class while the Naive Bayes classifier performs worst. Interestingly, we can observe that the performance of the models on the XS class was barely affected by the merging of the s, M, and L classes while the performance on the XL class got worse for all of them. This needs to be considered, when evaluating the overall performance of the models on this data set split.
For the 3-class split we observe similar performance on the XS and {S, M, L} classes for each model, while the XGBoost model slightly outperforms the other models. Among the originally evaluated models the KNN classifier performs best on the XL class, while the Naive Bayes classifier performs worst. Interestingly, the performance of the models on the XS class was barely affected by the merging of the S, M, and L classes, while the performance on the XL class got worse for all of them. This needs to be considered when evaluating the overall performance of the models on this data set split.
The AdaBoost Classifier, trained on subset 1, performs best for the XL class.
20 changes: 18 additions & 2 deletions src/demo/demos.py
@@ -7,12 +7,11 @@
# SPDX-FileCopyrightText: 2023 Ahmed Sheta <ahmed.sheta@fau.de>


from sklearn.metrics import classification_report, mean_squared_error
from sklearn.metrics import classification_report

from bdc import DataCollector
from bdc.pipeline import Pipeline
from database import get_database
from database.parsers import LeadParser
from demo.console_utils import (
get_int_input,
get_multiple_choice,
@@ -76,11 +75,28 @@ def evp_demo():
):
limit_classes = True

feature_subsets = [
["Include all features"],
[
"google_places_rating",
"google_places_user_ratings_total",
"google_places_confidence",
"regional_atlas_regional_score",
],
]
print("Do you want to train on a subset of features?")

for i, p in enumerate(feature_subsets):
print(f"({i}) : {p}")
feature_choice = get_int_input("", range(0, len(feature_subsets)))
feature_choice = None if feature_choice == 0 else feature_subsets[feature_choice]

evp = EstimatedValuePredictor(
data=data,
model_type=model_type,
model_name=model_name,
limit_classes=limit_classes,
selected_features=feature_choice,
)

while True:
16 changes: 13 additions & 3 deletions src/evp/evp.py
@@ -1,17 +1,19 @@
# SPDX-License-Identifier: MIT
# SPDX-FileCopyrightText: 2023 Felix Zailskas <felixzailskas@gmail.com>

import lightgbm as lgb
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight, resample
from sklearn.utils import class_weight

from database.models import Lead
from evp.predictors import (
XGB,
AdaBoost,
Classifier,
KNNClassifier,
LightGBM,
MerchantSizeByDPV,
NaiveBayesClassifier,
Predictors,
@@ -37,11 +39,15 @@ def __init__(
model_type: Predictors = Predictors.RandomForest,
model_name: str = None,
limit_classes: bool = False,
selected_features: list = None,
**model_args,
) -> None:
self.df = data
self.num_classes = 5
features = self.df.drop("MerchantSizeByDPV", axis=1).to_numpy()
features = self.df.drop("MerchantSizeByDPV", axis=1)
if selected_features is not None:
features = features[selected_features]
features = features.to_numpy()
if limit_classes:
self.num_classes = 3
self.df["new_labels"] = np.where(
@@ -91,6 +97,10 @@ def __init__(
self.lead_classifier = KNNClassifier(
model_name=model_name, **model_args
)
case Predictors.AdaBoost:
self.lead_classifier = AdaBoost(model_name=model_name, **model_args)
case Predictors.LightGBM:
self.lead_classifier = LightGBM(model_name=model_name, **model_args)
case default:
log.error(
f"Error: EVP initialized with unsupported model type {model_type}!"
112 changes: 108 additions & 4 deletions src/evp/predictors.py
@@ -4,12 +4,13 @@
from abc import ABC, abstractmethod
from enum import Enum

import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from tqdm import tqdm
from sklearn.tree import DecisionTreeClassifier

from database import get_database
from logger import get_logger
@@ -19,9 +20,11 @@

class Predictors(Enum):
RandomForest = "Random Forest"
XGBoost = "XG Boost"
XGBoost = "XGBoost"
NaiveBayes = "Naive Bayes"
KNN = "KNN Classifier"
AdaBoost = "AdaBoost"
LightGBM = "LightGBM"


class MerchantSizeByDPV(Enum):
@@ -55,7 +58,7 @@ def predict(self, X) -> list[MerchantSizeByDPV]:
def train(
self, X_train, y_train, X_test, y_test, epochs=1, batch_size=None
) -> None:
log.info(f"Training {type(self).__name__}")
log.info(f"Training {type(self).__name__} for {epochs} epochs")

self.model.fit(X_train, y_train)

@@ -260,3 +263,104 @@ def train(
self.classification_report["epochs"] = epochs
self.epochs = epochs
self.f1_test = f1_test


class AdaBoost(Classifier):
def __init__(
self,
model_name: str = None,
n_estimators=100,
class_weight=None,
random_state=42,
) -> None:
super().__init__()
self.random_state = random_state
self.model = None
if model_name is not None:
self.load(model_name)
if self.model is None:
log.info(
f"Loading model '{model_name}' failed. Initializing new untrained model!"
)
self._init_new_model(
n_estimators=n_estimators, class_weight=class_weight
)
else:
self._init_new_model(n_estimators=n_estimators, class_weight=class_weight)

def _init_new_model(self, n_estimators=100, class_weight=None):
self.model = AdaBoostClassifier(
estimator=DecisionTreeClassifier(max_depth=None, class_weight=class_weight),
n_estimators=n_estimators,
random_state=self.random_state,
)

def predict(self, X) -> MerchantSizeByDPV:
return self.model.predict(X)

def train(
self, X_train, y_train, X_test, y_test, epochs=1, batch_size=None
) -> None:
super().train(
X_train, y_train, X_test, y_test, epochs=epochs, batch_size=batch_size
)


class LightGBM(Classifier):
def __init__(
self,
model_name: str = None,
num_leaves=2000,
random_state=42,
) -> None:
super().__init__()
self.random_state = random_state
self.model = None
self.num_leaves = num_leaves
if model_name is not None:
self.load(model_name)
if self.model is None:
log.info(
f"Loading model '{model_name}' failed. Initializing new untrained model!"
)
            self._init_new_model()
        else:
            self._init_new_model()

    def _init_new_model(self):
self.params_lgb = {
"boosting_type": "gbdt",
"objective": "multiclass",
"metric": "multi_logloss",
"num_class": 5,
"num_leaves": self.num_leaves,
"max_depth": -1,
"learning_rate": 0.05,
"feature_fraction": 0.9,
}
self.model = lgb.LGBMClassifier(**self.params_lgb)

def predict(self, X) -> MerchantSizeByDPV:
return self.model.predict(X)

def train(
self, X_train, y_train, X_test, y_test, epochs=1, batch_size=None
) -> None:
log.info("Training LightGBM")

self.model.fit(X_train, y_train)

# inference
y_pred = self.model.predict(X_test)
# metrics
accuracy = accuracy_score(y_test, y_pred)
f1_test = f1_score(y_test, y_pred, average="weighted")

log.info(f"F1 Score on Testing Set: {f1_test:.4f}")
log.info("Computing classification report")
self.classification_report = classification_report(
y_test, y_pred, output_dict=True
)
self.classification_report["epochs"] = epochs
self.epochs = epochs
self.f1_test = f1_test
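
For completeness, a hedged usage sketch of the new LightGBM predictor through `EstimatedValuePredictor`, matching the constructor extended in `src/evp/evp.py`. The DataFrame `leads_df` and the import paths are assumptions based on the repository layout; the actual call site is `src/demo/demos.py`:

```python
from evp.evp import EstimatedValuePredictor  # import path assumed from the repo layout
from evp.predictors import Predictors

# `leads_df` is an assumed, preprocessed pandas DataFrame with the one-hot encoded
# features and the "MerchantSizeByDPV" label column; building it is outside this diff.
evp = EstimatedValuePredictor(
    data=leads_df,
    model_type=Predictors.LightGBM,
    limit_classes=False,            # True merges S, M and L for the 3-class split
    selected_features=[             # subset 1 from the documentation, or None for all features
        "google_places_rating",
        "google_places_user_ratings_total",
        "google_places_confidence",
        "regional_atlas_regional_score",
    ],
)
```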