Merge pull request #227 from amosproj/dev
Sprint 12 Release
soapyheas authored Jan 31, 2024
2 parents 1aab6ed + 2003668 commit fe1f2e9
Showing 24 changed files with 973 additions and 488 deletions.
5 changes: 3 additions & 2 deletions .github/workflows/python-app.yml
@@ -63,6 +63,7 @@ jobs:
pipenv run flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
pipenv run flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
- name: Test with pytest and create coverage report
run: |
pipenv run pytest
pipenv run coverage run --source ./ -m pytest -v tests/
pipenv run coverage report -m
Binary file added Deliverables/sprint-12/demo-day-slide.pdf
Binary file not shown.
2 changes: 2 additions & 0 deletions Deliverables/sprint-12/demo-day-slide.pdf.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Simon Zimmermann <tim.simon.zimmermann@fau.de>
Binary file added Deliverables/sprint-12/demo-day-video.mp4
Binary file not shown.
2 changes: 2 additions & 0 deletions Deliverables/sprint-12/demo-day-video.mp4.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Simon Zimmermann <tim.simon.zimmermann@fau.de>
Binary file added Deliverables/sprint-12/feature-board.png
Binary file not shown.
2 changes: 2 additions & 0 deletions Deliverables/sprint-12/feature-board.png.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Simon Zimmermann <tim.simon.zimmermann@fau.de>
Binary file added Deliverables/sprint-12/imp-squared-backlog.jpg
Binary file not shown.
2 changes: 2 additions & 0 deletions Deliverables/sprint-12/imp-squared-backlog.jpg.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Nico Hambauer <nico.hambauer@fau.de>
Binary file added Deliverables/sprint-12/planning-documents.pdf
Binary file not shown.
2 changes: 2 additions & 0 deletions Deliverables/sprint-12/planning-documents.pdf.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Simon Zimmermann <tim.simon.zimmermann@fau.de>
93 changes: 65 additions & 28 deletions Documentation/Classifier-Comparison.md
@@ -44,7 +44,7 @@ Fully Connected Neural Networks (FCNN) achieved overall lower performance than t
### Fully Connected Neural Networks Regression Model

There has been an idea written in the scientific paper "Inter-species cell detection -
datasets on pulmonary hemosiderophages in equine, human and feline specimens" by Marzahl et al. where they proposed using a regression model on a classification task. The idea is to train the regression model on the class values, where the model predicts a continuous value and learns the relation between the classes. The output is then subjected to thresholds (0-0.49, 0.5-1.49, 1.5-2.49, 2.5-3.49, 3.5-4.5) for the classes XS, S, M, L, XL respectively. This yielded better performance than the FCNN classifier but still was worse than that of the Random Forest.
datasets on pulmonary hemosiderophages in equine, human and feline specimens" by Marzahl et al. (https://www.nature.com/articles/s41597-022-01389-0) where they proposed using a regression model on a classification task. The idea is to train the regression model on the class values, where the model predicts a continuous value and learns the relation between the classes. The output is then subjected to thresholds (0-0.49, 0.5-1.49, 1.5-2.49, 2.5-3.49, 3.5-4.5) for the classes XS, S, M, L, XL respectively. This yielded better performance than the FCNN classifier but still was worse than that of the Random Forest.
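
Below is a minimal sketch of the thresholding idea described above, assuming numeric class targets 0-4 for XS-XL; scikit-learn's `MLPRegressor` stands in for the project's FCNN regression model, and all names and hyperparameters are illustrative.

```python
# Illustrative sketch only: train a regressor on ordinal class values 0-4 (XS-XL)
# and map its continuous predictions back to classes via the thresholds above.
import numpy as np
from sklearn.neural_network import MLPRegressor

CLASS_NAMES = np.array(["XS", "S", "M", "L", "XL"])


def fit_regression_classifier(X_train, y_train_numeric):
    """y_train_numeric holds the ordinal class values 0, 1, 2, 3, 4."""
    reg = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=42)
    reg.fit(X_train, y_train_numeric)
    return reg


def predict_classes(reg, X):
    continuous = reg.predict(X)
    # np.digitize maps values < 0.5 to index 0 (XS), [0.5, 1.5) to 1 (S), ..., >= 3.5 to 4 (XL)
    return CLASS_NAMES[np.digitize(continuous, bins=[0.5, 1.5, 2.5, 3.5])]
```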

### QDA & Ridge Classifier

@@ -54,53 +54,90 @@ classes had F1-scores of ~0.00-0.15. For this reason we are not considering thes
in future experiments. This resulted in an overall F1-score of ~0.11, which is significantly
outperformed by the other tested models.

### TabNet Architecture

TabNet, short for "Tabular Neural Network," is a novel neural network architecture specifically designed for tabular data, as commonly encountered in structured sources such as databases and CSV files. It was introduced in the paper titled "TabNet: Attentive Interpretable Tabular Learning" by Arik et al. (https://arxiv.org/abs/1908.07442). TabNet uses sequential attention to choose which features to reason from at each decision step, enabling interpretability and more efficient learning, as the learning capacity is used for the most salient features. Unfortunately, similarly to our proposed 4-layer network, TabNet only learned the features of the XS class, reaching an XS F1-score of 0.84 while the F1-scores of all other classes are zero. The underlying data does not seem to respond positively to neural network-based approaches.
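
A hypothetical sketch of fitting TabNet on the tabular lead data with the `pytorch-tabnet` package is shown below; the package choice, hyperparameters, and variable names are assumptions, not necessarily the exact setup used in this project.

```python
# Hypothetical TabNet setup using the pytorch-tabnet package; hyperparameters
# and data names are illustrative, not the project's exact configuration.
from pytorch_tabnet.tab_model import TabNetClassifier


def fit_tabnet(X_train, y_train, X_valid, y_valid):
    """X_* are numpy float arrays, y_* are integer labels (0-4 for XS-XL)."""
    clf = TabNetClassifier(n_d=8, n_a=8, n_steps=3, seed=42)
    clf.fit(
        X_train,
        y_train,
        eval_set=[(X_valid, y_valid)],
        eval_metric=["accuracy"],
        max_epochs=100,
        patience=20,
    )
    return clf
```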

## Well performing models

### Random Forest Classifier
In this sub-section we will discuss the results of the well performing models, which are XGBoost, LightGBM, K-Nearest Neighbor (KNN), Random Forest, AdaBoost and Naive Bayes.

### Feature subsets

We have collected a lot of features (~54 data points) for the leads; additionally, one-hot encoding the categorical variables
results in a high-dimensional feature space (132 features). Not all features might be equally relevant for our classification task,
so we want to try different subsets (a small selection sketch follows the list below).

The following subsets are available:

Random Forest Classifier with 100 estimators has been able to achieve an overall F1-score of 0.62 and scores of 0.81, 0.13, 0.09, 0.08 and 0.15 for classes XS, S, M, L and XL respectively.
1. `google_places_rating`, `google_places_user_ratings_total`, `google_places_confidence`, `regional_atlas_regional_score`
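
A minimal sketch of selecting this subset from the full one-hot encoded feature frame is given below; `leads_df` is an assumed variable name, not part of the project's actual API.

```python
# Illustrative selection of feature subset 1 from the full feature frame;
# `leads_df` is an assumed pandas DataFrame holding all 132 encoded features.
import pandas as pd

SUBSET_1 = [
    "google_places_rating",
    "google_places_user_ratings_total",
    "google_places_confidence",
    "regional_atlas_regional_score",
]


def select_features(leads_df: pd.DataFrame, subset: list[str] | None = None) -> pd.DataFrame:
    """Return the full feature frame, or only the columns of the requested subset."""
    return leads_df if subset is None else leads_df[subset]
```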

### Overall Results

Note:
The Random Forest Classifier used 100 estimators.
The KNN classifier used a distance based weighting for the evaluated neighbors and considered 10 neighbors in the 5-class split and 19 neighbors for the 3-class split.
The XGBoost was trained for 10000 rounds.
**_Notes:_**

- The Random Forest Classifier used 100 estimators.
- The AdaBoost Classifier used 100 DecisionTree classifiers.
- The KNN classifier used a distance based weighting for the evaluated neighbors and considered 10 neighbors in the 5-class split and 19 neighbors for the 3-class split.
- The XGBoost was trained for 10000 rounds.
- The LightGBM was trained with 2000 leaves (these settings are sketched below).
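
The notes above could translate into model constructors roughly as follows; any argument not named in the notes (e.g. `random_state`, the sklearn-style wrappers for XGBoost/LightGBM, the Gaussian Naive Bayes variant) is an illustrative assumption.

```python
# Sketch of the hyperparameters listed in the notes above; arguments not named
# there (e.g. random_state) and the choice of sklearn-style wrappers are assumptions.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

models_5_class = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # 100 decision trees as weak learners ("estimator" is "base_estimator" in older scikit-learn)
    "AdaBoost": AdaBoostClassifier(
        estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42
    ),
    # 10 neighbours for the 5-class split (19 for the 3-class split), distance-based weighting
    "KNN": KNeighborsClassifier(n_neighbors=10, weights="distance"),
    # 10000 boosting rounds correspond to n_estimators=10000 in the sklearn wrapper
    "XGBoost": XGBClassifier(n_estimators=10000),
    "LightGBM": LGBMClassifier(num_leaves=2000),
    "Naive Bayes": GaussianNB(),  # variant assumed; no hyperparameters listed in the notes
}
```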


In the following table we can see the models' overall weighted F1-score on the 3-class and
5-class data set split.
5-class data set split. The best performing classifier per row is marked in **bold**.

| | KNN | Naive Bayes | Random Forest | XGBoost | AdaBoost | LightGBM |
| ------- | ------ | ----------- | ------------- | ---------- | -------- | -------- |
| 5-Class | 0.6314 | 0.6073 | 0.6150 | **0.6442** | 0.6098 | 0.6405 |
| 3-Class | 0.6725 | 0.6655 | 0.6642 | **0.6967** | 0.6523 | 0.6956 |

| | KNN | Naive Bayes | Random Forest | XGBoost |
| ------- | ------ | ----------- | ------------- | ------- |
| 5-Class | 0.6314 | 0.6073 | 0.6150 | 0.6442 |
| 3-Class | 0.6725 | 0.6655 | 0.6642 | 0.6967 |
| | KNN (subset=1) | Naive Bayes (subset=1) | RandomForest (subset=1) | XGBoost (subset=1) | AdaBoost (subset=1) | LightGBM (subset=1) |
| ------- | -------------- | ---------------------- | ----------------------- | ------------------ | ------------------- | ------------------- |
| 5-Class | 0.6288 | 0.6075 | 0.5995 | **0.6198** | 0.6090 | 0.6252 |
| 3-Class | 0.6680 | 0.6075 | 0.6506 | **0.6664** | 0.6591 | 0.6644 |

We can see that all classifiers perform better on the 3-class data set split and that the XGBoost classifier is the best performing for both data set splits.
We can see that all classifiers perform better on the 3-class data set split and that the XGBoost classifier is the best performing one for both data set splits. These results are consistent for both the full dataset and subset 1. We observe a slight performance decrease for almost all classifiers when using subset 1 compared to the full dataset (except AdaBoost on the 3-class split and Naive Bayes on the 5-class split). This indicates that the few features retained in subset 1 are not the sole discriminant features of the dataset. However, the performance is still high enough to suggest that the features in subset 1 are highly relevant for classifying the data.
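
The overall scores in the tables above are weighted F1-scores; a sketch of how such a score can be computed with scikit-learn is shown below (variable names are assumptions).

```python
# Weighted F1 as reported in the tables above: per-class F1 averaged,
# weighted by the number of true instances of each class.
from sklearn.metrics import f1_score


def overall_weighted_f1(y_true, y_pred) -> float:
    return f1_score(y_true, y_pred, average="weighted")
```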

### Results for each class

#### 5-class split

In the following table we can see the F1-score of each model for each class in the 5-class split:

| Class | KNN | Naive Bayes | Random Forest | XGBoost |
| ----- | ---- | ----------- | ------------- | ------- |
| XS | 0.82 | 0.83 | 0.81 | 0.84 |
| S | 0.15 | 0.02 | 0.13 | 0.13 |
| M | 0.08 | 0.02 | 0.09 | 0.08 |
| L | 0.06 | 0.00 | 0.08 | 0.06 |
| XL | 0.18 | 0.10 | 0.15 | 0.16 |

For every model we can see that the predictions on the XS class are significantly better than every other class. The KNN, Random Forest, and XGBoost all perform similarly, having second-best classes S and XL and worst classes M and L. The Naive Bayes classifier performs significantly worse on the S, M, and L classes and has second best class XL.
| Class | KNN | Naive Bayes | Random Forest | XGBoost | AdaBoost | LightGBM |
| ----- | ---- | ----------- | ------------- | -------- | -------- | -------- |
| XS | 0.82 | 0.83 | 0.81 | **0.84** | 0.77 | 0.83 |
| S | 0.15 | 0.02 | 0.13 | 0.13 | **0.22** | 0.14 |
| M | 0.08 | 0.02 | 0.09 | 0.08 | **0.14** | 0.09 |
| L | 0.06 | 0.00 | **0.08** | 0.06 | 0.07 | 0.05 |
| XL | 0.18 | 0.10 | 0.15 | 0.16 | 0.14 | **0.21** |

| Class | KNN (subset=1) | Naive Bayes (subset=1) | RandomForest (subset=1) | XGBoost (subset=1) | AdaBoost (subset=1) | LightGBM (subset=1) |
| ----- | -------------- | ---------------------- | ----------------------- | ------------------ | ------------------- | ------------------- |
| XS | 0.82 | 0.84 | 0.78 | **0.84** | 0.78 | 0.82 |
| S | 0.16 | 0.00 | 0.16 | 0.04 | **0.19** | 0.13 |
| M | 0.07 | 0.00 | 0.07 | 0.02 | **0.09** | 0.08 |
| L | **0.07** | 0.00 | 0.06 | 0.05 | **0.07** | 0.06 |
| XL | **0.19** | 0.00 | 0.11 | 0.13 | 0.14 | 0.18 |

For every model we can see that the predictions on the XS class are significantly better than for every other class. The KNN, Random Forest, and XGBoost classifiers all perform similarly, with S and XL as their second-best classes and M and L as their worst. The Naive Bayes classifier performs significantly worse on the S, M, and L classes and has XL as its second-best class.
Using subset 1 again mostly decreased performance across all classes, with the exception of the KNN classifier on the L and XL classes, where we observe a slight increase in F1-score.
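
The per-class scores in the tables above can be produced analogously with scikit-learn, e.g. via `classification_report`; as before, the variable names and the 0-4 label encoding are assumptions.

```python
# Per-class F1-scores (one value per class, as in the tables above) rather than
# a single weighted average; label encoding 0-4 for XS-XL is assumed.
from sklearn.metrics import classification_report

CLASS_NAMES = ["XS", "S", "M", "L", "XL"]


def per_class_report(y_true, y_pred) -> str:
    return classification_report(y_true, y_pred, target_names=CLASS_NAMES, zero_division=0)
```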

#### 3-class split

In the following table we can see the F1-score of each model for each class in the 3-class split:

| Class | KNN | Naive Bayes | Random Forest | XGBoost |
| ----- | ---- | ----------- | ------------- | ------- |
| XS | 0.83 | 0.82 | 0.81 | 0.84 |
| S,M,L | 0.27 | 0.28 | 0.30 | 0.33 |
| XL | 0.16 | 0.07 | 0.13 | 0.14 |
| Class | KNN | Naive Bayes | Random Forest | XGBoost | AdaBoost | LightGBM |
| ----- | ---- | ----------- | ------------- | -------- | -------- | -------- |
| XS | 0.83 | 0.82 | 0.81 | **0.84** | 0.78 | 0.83 |
| S,M,L | 0.27 | 0.28 | 0.30 | 0.33 | **0.34** | **0.34** |
| XL | 0.16 | 0.07 | 0.13 | 0.14 | 0.12 | **0.19** |

| Class | KNN (subset=1) | Naive Bayes (subset=1) | RandomForest (subset=1) | XGBoost (subset=1) | AdaBoost (subset=1) | LightGBM (subset=1) |
| ----- | -------------- | ---------------------- | ----------------------- | ------------------ | ------------------- | ------------------- |
| XS | 0.82 | 0.84 | 0.79 | **0.84** | 0.79 | 0.81 |
| S,M,L | 0.29 | 0.00 | 0.30 | 0.22 | **0.32** | 0.28 |
| XL | 0.18 | 0.00 | 0.11 | 0.11 | **0.20** | 0.17 |

For the 3-class split we observe similar performance for the XS and {S, M, L} classes for each model, while the XGBoost model slightly outperforms the other models. The KNN classifier is performing the best on the XL class while the Naive Bayes classifier performs worst. Interestingly, we can observe that the performance of the models on the XS class was barely affected by the merging of the S, M, and L classes while the performance on the XL class got worse for all of them. This needs to be considered when evaluating the overall performance of the models on this data set split.
For the 3-class split we observe similar performance for the XS and {S, M, L} classes for each model, while the LightGBM model slightly outperforms the other models. The LightGBM classifier performs best on the XL class while the Naive Bayes classifier performs worst. Interestingly, we can observe that the performance of the models on the XS class was barely affected by the merging of the S, M, and L classes, while the performance on the XL class got worse for all of them. This needs to be considered when evaluating the overall performance of the models on this data set split.
The AdaBoost classifier, trained on subset 1, performs best for the XL class. The KNN classifier got a slight boost in performance for the {S, M, L} and XL classes when using subset 1. All other models perform worse on subset 1.
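
A simple way to derive the 3-class split is to merge the S, M, and L labels into one middle class before training and evaluation; a sketch with an assumed label mapping is given below.

```python
# Illustrative construction of the 3-class split: S, M and L are merged into a
# single middle class, while XS and XL stay separate.
MERGE_MAP = {"XS": "XS", "S": "S,M,L", "M": "S,M,L", "L": "S,M,L", "XL": "XL"}


def to_three_class(labels):
    """Map 5-class string labels to the 3-class split used above."""
    return [MERGE_MAP[label] for label in labels]
```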
10 changes: 6 additions & 4 deletions Pipfile
@@ -8,6 +8,7 @@ name = "pypi"

[dev-packages]
pytest = "==7.4.0"
coverage = "==7.4.1"
pre-commit = "==3.5.0"
flake8 = "==6.0.0"
pytest-env = "==1.0.1"
@@ -44,14 +45,15 @@ textblob = "==0.17.1"
deep-translator = "==1.11.4"
fsspec = "2023.12.2"
s3fs = "2023.12.2"
imblearn = "*"
sagemaker = "*"
imblearn = "==0.0"
sagemaker = "==2.198.0"
joblib = "1.3.2"
xgboost = "*"
colorama = "*"
xgboost = "==2.0.3"
colorama = "==0.4.6"
torch = "2.1.2"
deutschland = "0.4.0"
bs4 = "0.0.2"
lightgbm = "==4.3.0"

[requires]
python_version = "3.10"
