Merge pull request #237 from amosproj/documentation/sprint-13-ahmed
Last sprint's deliverables
ultiwinter authored Feb 7, 2024
2 parents 48b1a1f + 02cc828 commit 31bdcae
Showing 28 changed files with 707 additions and 159 deletions.
Binary file added Deliverables/sprint-13/build-documentation.pdf
Binary file not shown.
2 changes: 2 additions & 0 deletions Deliverables/sprint-13/build-documentation.pdf.license
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2024 Ahmed Sheta <ahmed.sheta@fau.de>
Binary file added Deliverables/sprint-13/design-documentation.pdf
Binary file not shown.
2 changes: 2 additions & 0 deletions Deliverables/sprint-13/design-documentation.pdf.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2024 Ahmed Sheta <ahmed.sheta@fau.de>
Binary file added Deliverables/sprint-13/feature-board.png
2 changes: 2 additions & 0 deletions Deliverables/sprint-13/feature-board.png.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2024 Simon Zimmermann <tim.simon.zimmermann@fau.de>
Binary file added Deliverables/sprint-13/imp-squared-backlog.jpg
2 changes: 2 additions & 0 deletions Deliverables/sprint-13/imp-squared-backlog.jpg.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Nico Hambauer <nico.hambauer@fau.de>
Binary file added Deliverables/sprint-13/planning-document.pdf
Binary file not shown.
2 changes: 2 additions & 0 deletions Deliverables/sprint-13/planning-document.pdf.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2024 Simon Zimmermann <tim.simon.zimmermann@fau.de>
Binary file added Deliverables/sprint-13/user-documentation.pdf
Binary file not shown.
2 changes: 2 additions & 0 deletions Deliverables/sprint-13/user-documentation.pdf.license
@@ -0,0 +1,2 @@
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2024 Ahmed Sheta <ahmed.sheta@fau.de>
54 changes: 0 additions & 54 deletions Documentation/Architecture.md

This file was deleted.

60 changes: 60 additions & 0 deletions Documentation/Build-Documentation.md
@@ -0,0 +1,60 @@
<!--
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Felix Zailskas <felixzailskas@gmail.com>
-->

# Creating the Environment

The repository contains the file `.env.template`. This file is a template for
the environment variables that need to be set for the application to run. Copy
this file into a file called `.env` at the root level of this repository and
fill in all values with the corresponding secrets.
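
To check that no variable from the template was left unset, the two files can be compared with a short script. This is a minimal sketch, assuming both files consist of plain `KEY=VALUE` lines; the helper names `parse_env` and `missing_env_values` are hypothetical and not part of the repository.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_env_values(template_text: str, env_text: str) -> list[str]:
    """Return the template keys that are absent or empty in the .env text."""
    env = parse_env(env_text)
    return [key for key in parse_env(template_text) if not env.get(key)]
```

It could be called with the contents of both files, for example `missing_env_values(Path(".env.template").read_text(), Path(".env").read_text())` after importing `Path` from `pathlib`.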

To create the virtual environment in this project you must have `pipenv`
installed on your machine. Then run the following commands:

```bash
# for development environment
pipenv install --dev
# for production environment
pipenv install
```

To work within the environment you can now run:

```bash
# to activate the virtual environment
pipenv shell
# to run a single command
pipenv run <COMMAND>
```

# Build Process

This application is built and tested on every push and on every pull request
through GitHub Actions. The workflow installs the `pipenv` environment and then
checks the code style with `flake8`. Finally, the tests in the `tests/`
directory are run with `pytest`, and a test coverage report is created with
`coverage`. The report can be found in the GitHub Actions output.
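
The same checks can be reproduced locally before pushing. The following is a minimal sketch: the workflow file is not shown here, so the exact commands and flags in `CI_STEPS` are assumptions based on the description above, and `run_ci_steps` is a hypothetical helper.

```python
import subprocess

# Assumed local equivalents of the steps described above; the real
# workflow may use different flags or paths.
CI_STEPS = [
    ["pipenv", "install", "--dev"],
    ["pipenv", "run", "flake8", "."],
    ["pipenv", "run", "coverage", "run", "-m", "pytest", "tests/"],
    ["pipenv", "run", "coverage", "report"],
]


def run_ci_steps(steps=CI_STEPS, runner=subprocess.run):
    """Run each step in order; check=True stops at the first failing step."""
    for step in steps:
        runner(step, check=True)
```

Passing a different `runner` makes it possible to dry-run the sequence, for example to print the commands instead of executing them.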

In a separate task, the licenses of all used packages are checked to ensure
that the software does not depend on any copyleft-licensed packages and remains
open source and free to use.

If any of these steps fails for a pull request, the pull request is blocked
from being merged until the failing step is fixed.

Furthermore, it is required to install the pre-commit hooks as described
[here](https://github.com/amosproj/amos2023ws06-sales-lead-qualifier/wiki/Knowledge#pre-commit).
This ensures a uniform coding style throughout the project and keeps the
software compliant with the REUSE licensing specification.

# Running the app

To run the application, the `pipenv` environment must be installed and all
needed environment variables must be set in the `.env` file. The application
can then be started via:

```bash
pipenv run python src/main.py
```
30 changes: 14 additions & 16 deletions Documentation/Classifier-Comparison.md
@@ -6,8 +6,14 @@ SPDX-FileCopyrightText: 2024 Ahmed Sheta <ahmed.sheta@fau.de>

# Classifier Comparison

This document compares the results of the following classifiers on the enriched and
preprocessed data set from the 22.01.2024.
## Abstract

This report presents a comprehensive evaluation of various classifiers trained on the historical data set, which has been enriched and preprocessed through our pipeline. Each model type was tested on two splits of the data set. The data set has five classes for prediction corresponding to different merchant sizes, namely XS, S, M, L, and XL. The first split used exactly these five classes, as given by SumUp. The second split grouped the classes S, M, and L into one new class, resulting in the three classes {XS}, {S, M, L}, and {XL}. While this does not exactly correspond to the classes given by SumUp, this simplification of the prediction task generally resulted in a better F1-score across models.
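
The grouping used for the three-class split can be expressed as a simple label mapping. This is a minimal sketch and not the project's preprocessing code; the merged label name `"SML"` and the helper `to_three_class` are hypothetical.

```python
# Map the five SumUp merchant sizes to the three-class split.
# "SML" is an illustrative name for the merged {S, M, L} class.
FIVE_TO_THREE = {"XS": "XS", "S": "SML", "M": "SML", "L": "SML", "XL": "XL"}


def to_three_class(labels):
    """Convert five-class labels to three-class labels.

    Raises KeyError for any label outside the five known classes.
    """
    return [FIVE_TO_THREE[label] for label in labels]
```

Applying the mapping to the targets only leaves the features untouched, so both splits can be trained from the same enriched data.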

## Experimental Attempts

In accordance with the no free lunch theorem, which states that no single model is universally superior, we made multiple attempts to find the best solution. Unfortunately, certain models did not perform satisfactorily. The following models and methodologies were tried:

- Quadratic Discriminant Analysis (QDA)
- Ridge Classifier
@@ -18,24 +24,13 @@ preprocessed data set from the 22.01.2024.
- XGBoost Classifier Model
- K Nearest Neighbor Classifier (KNN)
- Bernoulli Naive Bayes Classifier

## Experimental Attempts

According to the no free lunch theorem, no single model or methodology performs best on every problem or data set; therefore, multiple attempts are crucial. In this section, we document the experiments we tried and their corresponding performance and outputs.
- LightGBM

## Models not performing well

### Support Vector Machine Classifier Model

Training the Support Vector Machine (SVM) took so long that it never finished. We believe this is because SVMs are very sensitive to misclassifications and have a hard time minimizing them on this data.

### Fully Connected Neural Networks Classifier Model

@@ -82,7 +77,6 @@ The following subsets are available:
- The XGBoost model was trained for 10000 rounds.
- The LightGBM model was trained with 2000 leaves.


In the following table we can see each model's overall weighted F1-score on the 3-class and
5-class data set split. The best-performing classifier per row is marked **bold**.

@@ -141,3 +135,7 @@ In the following table we can see the F1-score of each model for each class in t

For the 3-class split we observe similar performance on the XS and {S, M, L} classes across models, with the LightGBM model slightly outperforming the others. The LightGBM classifier performs best on the XL class, while the Naive Bayes classifier performs worst. Interestingly, the performance of the models on the XS class was barely affected by merging the S, M, and L classes, while the performance on the XL class got worse for all of them. This needs to be considered when evaluating the overall performance of the models on this data set split.
The AdaBoost Classifier, trained on subset 1, performs best for the XL class. The KNN classifier got a slight boost in performance for the {S, M, L} and XL classes when using subset 1. All other models perform worse on subset 1.
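
For reference, the per-class and weighted F1-scores discussed here follow the standard definitions. The sketch below computes them in plain Python; it is not the project's evaluation code, and the function names are hypothetical. scikit-learn's `f1_score` with `average="weighted"` computes the same support-weighted average, although whether the project used it is an assumption.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average the per-class F1-scores, weighted by true class frequency."""
    support = Counter(y_true)
    scores = per_class_f1(y_true, y_pred)
    return sum(scores[label] * n for label, n in support.items()) / len(y_true)
```

Because the weights are the true class frequencies, a frequent class such as XS can dominate the weighted score, which is why the per-class values are reported separately.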

## Conclusion

In summary, XGBoost consistently achieved the highest overall scores, with robust results across the various splits and subsets. However, it is crucial to note that its elevated score may stem from overfitting on the XS class. Given SumUp's emphasis on accurate predictions for the higher classes, we recommend considering LightGBM. This model outperformed XGBoost in predicting the XL class and the other classes, offering better results in both the five-class and three-class splits.
32 changes: 32 additions & 0 deletions Documentation/Controller.md
@@ -0,0 +1,32 @@
<!--
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Simon Zimmermann
SPDX-FileCopyrightText: 2023 Berkay Bozkurt <resitberkaybozkurt@gmail.com>
-->

# Automation

The _Controller_ is a planned component that has not been implemented beyond a
conceptual prototype. In the planned scenario, the controller would coordinate
BDC, MSP and the external components as a centralized instance of control. In
contrast to our current design, which requires several steps of human
interaction to get from initially unprocessed lead data to a prediction result,
this scenario would enable the automation of the whole workflow.
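
The idea of a central instance chaining the steps can be illustrated as follows. This is only a conceptual sketch, not the prototype's code: `run_pipeline`, the `enrich` and `predict` callables, and the `predicted_size` key are hypothetical stand-ins for the BDC step, the MSP step, and the result format.

```python
from typing import Callable


def run_pipeline(
    leads: list[dict],
    enrich: Callable[[dict], dict],
    predict: Callable[[dict], str],
) -> list[dict]:
    """Run every lead through enrichment and prediction without manual steps."""
    results = []
    for lead in leads:
        enriched = enrich(lead)  # stand-in for the BDC step
        size = predict(enriched)  # stand-in for the MSP step
        results.append({**enriched, "predicted_size": size})
    return results
```

Passing the two steps in as callables keeps the sketch independent of how BDC and MSP are actually invoked.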

## Diagrams

The following diagrams were created during the prototyping phase for the
Controller component. As they are from an early stage of our project, the
Merchant Size Predictor is labelled as the (Estimated) Value Predictor here.

### Component Diagram

![Component Diagram](Media/component-diagram-with-controller.svg)

### Sequence Diagram

![Sequence Diagram](Media/sequence-diagram.svg)

### Controller Workflow Diagram

![Controller Workflow Diagram](Media/controller-workflow-diagram.jpg)
29 changes: 0 additions & 29 deletions Documentation/Data-Field-Definition.md

This file was deleted.

23 changes: 23 additions & 0 deletions Documentation/Data-Fields.md
@@ -0,0 +1,23 @@
<!--
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Sophie Heasman <sophieheasmann@gmail.com>
SPDX-FileCopyrightText: 2024 Simon Zimmermann <tim.simon.zimmermann@fau.de>
-->

# Data Field Definitions

This document outlines the data fields obtained for each lead. The data can be
sourced from the online _Lead Form_ or be retrieved from the internet using
APIs.

## Data Field Table

The most recent Data Fields table can now be found in a
[separate CSV File](./data-fields.csv).

## Links to Data Sources:

Lead form: https://www.sumup.com/de-de/kontaktieren-vertriebsteam/ \
Google Places API: https://developers.google.com/maps/documentation/places/web-service/overview \
OpenAI API: https://platform.openai.com/docs/overview \
Meta API: https://developers.facebook.com/docs/graph-api/overview