The Jupyter notebooks are pushed as `.py` files in the Python percent script format (we like meaningful diffs). To get the actual notebook experience, open them via Jupyter with the jupytext plugin (installed as part of `make setup`).
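For reference, a minimal file in the percent format looks like this (a toy example, not one of the repository's notebooks):

```python
# %% [markdown]
# # Example notebook
# Each `# %%` marker starts a new cell; jupytext maps these
# markers to notebook cells when the file is opened in Jupyter.

# %%
message = "hello from a notebook cell"
print(message)
```

Because the file is plain Python, diffs stay readable and the file still runs as an ordinary script.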
PyTorch requires custom installation routines depending on your local setup. Hence it is not part of `make setup` or the `requirements.txt` file and needs to be installed manually afterwards. Select your preferences and run the install command provided by the click-guide at https://pytorch.org/.
The modeling scripts are found in `./modeling`. We use MLflow to track our model trainings, so it needs to be set up before running scripts in `./modeling`:
- The MLflow URI has to be added manually (it is not stored in git). Either set it once in the local `.mlflow_uri` file (this creates a local file where the URI is stored):

  ```bash
  echo http://127.0.0.1:5000/ > .mlflow_uri
  ```

  or export it as an environment variable (which has to be repeated after every restart of your machine):

  ```bash
  export MLFLOW_URI=http://127.0.0.1:5000/
  ```

- The code in `config.py` first tries to read the URI from the local file; if the file doesn't exist, it falls back to the environment variable. If neither is set, the URI will be empty.
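The lookup order described above can be sketched as follows (a simplified illustration; the actual `config.py` may differ in details such as error handling):

```python
import os
from pathlib import Path


def get_mlflow_uri(uri_file=".mlflow_uri"):
    """Return the tracking URI: local file first, then env var, else empty."""
    path = Path(uri_file)
    if path.exists():
        # Local file wins if it exists
        return path.read_text().strip()
    # Fall back to the environment variable; empty string if unset
    return os.environ.get("MLFLOW_URI", "")
```

The file-first order means a committed-once local file survives machine restarts, while the env var remains available for ephemeral setups such as CI.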
- Create an MLflow experiment with the name that is set in `config.py` (we use the experiment name `nlp-trio`). This has to be done only once. You can either create the experiment via the GUI (see the MLflow documentation) or via the CLI:

  ```bash
  mlflow experiments create --experiment-name <name-of-experiment>
  ```
- Always start the MLflow server in a separate terminal session before executing a modeling script:

  ```bash
  mlflow server
  ```

- Access the UI via http://127.0.0.1:5000.
The following modeling scripts do part of their training on Word2Vec and GloVe embeddings. Run `make embeddings` to automatically download the embedding dictionaries to the right subfolder:
- XGBoost
- Logistic Regression
- Support Vector Classifier
- Random Forest Classifier
- Naive Bayes Classifier
- LightGBM
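GloVe files store one token per line followed by its vector components. A minimal parser for such a file could look like this (a sketch for illustration only; the actual modeling scripts may load the embeddings differently, e.g. via gensim):

```python
def load_glove(path):
    """Parse a GloVe-style text file into a {token: vector} dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Format per line: "<token> <v1> <v2> ... <vN>"
            token, *values = line.rstrip().split(" ")
            embeddings[token] = [float(v) for v in values]
    return embeddings
```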
Prerequisite: Starting the backend requires that four saved models of the GBERT classifier are available locally. The backend loads them via `torch.load()`. That means you have to perform a training via `gbert_classifier.py` for each of the following four labels before starting the dashboard backend:
- label_needsmoderation
- label_sentimentnegative
- label_discriminating
- label_inappropriate
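A small pre-flight check along these lines can verify that all four models are present before starting the backend (the `models/gbert_<label>.pt` path scheme is a hypothetical placeholder — adjust it to wherever your trainings save their checkpoints):

```python
from pathlib import Path

LABELS = [
    "label_needsmoderation",
    "label_sentimentnegative",
    "label_discriminating",
    "label_inappropriate",
]


def missing_models(model_dir="models"):
    """Return the labels whose saved model file is not found."""
    # Hypothetical file-naming scheme; adjust to your training output.
    return [l for l in LABELS if not (Path(model_dir) / f"gbert_{l}.pt").exists()]
```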
Once the prerequisite is fulfilled, you can start the backend and the dashboard with the following steps:
Initial setup

- Run `make dashboard`
- Run `make backend`
Start backend

- Open a dedicated terminal session.
- Load the environment:

  ```bash
  source .venv_backend/bin/activate
  ```

- Start the backend via:

  ```bash
  uvicorn prediction_server:app --reload
  ```
Start dashboard

- Open a dedicated terminal session.
- Load the environment:

  ```bash
  source .venv_frontend/bin/activate
  ```

- Start the dashboard via:

  ```bash
  python app.py
  ```

  and copy the address and port that are displayed.
- Open it in your browser, e.g. http://127.0.0.1:8050/ (replace 8050 with the actual port displayed in the previous step).