The Jupyter notebooks are pushed as `.py` files in the Python percent script format (we like meaningful diffs). To get the actual notebook experience, open them via Jupyter with the jupytext plugin (installed as part of `make setup`).
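For reference, a minimal file in the percent format looks like this (a toy example, not one of the repository's notebooks):

```python
# %% [markdown]
# # Example notebook
# Each `# %%` marker starts a new cell; jupytext maps these
# markers to notebook cells when the file is opened in Jupyter.

# %%
message = "hello from a notebook cell"
print(message)
```

Because the file is plain Python, diffs stay readable and the file still runs as an ordinary script.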
PyTorch requires custom installation routines depending on your local setup. Hence it is not part of `make setup` or the `requirements.txt` file and needs to be installed manually afterwards. Select your preferences and run the install command provided by the click-guide at https://pytorch.org/.
The modeling scripts are found in `./modeling`. We use MLflow to track our model trainings, so it needs to be set up before running scripts in `./modeling`:
- The MLflow URI has to be added manually (it is not stored in git). Either set it once in the local `.mlflow_uri` file (this creates a local file where the URI is stored):

  ```bash
  echo http://127.0.0.1:5000/ > .mlflow_uri
  ```

  or export it as an environment variable (which has to be repeated after every restart of your machine):

  ```bash
  export MLFLOW_URI=http://127.0.0.1:5000/
  ```

- The code in `config.py` first tries to read the URI from the local file; if the file doesn't exist, it falls back to the environment variable. If neither is set, the URI will be empty.
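The lookup order described above can be sketched as follows (a simplified illustration; the actual `config.py` may differ in details such as error handling):

```python
import os
from pathlib import Path


def get_mlflow_uri(uri_file=".mlflow_uri"):
    """Return the tracking URI: local file first, then env var, else empty."""
    path = Path(uri_file)
    if path.exists():
        # Local file wins if it exists
        return path.read_text().strip()
    # Fall back to the environment variable; empty string if unset
    return os.environ.get("MLFLOW_URI", "")
```

The file-first order means a committed-once local file survives machine restarts, while the env var remains available for ephemeral setups such as CI.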
- Create an MLflow experiment with the name that is set in `config.py` (we use the experiment name `nlp-trio`). This has to be done only once. You can either create the experiment via the GUI (see the MLflow documentation) or via the CLI:

  ```bash
  mlflow experiments create --experiment-name <name-of-experiment>
  ```
- Always start the MLflow server in a separate terminal session before executing a modeling script:

  ```bash
  mlflow server
  ```

- Access the UI via http://127.0.0.1:5000.
The following modeling scripts do part of their training on Word2Vec and GloVe embeddings. Run `make embeddings` to automatically download the embedding dictionaries to the right subfolder:
- XGBoost
- Logistic Regression
- Support Vector Classifier
- Random Forest Classifier
- Naive Bayes Classifier
- LightGBM
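GloVe files store one token per line followed by its vector components. A minimal parser for such a file could look like this (a sketch for illustration only; the actual modeling scripts may load the embeddings differently, e.g. via gensim):

```python
def load_glove(path):
    """Parse a GloVe-style text file into a {token: vector} dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Format per line: "<token> <v1> <v2> ... <vN>"
            token, *values = line.rstrip().split(" ")
            embeddings[token] = [float(v) for v in values]
    return embeddings
```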
Prerequisite: Starting the backend requires that four saved models of the GBERT classifier are available locally. The backend loads them via `torch.load()`. That means you have to perform a training via `gbert_classifier.py` for each of the following four labels before starting the dashboard backend:
- label_needsmoderation
- label_sentimentnegative
- label_discriminating
- label_inappropriate
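A small pre-flight check along these lines can verify that all four models are present before starting the backend (the `models/gbert_<label>.pt` path scheme is a hypothetical placeholder — adjust it to wherever your trainings save their checkpoints):

```python
from pathlib import Path

LABELS = [
    "label_needsmoderation",
    "label_sentimentnegative",
    "label_discriminating",
    "label_inappropriate",
]


def missing_models(model_dir="models"):
    """Return the labels whose saved model file is not found."""
    # Hypothetical file-naming scheme; adjust to your training output.
    return [l for l in LABELS if not (Path(model_dir) / f"gbert_{l}.pt").exists()]
```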
Once the prerequisite is fulfilled, you can start the backend and the dashboard with the following steps:
Initial setup

- Run `make dashboard`
- Run `make backend`
Start backend

- Open a dedicated terminal session.
- Load the environment:

  ```bash
  source .venv_backend/bin/activate
  ```

- Start the backend via:

  ```bash
  uvicorn prediction_server:app --reload
  ```
Start dashboard

- Open a dedicated terminal session.
- Load the environment:

  ```bash
  source .venv_frontend/bin/activate
  ```

- Start the dashboard via:

  ```bash
  python app.py
  ```

  and copy the address and port that are displayed.
- Open it in your browser, e.g. http://127.0.0.1:8050/ (replace 8050 with the actual port displayed in the previous step).