A comparative study of the performance of four classification algorithms from the Apache Spark ML library
This repository contains the source code of the paper published at the CACIC 2021 conference entitled "A comparative study of the performance of four classification algorithms from the Apache Spark ML library".
- Python 3.5 or higher.
- First, create a Python virtualenv:
python3 -m venv venv
- Activate the virtualenv:
source venv/bin/activate
- Secondly, install the requirements:
pip install -r requirements.txt
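Before running the experiments, the setup above can be sanity-checked with a short script. This is a minimal sketch, not part of the repository: the names `python_is_supported` and `in_virtualenv` are hypothetical, and the check assumes the virtualenv was created with `python3 -m venv` as shown above.

```python
import sys

MINIMUM_VERSION = (3, 5)  # taken from the requirement stated above


def python_is_supported(version_info=None, minimum=MINIMUM_VERSION):
    """Return True if the interpreter version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


def in_virtualenv():
    """Return True if the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix points at the base installation (Python 3.3+).
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    # str.format is used instead of f-strings so this also runs on Python 3.5.
    print("Python {}.{} supported: {}".format(
        sys.version_info[0], sys.version_info[1], python_is_supported()))
    print("Inside a virtualenv: {}".format(in_virtualenv()))
```

If the second line prints False, the virtualenv is probably not activated in the current shell.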
- All the experiments are run from the file main.py. You may want to edit some parameters first (the script contains TODO comments to check before running it). To run the script locally:
python3 main.py
- To run in a distributed environment (i.e. in an Apache Spark cluster), we use the configuration provided by this repository. Once the Spark cluster is ready, run the following commands to execute the benchmarks detailed in the article:
- Enter the master node:
docker container exec -it [master node container ID] bash
- Once inside the container:
spark-submit main.py &> output_and_errors.txt &
That command runs the experiments in the background, redirecting both stdout and stderr to the file output_and_errors.txt.
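For readers less familiar with the shell syntax, the same "run in the background and send stdout and stderr to one file" behaviour can be sketched in Python with the standard `subprocess` module. This is only an illustration under stated assumptions: `run_in_background` is a hypothetical helper, not something the repository provides, and the demo uses a stand-in command so that it runs without Spark installed.

```python
import subprocess
import sys


def run_in_background(command, log_path="output_and_errors.txt"):
    """Start `command` without waiting for it; stdout and stderr go to `log_path`."""
    log_file = open(log_path, "w")
    # stderr=subprocess.STDOUT merges stderr into the same file as stdout,
    # which is what `&> file` does in bash.
    process = subprocess.Popen(command, stdout=log_file, stderr=subprocess.STDOUT)
    log_file.close()  # the child process keeps its own handle to the file
    return process


if __name__ == "__main__":
    # Stand-in for ["spark-submit", "main.py"] so the sketch runs anywhere.
    demo_command = [
        sys.executable, "-c",
        "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
    ]
    process = run_in_background(demo_command, "demo_output_and_errors.txt")
    process.wait()  # only the demo waits; a real run would return immediately
    with open("demo_output_and_errors.txt") as log:
        print(log.read())
```

Inside the master container, the command list would presumably be `["spark-submit", "main.py"]`, with progress followed by reading output_and_errors.txt while the job runs.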