A comparative study of the performance of four classification algorithms from the Apache Spark ML library

This repository contains the source code of the paper published at the CACIC 2021 conference entitled "A comparative study of the performance of four classification algorithms from the Apache Spark ML library".

Requirements

Python 3.5 or higher.
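
As a quick way to confirm that an interpreter meets this requirement before creating the virtualenv, a check along the following lines could be used. This is only an illustrative sketch: the helper name python_is_supported is hypothetical and is not part of this repository.

```python
import sys

# Minimum version stated in the Requirements section of this README.
MIN_VERSION = (3, 5)


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`.

    Hypothetical helper for illustration; not part of the repository.
    """
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if python_is_supported():
        print("Python version is supported")
    else:
        print("Python 3.5 or higher is required")
```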

Installation

1. Create a Python virtualenv: python3 -m venv venv
2. Activate the virtualenv: source venv/bin/activate
3. Install the requirements: pip install -r requirements.txt

Execution

All the experiments are run from the file main.py. You may want to edit some parameters first (check the TODO comments in the script before running it).
To run the script locally: python3 main.py
To run in a distributed environment (i.e. in an Apache Spark cluster), we use the configuration provided by this repository. Once the Spark cluster is ready, follow these steps to run the benchmarks detailed in the article:
1. Enter the master node: docker container exec -it [master node container ID] bash
2. Once inside the container: spark-submit main.py &> output_and_errors.txt &

This command runs the experiments in the background, redirecting stdout and stderr to the file output_and_errors.txt.
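
Because the job runs in the background, problems only show up in output_and_errors.txt. A small script like the one below could be used to scan that file once the run has finished. It is a sketch under assumptions: the function name find_error_lines is hypothetical, and the markers it looks for (a Python traceback header and the word ERROR, as commonly seen in Spark logs) are a guess at what failures look like in this output, not something this repository documents.

```python
import re
from pathlib import Path

# Assumed failure markers: Python traceback headers and log lines
# containing the word ERROR. Adjust to match the real output.
ERROR_PATTERN = re.compile(r"Traceback \(most recent call last\)|\bERROR\b")


def find_error_lines(log_path):
    """Return (line_number, text) pairs for lines that look like errors.

    Hypothetical helper for illustration; not part of the repository.
    """
    hits = []
    with Path(log_path).open(encoding="utf-8", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if ERROR_PATTERN.search(line):
                hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    log_file = Path("output_and_errors.txt")
    if log_file.exists():
        for number, text in find_error_lines(log_file):
            print("{}: {}".format(number, text))
```

An empty result only means that none of the assumed markers were found, so it is still worth reading the end of the file to confirm that the experiments completed.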
