Classic algorithms are fair learners: Classification Analysis of natural weather and wildfire occurrences
Classic machine learning algorithms have been reviewed and studied mathematically in detail with respect to their performance and properties. This paper reviews the empirical behavior of widely used classical supervised learning algorithms: Decision Trees, Boosting, Support Vector Machines, k-Nearest Neighbors, and a shallow Artificial Neural Network. The paper evaluates these algorithms on sparse tabular data for a classification task and observes the effect of specific hyperparameters when the data is synthetically modified to introduce higher noise. These perturbations were introduced to measure how well the algorithms generalize on sparse data and how different hyperparameters can be used to improve classification accuracy. The paper shows that, owing to their inherent properties, these classic algorithms are fair learners even on such limited, noisy, and sparse datasets.
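As a rough illustration of the kind of comparison described above, the sketch below trains the five classifier families on a synthetic, imbalanced tabular dataset with a fraction of the labels flipped to emulate added noise. The data, hyperparameters, and noise level here are placeholder assumptions, not the values used in the paper.

```python
# Illustrative sketch only: a synthetic stand-in for the sparse, noisy tabular
# data described above; the real study uses the Rattle weather and US wildfire
# datasets with tuned hyperparameters.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Imbalanced binary classification data as a stand-in for the weather/wildfire tables.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.8, 0.2], random_state=42)

# Synthetic perturbation: flip 10% of the labels to emulate higher noise.
rng = np.random.RandomState(42)
flip = rng.rand(len(y)) < 0.10
y_noisy = np.where(flip, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_noisy, test_size=0.3, stratify=y_noisy, random_state=42)

# Placeholder hyperparameters; the paper identifies its own via validation
# curves and grid search.
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=6, random_state=42),
    "Boosting": AdaBoostClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "k-NN": KNeighborsClassifier(n_neighbors=15),
    "Shallow NN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name:>13}: test accuracy = {model.score(X_test, y_test):.3f}")
```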
If you find this project useful in your research or work, please consider citing it:
```bibtex
@article{gopal2023classic,
  title={Classic algorithms are fair learners: Classification Analysis of natural weather and wildfire occurrences},
  author={Gopal, Senthilkumar},
  journal={arXiv preprint arXiv:2309.01381},
  year={2023}
}
```
- All the necessary files are available via `git clone`, or download the files to continue with the setup.
  a. Rattle - https://www.kaggle.com/jsphyg/weather-dataset-rattle-package
  b. Wildfire - https://www.kaggle.com/rtatman/188-million-us-wildfires
- The Rattle dataset is already available in the GitHub repository. If it needs to be updated for any reason, drop the downloaded file at `\data\rattle\weatherAUS.csv`
- The Wildfire dataset is too large and is *not* available in GitHub. Download the file `FPA_FOD_20170508.sqlite` from https://www.kaggle.com/rtatman/188-million-us-wildfires and place it at `\data\wildfire\FPA_FOD_20170508.sqlite`. A quick sanity check for both files is sketched below.
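Once both files are in place, an optional sanity check such as the following (not part of the repository code; paths follow the locations above) can confirm they are readable:

```python
# Optional sanity check: confirm the two dataset files are where the setup
# above expects them to be.
import os
import sqlite3

import pandas as pd

rattle_path = os.path.join("data", "rattle", "weatherAUS.csv")
wildfire_path = os.path.join("data", "wildfire", "FPA_FOD_20170508.sqlite")

weather = pd.read_csv(rattle_path)
print(f"Rattle weather data: {weather.shape[0]} rows, {weather.shape[1]} columns")

# List the tables inside the wildfire SQLite file rather than assuming a table name.
with sqlite3.connect(wildfire_path) as conn:
    tables = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type='table'", conn)
print("Wildfire SQLite tables:", ", ".join(tables["name"]))
```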
- All the files require Python 3.7 or above
- The environment file has been exported as `environment.yml`
- Use `conda env create -f environment.yml` to create the `py37` environment
- Activate the new environment: `conda activate py37`
- In case you believe a working `py37` environment is already set up and reasonably stable, check for the following libraries and versions (a quick check snippet follows): pandas 0.25.0, numpy 1.16.4, scikit-learn 0.21.2, seaborn 0.9.0, matplotlib 3.1.0
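The versions can be confirmed from a Python prompt with a snippet such as the one below (expected values are the ones listed above):

```python
# Quick version check for a pre-existing py37 environment.
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn

print("pandas    ", pd.__version__)          # expected 0.25.0
print("numpy     ", np.__version__)          # expected 1.16.4
print("sklearn   ", sklearn.__version__)     # expected 0.21.2
print("seaborn   ", sns.__version__)         # expected 0.9.0
print("matplotlib", matplotlib.__version__)  # expected 3.1.0
```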
- An optional setup step is required for Graphviz to view the decision tree images. If the executable is not available, the code continues gracefully with an error message.
- Install Graphviz from https://graphviz.gitlab.io/ and ensure it is available on the PATH. A quick PATH check is sketched below.
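An optional way to confirm the `dot` executable is reachable before running the code (this check is not part of the repository; the fallback it mentions is the graceful error message described above):

```python
# Optional check: is the Graphviz `dot` executable available on the PATH?
import shutil

dot_path = shutil.which("dot")
if dot_path:
    print("Graphviz found at:", dot_path)
else:
    print("Graphviz not found; decision tree images will not be rendered "
          "and the code will report an error message instead.")
```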
The following folders are required, without which the code would fail; a sketch for creating them follows the list.
- images/rattle
- images/wildfire
- images/rattle/pre-process
- images/wildfire/pre-process
- output/results
- output/results/rattle
- output/results/wildfire
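A convenience sketch (not part of the repository code) for creating these folders if they do not already exist:

```python
# Create the required output and image folders listed above.
import os

required_dirs = [
    "images/rattle",
    "images/wildfire",
    "images/rattle/pre-process",
    "images/wildfire/pre-process",
    "output/results",
    "output/results/rattle",
    "output/results/wildfire",
]
for path in required_dirs:
    os.makedirs(path, exist_ok=True)
    print("ensured:", path)
```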
- Run the Python file `data_read.py`, which performs all the necessary actions for all five algorithms. The typical runtime with default variables is around 45 minutes.
All of the following variables are available at the beginning of the `data_read.py` file (an illustrative sketch follows the list):
- VALIDATION_CURVE - If set to True, the code generates all the validation curves for the different hyperparameters. This increases the execution time by at least 1 hour.
- GRID_SEARCH - If set to True, the code runs a grid search to find the optimal hyperparameter combinations instead of using the pre-identified values. This increases the execution time by at least 3-4 hours; please use it with caution.
- EPOCH_GRAPH - If set to True, the code generates the learning curve using epochs for the NN and SVM. This increases the execution time by at least 1 hour.
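For reference, a minimal sketch of how these toggles might look near the top of `data_read.py`; the variable names come from this README, while the exact defaults and surrounding code are assumptions:

```python
# Illustrative sketch of the runtime toggles described above; defaults are
# assumed to be False so that the ~45-minute baseline run is what executes.
VALIDATION_CURVE = False  # True: also generate validation curves (adds >= 1 hour)
GRID_SEARCH = False       # True: grid-search hyperparameter combinations (adds 3-4 hours)
EPOCH_GRAPH = False       # True: epoch-based learning curves for the NN and SVM (adds >= 1 hour)
```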
- Ignore any warnings generated on the console, as most of them are deprecation related.