Parallelization project and report for DS8001: Design of Algorithms and Programming for Massive Data


atabas/Parallelization-ensemble-learning

Parallelization of segments of the end-to-end ensemble training and evaluation pipeline for a text classification task

An end-to-end, heterogeneous, scalable ensemble learning algorithm that demonstrates the performance gain from parallelizing certain portions of the code. The ensemble is trained and evaluated on a text classification task. Specifically, three shallow machine learning models (Logistic Regression, Naive Bayes, and Random Forest) and three deep neural models (LSTMs with three different dropout rates) are trained, and the training and evaluation of these models are executed in parallel. Besides the actual training and evaluation, the preprocessing of the data is also parallelized.
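A minimal sketch of how the shallow members of such a heterogeneous ensemble could be trained and evaluated in parallel with joblib. The toy corpus, labels, and the `fit_and_score` helper are illustrative assumptions, not the repository's actual code; the LSTM members are omitted here to keep the example self-contained.

```python
from joblib import Parallel, delayed
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# Toy corpus standing in for the author-excerpt dataset (hypothetical data).
texts = ["the raven tapped at my door", "the creature opened its eyes",
         "the old gods slept beneath the sea", "nevermore said the bird",
         "my wretched monster wept alone", "the cult chanted in the dark"]
labels = ["EAP", "MWS", "HPL", "EAP", "MWS", "HPL"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

def fit_and_score(model, X, y):
    # Each worker process trains one ensemble member independently
    # and evaluates it (here on the training set, for brevity).
    model.fit(X, y)
    return model, model.score(X, y)

models = [LogisticRegression(max_iter=1000),
          MultinomialNB(),
          RandomForestClassifier(n_estimators=10, random_state=0)]

# n_jobs=3: one worker per shallow model, fitting a 4-core machine.
results = Parallel(n_jobs=3)(
    delayed(fit_and_score)(m, X, labels) for m in models)
```

Because the models share no state, each `fit_and_score` call can run in its own process; joblib handles pickling the model and the feature matrix to the workers.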

Because of a lack of access to a distributed cluster (Hadoop/Spark) or a GPU server, the implementation is limited to a single 4-core CPU machine using Python's multiprocessing framework with the joblib wrapper. For the text classification problem, the task of identifying authors from excerpts of books they wrote was chosen. The dataset defines three authors: Edgar Allan Poe, Mary Shelley, and H.P. Lovecraft, and comes from the Kaggle competition at https://www.kaggle.com/c/spooky-author-identification . The training dataset contains more than 18,000 records which, while not gigantic, is large enough to simulate a big data problem and experiment with parallelization.
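The parallel preprocessing step could look like the following sketch: each document is cleaned and tokenized independently, so the records can be fanned out across worker processes. The `preprocess` function and the sample sentences are assumptions standing in for the repository's actual preprocessing logic.

```python
import re
from joblib import Parallel, delayed

def preprocess(text):
    # Lowercase and tokenize on alphabetic runs -- a stand-in for
    # whatever cleaning the real pipeline applies per record.
    return re.findall(r"[a-z']+", text.lower())

docs = ["It was a dark and stormy night.",
        "The raven, never flitting, still is sitting.",
        "I beheld the wretch -- the miserable monster."]

# One task per document; joblib batches tasks across up to 4 workers,
# matching the 4-core machine described above.
tokens = Parallel(n_jobs=4)(delayed(preprocess)(d) for d in docs)
```

Since preprocessing is embarrassingly parallel (no shared state between records), the speedup is limited mainly by process start-up and pickling overhead relative to the per-record work.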

The end-to-end algorithm is implemented with Python's multiprocessing framework along with the joblib wrapper, NumPy, pandas, scikit-learn, and Keras.

