This repository contains code samples and materials for Layer's technical guide on "Machine Learning Pipelines with Apache Spark".
## `emr_files`

Contains an end-to-end code sample for building an ML pipeline with Spark on Amazon Elastic MapReduce (EMR) and deploying the Spark pipeline on AWS SageMaker with MLeap as the execution engine.
To run the code on EMR, ensure you use the necessary configuration files to create a cluster with Spark:
- `boostrap_actions.sh` contains the shell script for custom actions run during cluster creation. Upload it to an S3 bucket so you can use its URI to set up custom actions when configuring your cluster.
- `emr_config.json` should go under the classifications for your software configuration.

If you are struggling to find where to use these files, follow this blog post.
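As an illustration, a cluster using both files could be created with the AWS CLI along these lines. This is a sketch, not the repository's exact command: the bucket name, key name, and instance settings are placeholders you would replace with your own.

```shell
# Illustrative only: s3://my-bucket and my-key are placeholders.
aws emr create-cluster \
  --name "spark-ml-pipeline" \
  --release-label emr-5.23.0 \
  --applications Name=Spark Name=Livy \
  --bootstrap-actions Path=s3://my-bucket/boostrap_actions.sh \
  --configurations file://emr_config.json \
  --instance-type m5.xlarge --instance-count 3 \
  --ec2-attributes KeyName=my-key \
  --use-default-roles
```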
The code was tested with the following configurations:
- Release: emr-5.23.0
- Applications: Spark v2.4.0 and Livy v0.5.0
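For reference, a software-configuration classification in `emr_config.json` might look like the following. These are illustrative values, not necessarily the repository's exact file:

```json
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]
```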
To run the code sample on Colab, check out the notebook.
Everything works except MLeap: Colab does not work well with MLeap (<=1.18.0), because MLeap requires Docker to be installed on the local machine, which Colab does not currently support or plan to support anytime soon.
To use MLeap, install both PySpark and MLeap properly on your own machine and work through the notebook starting from the header "If you are using Local Jupyter Notebook".
The dataset is credited to Ronny Kohavi and Barry Becker (from this paper). It was drawn from the 1994 United States Census Bureau data, and the task is to use personal details such as education level to predict whether an individual earns more or less than $50,000 per year.
Concepts in the Colab notebook are heavily inspired by Janani Ravi's course "Building Machine Learning Models in Spark 2".