Chia Yong Jian edited this page May 30, 2017 · 12 revisions

Introduction

This wiki shows how to get a sample Apache Spark Streaming application working. A Twitter feed is used to send messages into a Kafka topic, from which they are read and processed by a Spark Streaming application.
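To make the data flow concrete, here is a minimal sketch that uses only the Python standard library. It is not the guide's actual code: the in-memory `topic` queue stands in for a Kafka topic, `sample_tweets` is made-up data standing in for a Twitter feed, and the loop over small batches only imitates the micro-batch style of Spark Streaming. The function names (`produce`, `consume_batch`, `count_hashtags`) are hypothetical and chosen for illustration.

```python
from collections import Counter, deque

# In-memory stand-in for a Kafka topic (assumption: not a real Kafka client).
topic = deque()


def produce(messages):
    """Producer side: append each tweet-like message to the topic."""
    for message in messages:
        topic.append(message)


def consume_batch(max_messages):
    """Consumer side: drain up to max_messages from the topic as one micro-batch."""
    batch = []
    while topic and len(batch) < max_messages:
        batch.append(topic.popleft())
    return batch


def count_hashtags(batch):
    """Per-batch processing step: count hashtags, ignoring case."""
    counts = Counter()
    for message in batch:
        for word in message.split():
            if word.startswith("#") and len(word) > 1:
                counts[word.lower()] += 1
    return counts


# Made-up sample messages standing in for a Twitter feed.
sample_tweets = [
    "Learning #Spark streaming today",
    "#Kafka and #spark work well together",
    "No hashtags here",
]

produce(sample_tweets)

# Process the topic in micro-batches of two messages each.
batch_counts = []
while topic:
    batch = consume_batch(max_messages=2)
    batch_counts.append(count_hashtags(batch))

for index, counts in enumerate(batch_counts):
    print(f"batch {index}: {dict(counts)}")
```

In the real pipeline, the producer and consumer run as separate processes, and the topic lives on a Kafka broker rather than in memory. The sketch only shows the shape of the flow: messages are produced, buffered, and then processed in small batches.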

To follow the guide, work through the numbered steps in the pages list on the right of this page.

Motivation

I hope this guide gives aspiring Big Data developers and hobbyists a kickstart and a boost of confidence in setting up a simple end-to-end streaming application by themselves, and helps them avoid the many pitfalls I encountered in the process.

Technologies

Cloud platform used:

Technologies used:

Credits

This work is not solely my own. Various websites, tutorials, and individuals were helpful throughout the process:

  1. Mining Twitter Data with Python (Part 1: Collecting data) by Marco Bonzanini
  2. Getting Started with Spark Streaming, Python, and Kafka by Robin Moffatt
  3. Run Jupyter Notebook and JupyterHub on Amazon EMR by Tom Zeng
  4. Using Python 3.4 on EMR Spark Applications by Bruno Faria
  5. Professor Andrew Koh

Other referenced work may appear in the pages of the guide.

Contact

For any feedback or questions, you may contact me at this email address:

chia.yongjian [at] gmail.com

Alternatively, you can raise an issue and I will look into it as soon as possible.