
4. Publishing Twitter Feed into Kafka & Running Spark Streaming application

Chia Yong Jian edited this page May 29, 2017 · 3 revisions

Step 1 - Upload Jupyter Notebooks

There are two Jupyter notebooks to upload:

  1. Twitter to Kafka - Download from this repository
  2. Spark Streaming - Download the notebook from https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/

Step 2 - Enter Twitter credentials and run the code

Open the “Twitter to Kafka” notebook. At the top of the notebook, enter the Twitter credentials you saved previously.

Run the code in the first three steps. After running the third step, you should see a test message appear in the PuTTY terminal.
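The test-message step above can be sketched with the kafka-python library; the broker address and topic name below are assumptions, not values from the notebook:

```python
import json

# Hypothetical broker address and topic name -- adjust to your own setup.
BROKER = "localhost:9092"
TOPIC = "twitter"

def encode_message(payload):
    """Serialise a dict to the UTF-8 JSON bytes the Kafka topic expects."""
    return json.dumps(payload).encode("utf-8")

try:
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(bootstrap_servers=BROKER)
    producer.send(TOPIC, encode_message({"text": "test message"}))
    producer.flush()
except Exception:
    # kafka-python not installed, or no broker reachable in this environment
    pass
```

A `kafka-console-consumer` attached to the same topic in the PuTTY terminal is one way to confirm the message arrived.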

Before running the code in the fourth block, you may choose to add or remove the hashtags to retrieve tweets from. For multiple hashtags, separate them with commas. Once ready, run the code.

Step 3 - Run the Spark Streaming Jupyter notebook

Open the “Getting Started with Spark Streaming with Python and Kafka” notebook. Read the description and run through the steps as described in the notebook. For this iteration of the guide, this notebook will be used "as-is".

Note - There are some issues I have encountered in running this notebook; the following modifications can help work around them:

Issue 1 - Cannot import pyspark when using the Python 3 kernel.

Solution - Use the findspark library to set sys.path and the other environment variables
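A minimal sketch of the findspark workaround, assuming the library is installed (`pip install findspark`) and `SPARK_HOME` points at the Spark installation:

```python
def init_spark():
    """Try to put pyspark on sys.path via findspark; return True on success."""
    try:
        import findspark
        findspark.init()  # locates SPARK_HOME and prepends pyspark to sys.path
        import pyspark    # should now import under the Python 3 kernel
        return True
    except Exception:
        # findspark not installed, or SPARK_HOME not set in this environment
        return False
```

Run `init_spark()` in the first cell of the notebook, before any pyspark imports.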

Issue 2 - Python errors at certain steps, such as "Extract Author name from each tweet"

Solution - Cast the DStream payload back into JSON format. Example: