4. Publishing Twitter Feed into Kafka & Running Spark Streaming application
There are two Jupyter notebooks to upload:
- Twitter to Kafka - Download from this repository
- Spark Streaming - Download the notebook from https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/
Open the “Twitter to Kafka” notebook. At the top of the notebook, enter the Twitter credentials you saved previously.
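The credential cell at the top of the notebook looks roughly like the sketch below; the variable names are assumptions, not necessarily those used in the notebook, and the placeholder strings must be replaced with the keys you saved earlier.

```python
# Hypothetical credential variables; the names are assumptions, not
# necessarily those used in the notebook. Replace the placeholders
# with the Twitter API keys you saved previously.
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
```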
Run the code in the first three steps. After running the code in the third step, you should see the test message appear in the PuTTY terminal.
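The test message can be watched from the PuTTY session with Kafka's console consumer; the topic name and broker address below are assumptions, so substitute the values configured in your own setup:

```shell
# Consume messages from the topic to verify the test message arrived.
# "twitter" and localhost:9092 are placeholders for your configured values.
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic twitter --from-beginning
```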
Before running the code in the fourth block, you may add or remove the hashtags to retrieve tweets for. Separate multiple hashtags with commas. Once ready, run the code.
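Preparing the hashtag filter can be sketched as below; the variable names are assumptions, and `track_list` is what would be handed to the Twitter stream filter inside the notebook.

```python
# Sketch of preparing the hashtag filter (variable names are assumptions).
# Multiple hashtags are given as a single comma-separated string.
hashtag_input = "#bigdata, #kafka, #spark"

# Split on commas and strip whitespace to build the filter list.
track_list = [tag.strip() for tag in hashtag_input.split(",")]

# Inside the notebook, this list would be passed to the Twitter stream,
# e.g. stream.filter(track=track_list).
```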
Open the “Getting Started with Spark Streaming with Python and Kafka” notebook. Read the description and work through the steps as described in the notebook. For this iteration of the guide, the notebook is used as-is.
Note - I encountered some issues while running this notebook; the following modifications can help resolve them:
Issue 1 - Cannot import pyspark when using the Python 3 kernel.
Solution - Use the findspark library to set sys.path and the other environment variables.
Issue 2 - A Python error occurs at certain steps, such as "Extract Author name from each tweet".
Solution - Parse the DStream content as JSON again before extracting fields. Example:
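A hedged sketch of that re-parsing, shown on a single simulated payload rather than a live DStream; the field names mirror Twitter's tweet JSON, and in the notebook the same `json.loads` call would be applied per record inside a DStream `map`:

```python
import json

# Simulated raw tweet as delivered by Kafka: a JSON string, not a dict.
raw = '{"user": {"screen_name": "example_user"}, "text": "hello"}'

# Parse the payload back into a dict before extracting fields; in the
# notebook this would be applied per record, e.g.
# kafkaStream.map(lambda v: json.loads(v[1]))
tweet = json.loads(raw)
author = tweet["user"]["screen_name"]  # extract the author name
```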