There are two ways to consume the data from Kafka: directly with the Kafka console consumer, or through a Spark streaming application running on EMR.
Download Kafka v0.11.0.1 for Scala 2.11 (One example mirror here)
Move the download to the desired location
cd ~/Downloads
mkdir -p /usr/local/Cellar/kafka/0.11.0.1
cp kafka_2.11-0.11.0.1.tgz /usr/local/Cellar/kafka/0.11.0.1
Unzip the archive into the installation path (--strip-components 1 drops the archive's top-level folder so that bin/ sits directly under the install path)
tar -zxvf kafka_2.11-0.11.0.1.tgz -C /usr/local/Cellar/kafka/0.11.0.1 --strip-components 1
Set environment variable KAFKA_HOME to the installation path
export KAFKA_HOME=/usr/local/Cellar/kafka/0.11.0.1
Run the console consumer script
$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server kafkastreaming.capitalonehackathon.com:9092 --topic au_hackathon
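The installation steps above can be combined into a single script. This is a sketch: it assumes the tarball was downloaded to ~/Downloads, and it strips the archive's top-level directory so that bin/ lands directly under KAFKA_HOME.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Install location (matches the path used in the steps above)
KAFKA_HOME=/usr/local/Cellar/kafka/0.11.0.1
export KAFKA_HOME

mkdir -p "$KAFKA_HOME"

# Extract directly into the install path; --strip-components 1 removes the
# archive's top-level kafka_2.11-0.11.0.1/ folder so bin/ sits under $KAFKA_HOME
tar -zxvf ~/Downloads/kafka_2.11-0.11.0.1.tgz -C "$KAFKA_HOME" --strip-components 1

# Attach to the hackathon topic
"$KAFKA_HOME"/bin/kafka-console-consumer.sh \
  --bootstrap-server kafkastreaming.capitalonehackathon.com:9092 \
  --topic au_hackathon
```

The console consumer runs until interrupted (Ctrl-C), printing each record it receives from the topic.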
Clone this repository locally
git clone https://github.com/badrishdavey/au-hackathon-streaming-app.git
cd into the directory
cd au-hackathon-streaming-app
Compile the code into a jar with Maven
mvn clean package
Upload the Spark jar to your team directory in S3
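From the AWS CLI, the upload might look like the following sketch. The bucket name mirrors the one used in the download step later in this guide; `<team>` is a placeholder for your team's directory, and the jar path is Maven's default target/ output.

```shell
# Upload the jar produced by `mvn clean package` to your team directory in S3.
# <team> is a placeholder -- substitute your own team's directory name.
aws s3 cp target/au-hackathon-streaming-0.1.jar s3://auhackathon/<team>/
```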
Create the EMR cluster
Name the cluster after your team
Select Spark as the Software configuration
Select the Hackathon pem file for the EC2 key pair
Wait for the EMR cluster to initialize
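The console steps above can also be scripted with the AWS CLI. This is a sketch only: the release label, instance type, and instance count are assumptions not specified in this guide, and the key pair name is inferred from the pem file used in the SSH step below.

```shell
# Create a Spark EMR cluster named after your team (name is a placeholder).
# Release label, instance type, and count are assumed values -- adjust to
# whatever the hackathon organizers specify.
aws emr create-cluster \
  --name "my-team" \
  --applications Name=Spark \
  --ec2-attributes KeyName=AU_Hackathon \
  --release-label emr-5.8.0 \
  --instance-type m4.large \
  --instance-count 3 \
  --use-default-roles
```

The command prints the new cluster's id (j-...), which you can poll with `aws emr describe-cluster` while waiting for it to initialize.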
Download the Spark jar from S3 onto the EMR master
ssh -i ~/AU_Hackathon.pem hadoop@ec2-54-159-186-164.compute-1.amazonaws.com
aws s3 cp s3://auhackathon/omar/au-hackathon-streaming-0.1.jar .
Navigate to the Steps tab for the EMR cluster and click Add step
Step type: Spark application
Deploy mode: Client
Spark-submit options: --master yarn --class com.test.App
Application location: /home/hadoop/au-hackathon-streaming-0.1.jar
Arguments: ec2-54-174-211-86.compute-1.amazonaws.com:9092 au_hackathon 5
Action on failure: Continue
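The step configuration above corresponds to the following spark-submit invocation, assembled from the same values; as a sketch, it could also be run directly on the master node over SSH instead of adding a console step.

```shell
# Equivalent of the console step: client-mode Spark job on YARN, passing the
# Kafka broker, topic, and batch interval as application arguments.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.test.App \
  /home/hadoop/au-hackathon-streaming-0.1.jar \
  ec2-54-174-211-86.compute-1.amazonaws.com:9092 au_hackathon 5
```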
Wait for the Spark job to start, approximately 5 minutes
Click on View logs
Click on stdout and you should see the DataFrame output printing every few seconds
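If the console log viewer lags, the same output can usually be tailed on the master node. This is a sketch: `<step-id>` is a placeholder for the step id (s-...) shown in the Steps tab, and the log path is the standard EMR step-log location on the master.

```shell
# On the EMR master, step logs are written under /mnt/var/log/hadoop/steps.
# <step-id> is a placeholder; copy the s-... id from the Steps tab.
tail -f /mnt/var/log/hadoop/steps/<step-id>/stdout
```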