
spark-kafka-consumer

Spark application that consumes Kafka events generated by a Python producer.

Architecture

(Architecture diagram)

How to run

  1. Clone the project
git clone https://github.com/cordon-thiago/spark-kafka-consumer
  2. Set the KAFKA_ADVERTISED_HOST_NAME variable inside docker-compose.yml to your Docker host IP. Note: do not use localhost or 127.0.0.1 as the host IP if you want to run multiple brokers. For more information about the variables you can configure for the Kafka Docker image, refer to this repository.

  3. Start the Docker containers with Compose.

cd spark-kafka-consumer/docker
docker-compose up -d

It will start the following services:

  • zookeeper:
    • Image: wurstmeister/zookeeper
    • Port: 2181
  • kafka:
    • Image: wurstmeister/kafka:2.11-1.1.1
    • Port: 9092
  • spark:
    • Image: jupyter/all-spark-notebook
    • Port: 8888
  4. Get the Jupyter Notebook URL + token by accessing the spark container

Access the container's bash shell:

docker exec -it docker_spark_1 bash

Then get the notebook URL and paste it into your browser:

jupyter notebook list
  5. Run the event-producer.ipynb notebook to start producing events from changes to Wikipedia pages into a Kafka topic. More information about the Wikipedia event stream is available here.
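
For reference, the producer logic amounts to reading the Wikimedia EventStreams feed of recent changes and forwarding each event to Kafka. The sketch below is illustrative rather than the notebook's exact code: it assumes the kafka-python and sseclient packages, and the topic name wiki-events and the broker address are placeholders.

import json
from kafka import KafkaProducer
from sseclient import SSEClient

# Wikimedia's server-sent-events feed of recent changes.
WIKI_STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

producer = KafkaProducer(
    bootstrap_servers="<your-docker-host-ip>:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Forward each recent-change event to the Kafka topic.
for event in SSEClient(WIKI_STREAM_URL):
    if event.event == "message" and event.data:
        producer.send("wiki-events", json.loads(event.data))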

  6. Run the event-consumer-spark.ipynb notebook to start consuming events from the Kafka topic and writing them to Parquet files.
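
In outline, the consumer subscribes to the topic with Spark Structured Streaming and appends the raw messages to Parquet. The snippet below is a rough sketch, not the notebook's code: it assumes the spark-sql-kafka connector is available to the Spark session, and the topic name, output path, and checkpoint path are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("event-consumer").getOrCreate()

# Read the Kafka topic as a streaming source.
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<your-docker-host-ip>:9092")
    .option("subscribe", "wiki-events")
    .load())

# Kafka delivers the payload as bytes; cast it to a string column.
messages = events.select(col("value").cast("string").alias("json"))

# Continuously append the messages to Parquet files.
query = (messages.writeStream
    .format("parquet")
    .option("path", "/home/jovyan/work/events")
    .option("checkpointLocation", "/home/jovyan/work/checkpoint")
    .outputMode("append")
    .start())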

  7. Run the data-visualization.ipynb notebook to read the Parquet files as a stream and visualize the top 10 users with the most edits.
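
The visualization step can be approximated as reading the Parquet directory back as a stream, aggregating edit counts per user, and querying the running result from an in-memory table. In the sketch below the schema, paths, and column names are assumptions rather than the notebook's exact code.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.appName("data-visualization").getOrCreate()

# Streaming Parquet sources need an explicit schema; here each row holds the raw event JSON.
parquet_stream = (spark.readStream
    .schema("json STRING")
    .parquet("/home/jovyan/work/events"))

# Count edits per user, extracted from the JSON payload.
edits_per_user = (parquet_stream
    .select(get_json_object(col("json"), "$.user").alias("user"))
    .groupBy("user")
    .count())

# Keep the full aggregate in an in-memory table for interactive queries.
query = (edits_per_user.writeStream
    .format("memory")
    .queryName("edits")
    .outputMode("complete")
    .start())

# Top 10 users by edit count (re-run as new data arrives).
spark.sql("SELECT `user`, `count` FROM edits ORDER BY `count` DESC LIMIT 10").show()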
