Spark application that consumes Kafka events generated by a Python producer.
- Clone the project
git clone https://github.com/cordon-thiago/spark-kafka-consumer
- Set the `KAFKA_ADVERTISED_HOST_NAME` variable inside the `docker-compose.yml` with your docker host IP. Note: do not use `localhost` or `127.0.0.1` as the host IP if you want to run multiple brokers. For more information about the variables you can configure for the Kafka docker image, please refer to this repository.
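For orientation, the kafka service section of a wurstmeister-based `docker-compose.yml` typically looks like the sketch below; the exact keys in this project's file may differ, and `192.168.0.10` is only a placeholder for your docker host IP.

```yaml
kafka:
  image: wurstmeister/kafka:2.11-1.1.1
  ports:
    - "9092:9092"
  environment:
    KAFKA_ADVERTISED_HOST_NAME: 192.168.0.10   # replace with your docker host IP (not localhost/127.0.0.1)
    KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181    # the zookeeper service from the same compose file
```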
- Start docker containers with compose.
cd spark-kafka-consumer/docker
docker-compose up -d
It will start the following services:
- zookeeper:
- Image: wurstmeister/zookeeper
- Port: 2181
- kafka:
- Image: wurstmeister/kafka:2.11-1.1.1
- Port: 9092
- spark:
- Image: jupyter/all-spark-notebook
- Port: 8888
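Optionally, you can confirm that the three containers are up before continuing by running Docker Compose's status command from the same `docker` directory:

```
docker-compose ps
```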
- Get the Jupyter Notebook URL + Token by accessing the spark container.
Access the container bash:
docker exec -it docker_spark_1 bash
Then, list the running notebooks to get the URL, and copy and paste it into your browser:
jupyter notebook list
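If you prefer not to open an interactive shell, the same listing can be fetched in one step from the host (this assumes the container name `docker_spark_1` used above):

```
docker exec docker_spark_1 jupyter notebook list
```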
- Run the `event-producer.ipynb` notebook to start producing events from changes in Wikipedia pages to a Kafka topic. More information about the Wikipedia event can be found here.
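The producer code itself lives in the notebook; purely for orientation, a minimal standalone sketch of the same idea could look like the following. It assumes the `kafka-python` and `sseclient` packages, the public Wikimedia recent-changes stream, and a hypothetical topic name `wikimedia-recentchange`; the notebook's actual packages, topic, and field handling may differ.

```python
import json

from kafka import KafkaProducer   # pip install kafka-python
from sseclient import SSEClient   # pip install sseclient

# Public server-sent-events stream of Wikipedia recent changes
WIKI_STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"

# Hypothetical topic name; use whatever topic the consumer notebook subscribes to
TOPIC = "wikimedia-recentchange"

producer = KafkaProducer(
    bootstrap_servers="<docker-host-ip>:9092",                 # same host IP as KAFKA_ADVERTISED_HOST_NAME
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # send events as JSON bytes
)

for event in SSEClient(WIKI_STREAM):
    # Skip keep-alives and empty payloads
    if event.event != "message" or not event.data:
        continue
    change = json.loads(event.data)
    # Forward only a few fields of interest to the Kafka topic
    producer.send(TOPIC, {
        "user": change.get("user"),
        "title": change.get("title"),
        "wiki": change.get("wiki"),
        "timestamp": change.get("timestamp"),
    })
```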
- Run the `event-consumer-spark.ipynb` notebook to start consuming events from the Kafka topic and write them to parquet files.
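Again as rough orientation only, the core of a PySpark Structured Streaming consumer that reads a Kafka topic and appends it to parquet generally looks like the sketch below. The bootstrap server, topic name, schema, and output paths are assumptions (the notebook defines the real ones), and the Kafka source additionally requires the spark-sql-kafka connector package to be available to the Spark session.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("event-consumer-spark").getOrCreate()

# Assumed schema matching the JSON the producer sends; the notebook's schema may differ
schema = StructType([
    StructField("user", StringType()),
    StructField("title", StringType()),
    StructField("wiki", StringType()),
    StructField("timestamp", LongType()),
])

# Read the Kafka topic as a streaming DataFrame
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<docker-host-ip>:9092")  # same host IP as the producer uses
    .option("subscribe", "wikimedia-recentchange")               # hypothetical topic name
    .option("startingOffsets", "latest")
    .load()
)

# Kafka values arrive as bytes; decode and parse the JSON payload
parsed = (
    events.select(from_json(col("value").cast("string"), schema).alias("event"))
    .select("event.*")
)

# Continuously append the parsed events to parquet files
query = (
    parsed.writeStream.format("parquet")
    .option("path", "/home/jovyan/work/events_parquet")          # assumed output directory
    .option("checkpointLocation", "/home/jovyan/work/checkpoint")
    .outputMode("append")
    .start()
)
```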
- Run the `data-visualization.ipynb` notebook to read the parquet files as a stream and visualize the top 10 users with the most edits.
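Finally, as a sketch under the same assumptions (paths, schema, and table name are illustrative), reading the parquet output back as a stream and extracting a top 10 of editors can be done along these lines; the actual notebook likely adds plotting on top of a similar query.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("data-visualization").getOrCreate()

# Streaming parquet sources need an explicit schema (assumed to match the consumer's output)
schema = StructType([
    StructField("user", StringType()),
    StructField("title", StringType()),
    StructField("wiki", StringType()),
    StructField("timestamp", LongType()),
])

edits = spark.readStream.schema(schema).parquet("/home/jovyan/work/events_parquet")

# Keep a continuously updated count of edits per user in an in-memory table
query = (
    edits.groupBy("user").count()
    .writeStream.outputMode("complete")
    .format("memory")
    .queryName("edits_by_user")   # hypothetical in-memory table name
    .start()
)

# The in-memory table is a normal (non-streaming) table, so ORDER BY / LIMIT work here;
# re-run this query to refresh the top 10 as new events arrive
spark.sql(
    "SELECT `user`, `count` FROM edits_by_user ORDER BY `count` DESC LIMIT 10"
).show()
```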