This project demonstrates real-time data processing with Apache Kafka and PySpark. It streams comments from the NBA subreddit, extracts named entities from the text, writes the entity counts to a Kafka topic, and visualizes them in Kibana after Logstash loads them into Elasticsearch.
Prerequisites
Ensure you have the following installed:
- Python 3.x
- Apache Kafka
- Apache Spark
- Elasticsearch and Kibana
- Logstash
Python Libraries:
- pyspark
- spacy
- pandas
- jupyter
- praw
(json is part of the Python standard library and needs no installation)
These can be installed using pip:
pip install pyspark spacy pandas jupyter praw
Additionally, download the spaCy language model used for named-entity recognition:
python -m spacy download en_core_web_sm
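A quick way to verify the model is installed correctly is to run it on a sample sentence and print the entities it finds (labels such as PERSON and ORG come from en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("LeBron James scored 30 points as the Lakers beat the Celtics.")
for ent in doc.ents:
    # Print each detected entity and its label, e.g. "LeBron James PERSON"
    print(ent.text, ent.label_)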
Setup Instructions
- Start ZooKeeper Service
ZooKeeper is required to run Kafka. Open a terminal and execute:
bin/zookeeper-server-start.sh config/zookeeper.properties
- Start Kafka Server
Open another terminal and start the Kafka server:
bin/kafka-server-start.sh config/server.properties
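If automatic topic creation is disabled on your broker, create the two topics this project uses before streaming any data (the --bootstrap-server flag applies to Kafka 2.2 and later; older releases use --zookeeper instead):

bin/kafka-topics.sh --create --topic topic1 --bootstrap-server localhost:9092
bin/kafka-topics.sh --create --topic topic2 --bootstrap-server localhost:9092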
- Stream Data from Reddit to Kafka
The reddit_to_kafka.py script streams comments from the NBA subreddit to the Kafka topic topic1 (a minimal sketch of the script follows the command). Run it in a new terminal:
python reddit_to_kafka.py
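The following is a minimal sketch of what reddit_to_kafka.py does, assuming kafka-python as the producer library; the Reddit credential values below are placeholders, not the project's actual configuration:

import json
import praw
from kafka import KafkaProducer

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder: your Reddit app credentials
    client_secret="YOUR_CLIENT_SECRET",  # placeholder
    user_agent="nba-ner-stream",         # placeholder
)

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Stream new comments from r/nba and forward each comment body to topic1.
for comment in reddit.subreddit("nba").stream.comments(skip_existing=True):
    producer.send("topic1", {"body": comment.body})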
- Process Data with PySpark
Open writer_topic2.ipynb in Jupyter Notebook to read the comment stream, count named entities, and write the counts to Kafka topic topic2 (a sketch of the core pipeline follows the command):
jupyter notebook writer_topic2.ipynb
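Below is a minimal sketch of the topic1 -> NER counts -> topic2 flow, assuming the notebook uses Structured Streaming with the spark-sql-kafka connector; app, column, and checkpoint names here are illustrative, and the connector version must match your Spark and Scala build:

import spacy
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType

spark = (
    SparkSession.builder
    .appName("ner-counts")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0")
    .getOrCreate()
)

_nlp = None

@F.udf(ArrayType(StringType()))
def extract_entities(text):
    # Load the spaCy model lazily, once per executor process.
    global _nlp
    if _nlp is None:
        _nlp = spacy.load("en_core_web_sm")
    return [ent.text for ent in _nlp(text or "").ents]

# Read comment bodies from topic1 (messages shaped like {"body": "..."}).
comments = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "topic1")
    .load()
    .select(F.get_json_object(F.col("value").cast("string"), "$.body").alias("body"))
)

# Explode each comment into its named entities and keep a running count.
counts = (
    comments
    .select(F.explode(extract_entities("body")).alias("entity"))
    .groupBy("entity")
    .count()
)

# The Kafka sink requires a string `value` column and a checkpoint directory.
query = (
    counts
    .select(F.to_json(F.struct("entity", "count")).alias("value"))
    .writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "topic2")
    .option("checkpointLocation", "/tmp/ner_checkpoint")
    .outputMode("update")
    .start()
)
query.awaitTermination()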
- Start Elasticsearch
Run Elasticsearch from its bin directory:
bin/elasticsearch
- Start Kibana
After starting Elasticsearch, run Kibana:
bin/kibana
- Start Logstash
Logstash pulls the entity counts from Kafka topic topic2 into Elasticsearch. Start it with the provided logstash.conf file:
bin/logstash -f logstash.conf
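For reference, a plausible shape for logstash.conf is shown below, assuming JSON messages on topic2 and a local Elasticsearch; the index name is illustrative, so adjust it to whatever the provided file uses:

input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["topic2"]
    codec => "json"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "ner-counts"
  }
}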
- Visualize Data with Kibana
Once Elasticsearch and Kibana are running, open Kibana in a browser (http://localhost:5601 by default), create an index pattern (or data view) matching the Logstash index, and build visualizations of the entity counts as needed.