This project demonstrates an end-to-end streaming application to generate, process, and store expense data using Kafka, Spark, and Cassandra. It also integrates with PostgreSQL for user data management.
-
Data Generator:
- Generates random expense data for users fetched from a PostgreSQL database.
- Publishes the expense data to Kafka topics.
-
Stream Handler:
- Consumes Kafka streams.
- Processes and writes the expense data to a Cassandra database.
## Project Setup
Ensure the following are installed and properly configured on your Ubuntu system:
- Java 8
- Apache Kafka
- Apache Spark (version 2.4.5)
- Apache Cassandra
- PostgreSQL
- sbt (Scala Build Tool)
https://github.com/ozmen54/SWE307-2023/tree/main/Pro1/data
Ensure PostgreSQL is running and contains a table named emp
with user data.
sudo service postgresql start
Start the Cassandra server:
./bin/cassandra -f
Start Zookeeper:
bin/zookeeper-server-start.sh config/zookeeper.properties
Start Kafka:
bin/kafka-server-start.sh config/server.properties
Create a topic to publish expense data:
bin/kafka-topics.sh --create --topic user-expenses --bootstrap-server localhost:9092
Use the data_generator.py
script to generate and publish expense data:
python3 data_generator.py
Consume messages from Kafka to verify the data:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --include "user_.*_expenses" --from-beginning
-
Package the project using sbt:
sbt package
-
Submit the Spark job:
spark-submit --class StreamHandler \ --master local[*] \ --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,\ com.datastax.cassandra:cassandra-driver-core:4.0.0,\ com.datastax.spark:spark-cassandra-connector_2.11:2.4.3 \ target/scala-2.11/stream-handler_2.11-1.0.jar
# Start PostgreSQL
sudo service postgresql start
# Start Cassandra
./bin/cassandra -f
# Start Zookeeper
bin/zookeeper-server-start.sh config/zookeeper.properties
# Start Kafka
bin/kafka-server-start.sh config/server.properties
# Create Kafka Topic
bin/kafka-topics.sh --create --topic user-expenses --bootstrap-server localhost:9092
# Consume Kafka Messages
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --include "user_.*_expenses" --from-beginning
# Package with sbt
sbt package
# Submit Spark Job
spark-submit --class StreamHandler \
--master local[*] \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,\
com.datastax.cassandra:cassandra-driver-core:4.0.0,\
com.datastax.spark:spark-cassandra-connector_2.11:2.4.3 \
target/scala-2.11/stream-handler_2.11-1.0.jar
This project is part of the Kafka-Spark-Cassandra-Expense-Tracker repository.
This project is open-source and available under the MIT License.