This repository contains the code and documentation for a Big Data project that analyzes, processes, and visualizes large-scale datasets. It also implements a real-time video processing pipeline that streams video frames through Kafka.
The project includes:
- Data Collection: Gathering structured and unstructured data from various sources.
- Data Preprocessing: Cleaning, transforming, and preparing datasets for analysis.
- Big Data Frameworks: Utilizing tools like Hadoop, Spark, and Kafka.
- Video Processing with Kafka: Implementing real-time video processing pipelines to handle large video streams.
- Analysis and Insights: Performing advanced analytics using scalable algorithms.
- Visualization: Creating intuitive visualizations to represent trends and findings.
To run the project, ensure you have the following installed:
- Python 3.8+
- Big Data tools such as Hadoop, Apache Spark, and Apache Kafka.
- Python libraries:
  - Pandas
  - PySpark
  - Matplotlib/Seaborn
  - kafka-python
  - OpenCV (opencv-python) and NumPy, which the producer and consumer scripts import
Install dependencies using pip:
    pip install pandas pyspark matplotlib seaborn kafka-python opencv-python numpy
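After installing, it can help to confirm that the libraries are importable before starting any services. The following is a minimal sketch, not part of the repository: the function name `missing_modules` and the module list are assumptions for illustration. Note that the import names differ from some pip package names (`kafka` comes from kafka-python, `cv2` from opencv-python).

```python
import importlib.util

# Import names (not pip package names) that the project is assumed to use.
# "kafka" is provided by kafka-python and "cv2" by opencv-python.
REQUIRED_MODULES = ["pandas", "pyspark", "matplotlib", "seaborn", "kafka", "cv2", "numpy"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

The check only looks modules up on the import path; it does not import them, so it runs quickly even for heavy packages such as PySpark.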
- Clone the repository:
    git clone https://github.com/ashithapallath/Big-Data-Project.git
    cd Big-Data-Project
- Set up your environment:
  - Configure Hadoop, Spark, and Kafka on your system (refer to their official documentation).
  - Ensure your datasets and video files are stored in the appropriate input directories.
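- Run the Kafka server, then create a topic for video processing. The server command keeps running in the foreground, so run the second command in a separate terminal once the broker is up:
    kafka-server-start.sh config/server.properties
    kafka-topics.sh --create --topic video-stream --bootstrap-server localhost:9092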
- Run the producer to stream video data:
    python src/producer.py
- Run the consumer to process the streamed video data:
    python src/consumer.py
- Explore the visualizations and output files generated in the output directory.
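Two of the steps above fail quietly when skipped: the producer finds no input video, or the scripts cannot connect to the broker. A small preflight script along the following lines could catch both before the producer is started. It is a sketch under assumptions: the helper names, the `data` input directory, and the extension list are hypothetical, and the TCP check only shows that something is listening on the port, not that it is a healthy Kafka broker.

```python
import socket
from pathlib import Path

# Hypothetical set of extensions; adjust to the formats you actually use.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}


def find_videos(directory="data"):
    """Return a sorted list of video files directly inside `directory`."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        path for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS
    )


def broker_reachable(host="localhost", port=9092, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    videos = find_videos()
    print(f"Video files found in data/: {len(videos)}")
    if broker_reachable():
        print("Port 9092 on localhost is accepting connections.")
    else:
        print("Nothing is listening on localhost:9092; is the Kafka server running?")
```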
    Big-Data-Project/
    ├── data/
    ├── output/
    ├── src/
    │   ├── producer.py
    │   └── consumer.py
    ├── main.py
    ├── README.md
    └── requirements.txt
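    # src/producer.py: read a video file and publish each frame to Kafka
    from kafka import KafkaProducer
    import cv2

    producer = KafkaProducer(bootstrap_servers='localhost:9092')
    video = cv2.VideoCapture('data/sample_video.mp4')

    while video.isOpened():
        ret, frame = video.read()
        if not ret:
            break  # end of the video, or a read error
        # JPEG-encode the frame: raw pixel bytes carry no shape information
        # and can exceed Kafka's default message size limit.
        ok, buffer = cv2.imencode('.jpg', frame)
        if ok:
            producer.send('video-stream', buffer.tobytes())

    producer.flush()  # deliver any frames still buffered by the client
    video.release()
    producer.close()

    # src/consumer.py: read frames from Kafka and display them
    from kafka import KafkaConsumer
    import cv2
    import numpy as np

    consumer = KafkaConsumer('video-stream', bootstrap_servers='localhost:9092')

    for message in consumer:
        # Decode the JPEG bytes back into an image array
        data = np.frombuffer(message.value, dtype=np.uint8)
        frame = cv2.imdecode(data, cv2.IMREAD_COLOR)
        if frame is None:
            continue  # skip messages that are not valid images
        # Process the frame here
        cv2.imshow('Video Frame', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cv2.destroyAllWindows()
    consumer.close()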
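When frames travel as raw pixel bytes rather than in a compressed image format, the receiving side cannot recover the frame's dimensions from the payload alone, because a flat byte buffer has no shape. One possible approach, sketched below, is to prefix each message with a small fixed-size header. The header layout and the names `pack_frame` and `unpack_frame` are assumptions made for this illustration and are not part of the repository.

```python
import struct

# Header layout (an assumption of this sketch): three unsigned 32-bit
# big-endian integers holding height, width, and channel count.
HEADER = struct.Struct(">III")


def pack_frame(height, width, channels, pixels):
    """Prefix raw pixel bytes with a fixed-size header describing their shape."""
    expected = height * width * channels
    if len(pixels) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(pixels)}")
    return HEADER.pack(height, width, channels) + pixels


def unpack_frame(payload):
    """Split a payload into ((height, width, channels), pixel_bytes)."""
    shape = HEADER.unpack_from(payload)
    pixels = payload[HEADER.size:]
    if len(pixels) != shape[0] * shape[1] * shape[2]:
        raise ValueError("payload length does not match the shape in its header")
    return shape, pixels


if __name__ == "__main__":
    # A tiny 2x3 frame with 3 channels stands in for a real video frame.
    payload = pack_frame(2, 3, 3, bytes(range(18)))
    shape, pixels = unpack_frame(payload)
    print(shape, len(pixels))
```

With a three-channel OpenCV frame, the producer side would call `pack_frame(*frame.shape, frame.tobytes())`, and the consumer could rebuild the array with `np.frombuffer(pixels, dtype=np.uint8).reshape(shape)`. Keep message size in mind with this approach: a raw 1920x1080 three-channel frame is 6,220,800 bytes, well above kafka-python's default `max_request_size` of 1,048,576 bytes, so either the producer and broker limits need raising or the frames need compressing or downscaling.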
Contributions are welcome!
- Fork the repository.
- Submit a pull request with your improvements or features.
- Report issues or bugs in the Issues section.
This project is licensed under the MIT License.
Special thanks to the open-source community and contributors of big data tools like Apache Hadoop, Apache Spark, Apache Kafka, and Python libraries for enabling seamless data processing and analysis.