Big Data Project

This repository contains the code and documentation for a Big Data project, focusing on analyzing, processing, and visualizing large-scale datasets. The project includes an implementation of video processing using Kafka for handling real-time data streams.

Features

The project includes:

Data Collection: Gathering structured and unstructured data from various sources.
Data Preprocessing: Cleaning, transforming, and preparing datasets for analysis.
Big Data Frameworks: Utilizing tools like Hadoop, Spark, and Kafka.
Video Processing with Kafka: Implementing real-time video processing pipelines to handle large video streams.
Analysis and Insights: Performing advanced analytics using scalable algorithms.
Visualization: Creating intuitive visualizations to represent trends and findings.

Prerequisites

To run the project, ensure you have the following installed:

Python 3.8+
Big Data tools such as Hadoop, Apache Spark, and Apache Kafka.
Python libraries:
- Pandas
- PySpark
- Matplotlib/Seaborn
- Kafka-python

Install dependencies using pip:

pip install pandas pyspark matplotlib seaborn kafka-python

How to Use

Clone the repository:

git clone https://github.com/ashithapallath/Big-Data-Project.git  
cd Big-Data-Project

Set up your environment:
- Configure Hadoop, Spark, and Kafka on your system (refer to their official documentation).
- Ensure your datasets and video files are stored in the appropriate input directories.

Run the Kafka server and create a topic for video processing:

kafka-server-start.sh config/server.properties  
kafka-topics.sh --create --topic video-stream --bootstrap-server localhost:9092

Run the producer to stream video data:
```
python src/producer.py  
```
Run the consumer to process the streamed video data:
```
python src/consumer.py  
```
Explore the visualizations and output files generated in the output directory.

Project Structure

Big-Data-Project/  
├── data/                 
├── output/                
├── src/                   
│   ├── producer.py       
│   ├── consumer.py        
├── main.py                
├── README.md               
└── requirements.txt

Example Video Processing

Kafka Producer Example:

from kafka import KafkaProducer  
import cv2  

producer = KafkaProducer(bootstrap_servers='localhost:9092')  
video = cv2.VideoCapture('data/sample_video.mp4')  

while video.isOpened():  
    ret, frame = video.read()  
    if not ret:  
        break  
    producer.send('video-stream', frame.tobytes())  

video.release()  
producer.close()

Kafka Consumer Example:

from kafka import KafkaConsumer  
import cv2  
import numpy as np  

consumer = KafkaConsumer('video-stream', bootstrap_servers='localhost:9092')  

for message in consumer:  
    frame = np.frombuffer(message.value, dtype=np.uint8)  
    # Process the frame here  
    cv2.imshow('Video Frame', frame)  
    if cv2.waitKey(1) & 0xFF == ord('q'):  
        break  

cv2.destroyAllWindows()  
consumer.close()

Contribution

Contributions are welcome!

Fork the repository.
Submit a pull request with your improvements or features.
Report issues or bugs in the Issues section.

License

This project is licensed under the MIT License.

Acknowledgments

Special thanks to the open-source community and contributors of big data tools like Apache Hadoop, Apache Spark, Apache Kafka, and Python libraries for enabling seamless data processing and analysis.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Big Data Project

Features

Prerequisites

How to Use

Project Structure

Example Video Processing

Kafka Producer Example:

Kafka Consumer Example:

Contribution

License

Acknowledgments

Files

README.md

Latest commit

History

README.md

File metadata and controls

Big Data Project

Features

Prerequisites

How to Use

Project Structure

Example Video Processing

Kafka Producer Example:

Kafka Consumer Example:

Contribution

License

Acknowledgments