This repository contains the implementation of a Big Data Analytics course project. The course was conducted at the Warsaw University of Technology during the 2021/2022 winter semester.
The goal of the project was to create a system that analyzes data from two streaming data sources and presents it to the user through a web interface. We chose the following data sources:
- Air pollution from aqicn.org
- Weather from OpenWeather
and set a goal to predict air pollution based on weather.
The system was designed in compliance with the principles of the lambda architecture. An overview of the system architecture can be found below:
The speed layer contains a single component: Apache Spark. This module connects directly to the Apache Kafka component and uses Spark streaming DataFrames to fetch the data. It runs two jobs:
- Machine Learning job
- Real-time on-demand predictions API job
Any output data (model evaluation, model parameters) is sent back to Kafka to be stored in Hadoop. The model being trained is a multilayer perceptron classifier.
The batch layer contains three components:
- Apache NiFi
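As an illustration of how evaluation output could be packaged before it is sent back to Kafka, here is a minimal sketch in plain Python. The function name, the field names, and the sample values are hypothetical and are not taken from this repository; the actual producer call is left out, since Kafka producers generally accept the payload as bytes.

```python
import json


def build_evaluation_message(model_name, accuracy, layers):
    """Serialize model evaluation output as UTF-8 encoded JSON bytes."""
    # Hypothetical message schema; the project's real schema may differ.
    payload = {"model": model_name, "accuracy": accuracy, "layers": layers}
    return json.dumps(payload).encode("utf-8")


# Illustrative values only, not results from the project.
message = build_evaluation_message("mlp-classifier", 0.87, [4, 8, 3])
```

Keeping the serialization in one small function makes it easy to store the same message format in Hadoop and to read it back later.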
- Apache Hadoop
- Apache Hive
Apache NiFi is responsible for data preprocessing, e.g.:
- Appending timestamps and IDs to JSON data so that it can be stored in Hadoop
- Routing flowfiles to appropriate components
- Merging multiple JSON flowfiles into one for more efficient storage in Hadoop
- Converting flowfiles to ORC format
- Feeding data to Hadoop component
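NiFi performs these steps through configured processors rather than code, but the first and third transformations listed above can be sketched in plain Python to show what happens to the data. The function names and the `id`, `ingested_at`, `city`, and `pm25` fields are assumptions made for this example, not the project's actual schema.

```python
import json
import uuid
from datetime import datetime, timezone


def enrich(record):
    """Return a copy of the record with a unique id and an ingestion timestamp."""
    enriched = dict(record)  # copy, so the input record is left untouched
    enriched["id"] = str(uuid.uuid4())
    enriched["ingested_at"] = datetime.now(timezone.utc).isoformat()
    return enriched


def merge(records):
    """Merge several JSON records into one newline-delimited payload."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


# Hypothetical sample readings.
readings = [{"city": "Warsaw", "pm25": 18}, {"city": "Warsaw", "pm25": 21}]
batch = merge([enrich(r) for r in readings])
```

Merging many small records into one payload mirrors the reason given above: fewer, larger files are stored more efficiently in Hadoop than many small ones.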
Apache Hadoop is used to store all the data ever seen by the system. Once a day, the data is aggregated and sent to MongoDB:
- Storing all the raw data allows for broad analysis through Apache Hive
- Aggregated data, which will be presented to the end user, is recomputed for each day, once a day, so as not to overload the master dataset
The serving layer consists of two components:
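The daily aggregation can be pictured with a small sketch in plain Python. It assumes each raw record carries an ISO 8601 `timestamp` and a numeric `pm25` field; both names, the function name, and the sample readings are hypothetical, and the real job may aggregate different columns.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_daily(records):
    """Group readings by calendar day and compute the mean pm25 and record count."""
    by_day = defaultdict(list)
    for record in records:
        day = datetime.fromisoformat(record["timestamp"]).date().isoformat()
        by_day[day].append(record["pm25"])
    return {
        day: {"mean_pm25": mean(values), "count": len(values)}
        for day, values in sorted(by_day.items())
    }


# Hypothetical sample readings.
raw = [
    {"timestamp": "2022-01-10T08:00:00", "pm25": 20},
    {"timestamp": "2022-01-10T20:00:00", "pm25": 30},
    {"timestamp": "2022-01-11T08:00:00", "pm25": 12},
]
daily = aggregate_daily(raw)
```

Only the compact per-day result would need to be written to MongoDB, while the raw records stay in the master dataset.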
- React frontend
- MongoDB database
The user interface connects to the MongoDB database through an auxiliary Docker container and displays the latest data aggregations. It also displays a real-time view of model evaluation metrics. A preview video of the frontend can be found here.