Predicting air pollution based on weather
Big Data Analytics course project

This repository contains the implementation of a Big Data Analytics course project. The course was conducted at the Warsaw University of Technology during the 2021/2022 winter semester.

Project goals description

The goal of the project was to create a system that analyzes data from two streaming data sources and presents it to the user through a web interface. We chose the following data sources:

Air pollution from aqicn.org
Weather from OpenWeather

and set a goal to predict air pollution based on weather.

System architecture

The system was designed in compliance with lambda architecture principles. An overview of the system architecture can be found below:

Speed layer

The speed layer contains a single component: Apache Spark. This module is connected directly to the Apache Kafka component and uses Spark Streaming DataFrames to fetch the data. It runs two jobs:

Machine Learning job
Real-time on-demand predictions API job
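As a rough illustration of how a Spark job can fetch data from Kafka as a streaming DataFrame, here is a minimal PySpark sketch. It is not taken from this repository: the function names, the broker address, and the topic name are hypothetical, and it assumes that an existing SparkSession is passed in and that the Spark Kafka connector package is available to it.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Build options for Spark's built-in Kafka source (values are assumptions)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def read_topic(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame with the JSON payload of each Kafka record.

    `spark` is an existing SparkSession created by the caller.
    """
    options = kafka_source_options(bootstrap_servers, topic)
    raw = spark.readStream.format("kafka").options(**options).load()
    # Kafka delivers values as bytes, so cast the payload to a string.
    return raw.selectExpr("CAST(value AS STRING) AS json", "timestamp")


# Hypothetical usage: stream = read_topic(spark, "kafka:9092", "weather")
```

Keeping the options in a separate helper makes the connection settings easy to inspect without starting a Spark session.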

Any output data (model evaluation, model parameters) is sent back to Kafka to be stored in Hadoop. The model being trained is a multilayer perceptron classifier.
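To show what a multilayer perceptron classifier computes at prediction time, the sketch below implements a forward pass in plain Python. It is an illustration only, not the project's training code: the layer sizes, weights, and the two input features are made up, and sigmoid hidden layers with a softmax output are assumed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer; weights[j] holds the input weights of unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_predict(features, layers):
    """Return (predicted class index, class probabilities).

    `layers` is a list of (weights, biases) pairs; all but the last are hidden.
    """
    activation = features
    for weights, biases in layers[:-1]:
        activation = [sigmoid(z) for z in dense(activation, weights, biases)]
    weights, biases = layers[-1]
    probabilities = softmax(dense(activation, weights, biases))
    return probabilities.index(max(probabilities)), probabilities


# Toy network: 2 inputs -> 2 hidden units -> 2 classes (weights are made up).
toy_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
]
label, probs = mlp_predict([2.0, -2.0], toy_layers)
```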

Batch layer

The batch layer contains three components:

Apache NiFi
Apache Hadoop
Apache Hive

Apache NiFi is responsible for data preprocessing, e.g.:

Appending timestamps and IDs to JSON data so that it can be stored in Hadoop
Routing flowfiles to appropriate components
Merging multiple JSON flowfiles into one for more efficient storage in Hadoop
Converting flowfiles to ORC format
Feeding data to Hadoop component
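In the project these steps are performed by NiFi processors. Purely to illustrate the first and third steps, the following Python sketch enriches a record with an ID and a timestamp and merges several records into one payload. The field names `id` and `ingested_at` and the newline-delimited output are assumptions, not the schema used by the NiFi flow.

```python
import json
import uuid
from datetime import datetime, timezone


def enrich(record, now=None):
    """Return a copy of the record with a unique id and an ingestion timestamp."""
    now = now or datetime.now(timezone.utc)
    return {**record, "id": str(uuid.uuid4()), "ingested_at": now.isoformat()}


def merge(records):
    """Merge many small JSON records into one newline-delimited payload."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


# Hypothetical readings; the real payloads come from the two data sources.
readings = [{"pm25": 12}, {"pm25": 15}]
payload = merge([enrich(r) for r in readings])
```

Writing one larger file instead of many small ones matters because HDFS handles a few large files more efficiently than many tiny ones.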

Apache Hadoop is used to store all the data ever seen by the system. Once a day, the data is aggregated and sent to MongoDB:

Storing all the raw data allows for broad analysis through Apache Hive
Aggregated data, which is presented to the end user, is recomputed once a day for each day of data, so as not to overload the master dataset
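The source does not state which aggregations are computed, so the sketch below assumes a simple daily mean to illustrate the idea of a per-day recomputation. The record layout (an ISO 8601 `timestamp` field and a numeric `pm25` field) is hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_daily(records, field):
    """Average `field` per calendar day for records with an ISO 8601 timestamp."""
    by_day = defaultdict(list)
    for record in records:
        day = datetime.fromisoformat(record["timestamp"]).date().isoformat()
        by_day[day].append(record[field])
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Made-up sample data for illustration only.
sample = [
    {"timestamp": "2022-01-01T08:00:00", "pm25": 10.0},
    {"timestamp": "2022-01-01T20:00:00", "pm25": 20.0},
    {"timestamp": "2022-01-02T08:00:00", "pm25": 30.0},
]
daily = aggregate_daily(sample, "pm25")
```

Each resulting day-to-value pair could then be written to the serving database as one small document, which keeps reads by the frontend cheap.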

Serving layer

The serving layer consists of two components:

React frontend
MongoDB database

The user interface connects to the MongoDB database through an auxiliary Docker container and displays the latest data aggregations. It also displays a real-time view of model evaluation metrics. A preview video of the frontend can be found here.

