This is a personal data engineering project based on a hotel reviews Kaggle dataset.
Below you can find some instructions to understand the project content. Feel free to ⭐ and clone this repo 😉
- Data Analysis & Exploration: SQL/Python
- Cloud: Google Cloud Platform
- Data Lake: Google Cloud Storage
- Data Warehouse: BigQuery
- Infrastructure as Code (IaC): Terraform
- Workflow Orchestration: Prefect
- Distributed Processing: Spark
- Data Transformation: dbt
- Data Visualization: Looker Studio
- CI/CD: Git, dbt
The project has been structured with the following folders and files:
- .github: contains the CI/CD files (GitHub Actions)
- data: raw dataset, saved parquet files and data processed using Spark
- dbt: data transformation and CI/CD pipeline using dbt
- flows: workflow orchestration pipeline
- images: printouts of results
- looker: reports from Looker Studio
- notebooks: EDA performed at the beginning of the project to establish a baseline
- spark: batch processing pipeline using Spark
- terraform: IaC stream-based pipeline infrastructure in GCP using Terraform
- Makefile: set of execution tasks
- .pre-commit-config.yaml: pre-commit configuration file
- pre-commit.md: readme file of the pre-commit hooks
- pyproject.toml: linting and formatting
- requirements.txt: project requirements
The dataset was obtained from Kaggle and contains various columns with hotel details and reviews from 6 countries ('Austria', 'France', 'Italy', 'Netherlands', 'Spain', 'UK'). To prepare the data, an Exploratory Data Analysis was conducted. The following actions are performed using either pandas or Spark to get a clean dataset:
- Remove rows with NaN
- Remove duplicates
- Create a new column with the country name
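The three cleaning steps above can be sketched with pandas as follows. This is a minimal illustration, not the code from the notebooks: the column name `Hotel_Address`, the new column name `Country`, and the assumption that each address ends with the country name (with 'United Kingdom' mapped to 'UK') are all hypothetical and should be adapted to the actual dataset.

```python
import pandas as pd

# Assumption: every hotel address ends with one of these country names.
COUNTRIES = ["Austria", "France", "Italy", "Netherlands", "Spain", "United Kingdom"]


def clean_reviews(df: pd.DataFrame, address_col: str = "Hotel_Address") -> pd.DataFrame:
    """Drop rows with NaN, drop duplicates, and add a country column."""
    # Steps 1 and 2: remove rows containing NaN, then exact duplicate rows.
    cleaned = df.dropna().drop_duplicates().copy()

    # Step 3: take the country from the end of the address (hypothetical rule).
    pattern = r"(" + "|".join(COUNTRIES) + r")$"
    country = cleaned[address_col].str.strip().str.extract(pattern, expand=False)
    cleaned["Country"] = country.replace({"United Kingdom": "UK"})

    return cleaned.reset_index(drop=True)
```

Rows whose address does not end with one of the listed names would get a missing value in `Country`, so a real pipeline may want to check for that before ingestion.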
Afterwards, a subset of columns is selected and the final clean data are ingested into a GCP bucket and BigQuery. This is done using either Prefect (see flows folder), dbt (see dbt folder) or Spark (see spark folder).
Finally, to streamline the development process, a fully automated CI/CD pipeline was created using GitHub Actions and dbt.
The Python version used for this project is Python 3.9.
- Clone the repo (or download it as zip):
    git clone https://github.com/benitomartin/de-hotel-reviews.git
- Create the virtual environment named main-env using Conda with Python version 3.9, then activate it:
    conda create -n main-env python=3.9
    conda activate main-env
- Install the project dependencies listed in requirements.txt, using either of the following commands:
    pip install -r requirements.txt
    make install
- Install Terraform:
    conda install -c conda-forge terraform
Each project folder contains a README.md file with instructions about how to run the code. I highly recommend creating a virtual environment for each one. Additionally, please note that a GCP Account, credentials, and proper IAM roles are necessary for the scripts to function correctly. The following IAM Roles have been used for this project:
- BigQuery Admin
- BigQuery Data Editor
- BigQuery Job User
- BigQuery User
- Dataproc Administrator
- Storage Admin
- Storage Object Admin
- Storage Object Creator
- Storage Object Viewer
- Viewer
The following best practices have been implemented:
- ✅ Makefile
- ✅ CI/CD pipeline
- ✅ Linter and code formatter
- ✅ Pre-commit hooks