This setup demonstrates a Spark-based data processing pipeline on Hadoop and HDFS, following the Medallion Architecture, which consists of:
- Bronze Layer: Raw unstructured data.
- Silver Layer: Cleaned and transformed Parquet files.
- Gold Layer: Delta tables for analytics and updates.
The pipeline runs on OpenStack, with data stored in HDFS and processed using Spark.
To start the setup, ensure you have:
- Access to the OpenStack environment.
- An SSH key to log into the master node.
- Hadoop and Spark installed and running.
Data available in the HDFS paths:
1- Bronze:
/user/ubuntu/FinalProject/Unstructured_Price_paid_records
2- Silver:
/user/ubuntu/FinalProject/Parquet_Clean_Data
3- Gold:
/user/ubuntu/FinalProject/Delta_trainData
/user/ubuntu/FinalProject/Delta_testData
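For orientation, the sketch below shows how data could move through these layers with PySpark. It is illustrative only: the raw file format (CSV with a header), the cleaning rules (drop nulls and duplicates), the 80/20 train/test split, and the Delta Lake session settings are assumptions, not taken from the project code.

# medallion_sketch.py -- illustrative only; format, cleaning rules, and split are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("MedallionSketch")
    # Assumes the Delta Lake package is available on the cluster.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

base = "/user/ubuntu/FinalProject"

# Bronze -> Silver: read the raw records (CSV with header is an assumption),
# apply basic cleaning, and persist as Parquet.
bronze_df = spark.read.csv(f"{base}/Unstructured_Price_paid_records",
                           header=True, inferSchema=True)
silver_df = bronze_df.dropna().dropDuplicates()
silver_df.write.mode("overwrite").parquet(f"{base}/Parquet_Clean_Data")

# Silver -> Gold: split into train/test sets and persist as Delta tables.
train_df, test_df = silver_df.randomSplit([0.8, 0.2], seed=42)
train_df.write.format("delta").mode("overwrite").save(f"{base}/Delta_trainData")
test_df.write.format("delta").mode("overwrite").save(f"{base}/Delta_testData")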
- Connect to the master node over SSH:
ssh ubuntu@152.94.163.65 -i ~/.ssh/ssh_key.pem
- List the project directory in HDFS:
hdfs dfs -ls /user/ubuntu/FinalProject
- Check the raw data in the Bronze Layer:
hdfs dfs -ls /user/ubuntu/FinalProject/Unstructured_Price_paid_records
- Verify the cleaned data in the Silver Layer:
hdfs dfs -ls /user/ubuntu/FinalProject/Parquet_Clean_Data
- Inspect the Delta tables in the Gold Layer:
Training:
hdfs dfs -ls /user/ubuntu/FinalProject/Delta_trainData
Testing:
hdfs dfs -ls /user/ubuntu/FinalProject/Delta_testData
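Beyond directory listings, the Silver and Gold data can also be opened directly from a Spark shell or job to check schemas and row counts. The snippet below is a sketch; it assumes the Delta Lake package is configured on the session, and the app name is arbitrary.

# inspect_layers.py -- sketch for checking schemas and row counts per layer.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("InspectLayers")
    # Assumes the Delta Lake package is available on the cluster.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

base = "/user/ubuntu/FinalProject"

# Silver layer: Parquet files carry their schema, so a quick read confirms it.
silver = spark.read.parquet(f"{base}/Parquet_Clean_Data")
silver.printSchema()
print("Silver rows:", silver.count())

# Gold layer: Delta tables for training and testing.
train = spark.read.format("delta").load(f"{base}/Delta_trainData")
test = spark.read.format("delta").load(f"{base}/Delta_testData")
print("Train rows:", train.count(), "| Test rows:", test.count())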
- Start the Spark History Server to monitor jobs:
$SPARK_HOME/sbin/start-history-server.sh
- Access the Spark UI:
http://152.94.163.65:18080
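The History Server only shows applications that wrote event logs. If your jobs do not appear in the UI, the settings below are a sketch of what to enable per application (they can also be set cluster-wide in spark-defaults.conf); the HDFS log directory shown here is an assumption and must match the directory the History Server is configured to read.

# history_logging_sketch.py -- event-log settings; the log directory is an assumed path.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("JobWithHistory")
    .config("spark.eventLog.enabled", "true")
    # Assumed path; must match spark.history.fs.logDirectory on the History Server.
    .config("spark.eventLog.dir", "hdfs:///user/ubuntu/spark-logs")
    .getOrCreate()
)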
If you encounter issues, consider the following:
- If HDFS is not running, start it:
start-dfs.sh
- Check which Hadoop-related services are active:
systemctl list-units --type=service | grep hadoop
- Restart HDFS if the services look unhealthy:
$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh
- Check the overall HDFS status and DataNode report:
hdfs dfsadmin -report
- Confirm the raw data path still exists:
hdfs dfs -ls /user/ubuntu/FinalProject/Unstructured_Price_paid_records/
- If HDFS is stuck in safe mode, leave it:
hdfs dfsadmin -safemode leave
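Once HDFS is healthy again, a quick way to confirm that Spark can reach the data is a minimal read such as the sketch below. Reading the Bronze directory as plain text avoids any assumptions about the raw file format; the app name is arbitrary.

# hdfs_smoke_test.py -- minimal check that Spark can read from HDFS after the restart steps above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HdfsSmokeTest").getOrCreate()
path = "/user/ubuntu/FinalProject/Unstructured_Price_paid_records"

# Reading as plain text makes no assumptions about the schema of the raw files.
raw = spark.read.text(path)
print("Files readable, first lines:")
raw.show(5, truncate=False)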