
Industry-Level Azure Data Engineering Project

This project demonstrates an end-to-end Azure Data Engineering solution based on a common business use case. The goal is to transfer data from an Amazon S3 bucket to an Azure Blob container using Azure Data Factory. Subsequently, we'll leverage Azure Databricks to mount the Azure Blob and perform data analytics using Spark SQL.

Business use case:

Data arrives in an Amazon S3 location from external sources, and we want to move it to an Azure Blob container using an Azure Data Factory pipeline. Once the data is transferred, we need to mount it in Databricks and analyze it with Spark SQL.

Architecture

Project Workflow

  1. Create Amazon S3 Bucket: Set up an Amazon S3 bucket to act as the source for our data.
  2. Create Azure Blob Storage: Set up an Azure Blob Storage account to store the data transferred from Amazon S3.
  3. Azure Data Factory Pipeline: Develop an Azure Data Factory pipeline to move data efficiently from the Amazon S3 bucket to the Azure Blob container.
  4. Mount Azure Blob Storage to Databricks: Establish a connection between Azure Databricks and Azure Blob Storage to facilitate data access.
  5. Data Analytics with Spark SQL: Leverage Spark SQL on Azure Databricks to perform advanced analytics on the mounted data.

Steps to Replicate:

1. Amazon S3 Bucket Setup:

  • Set up an Amazon S3 bucket using the AWS Management Console.
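
  • If you prefer scripting this step, the bucket can also be created with boto3. A minimal sketch, assuming AWS credentials are already configured; the bucket name, region, and file name below are placeholders:

    import boto3

    # Placeholder names; replace with your own bucket, region, and dataset.
    BUCKET_NAME = "superstore-source-data"
    REGION = "us-east-1"

    s3 = boto3.client("s3", region_name=REGION)

    # Create the bucket (us-east-1 does not accept a LocationConstraint).
    if REGION == "us-east-1":
        s3.create_bucket(Bucket=BUCKET_NAME)
    else:
        s3.create_bucket(
            Bucket=BUCKET_NAME,
            CreateBucketConfiguration={"LocationConstraint": REGION},
        )

    # Upload the source dataset that the ADF pipeline will later copy.
    s3.upload_file("superstore.csv", BUCKET_NAME, "raw/superstore.csv")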

2. Azure Blob Storage Setup:

  • Follow the Azure documentation to create a Blob Storage account.
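
  • Once the account exists, the raw container used later for mounting can also be created programmatically. A minimal sketch with the azure-storage-blob SDK, assuming the account's connection string is available as an environment variable:

    import os
    from azure.storage.blob import BlobServiceClient

    # Assumes AZURE_STORAGE_CONNECTION_STRING points at the storage account
    # (blobstoragesuperstore) created in the portal.
    conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    service = BlobServiceClient.from_connection_string(conn_str)

    # Create the container that Azure Data Factory will write into
    # and that Databricks will later mount as /mnt/raw.
    service.create_container("raw")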

3. Azure Data Factory Pipeline:

  • Develop an Azure Data Factory pipeline to copy data from Amazon S3 to Azure Blob.

  • Utilize appropriate connectors and ensure data integrity during the transfer.
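
  • The copy pipeline itself (Amazon S3 source connector, Azure Blob sink) is typically authored in ADF Studio. As a rough sketch, an existing pipeline can be triggered and monitored from Python with the azure-mgmt-datafactory SDK; the subscription, resource group, factory, and pipeline names below are placeholders:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    # Placeholder identifiers; substitute your own Azure resources.
    SUBSCRIPTION_ID = "<subscription-id>"
    RESOURCE_GROUP = "rg-superstore"
    FACTORY_NAME = "adf-superstore"
    PIPELINE_NAME = "copy_s3_to_blob"

    adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    # Kick off a run of the copy pipeline authored in ADF Studio.
    run = adf.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, PIPELINE_NAME)

    # Check the run status (poll until it reaches a terminal state).
    status = adf.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, run.run_id)
    print(status.status)  # e.g. "InProgress", "Succeeded", "Failed"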

4. Mount Azure Blob to Databricks:

  • Configure Azure Databricks to mount the Azure Blob Storage as a DBFS (Databricks File System) directory.

    # Mount the "raw" container from the blobstoragesuperstore storage account
    # at /mnt/raw. Replace ACCESS_KEY with the storage account access key
    # (ideally read it from a Databricks secret scope instead of hard-coding it).
    dbutils.fs.mount(
        source="wasbs://raw@blobstoragesuperstore.blob.core.windows.net",
        mount_point="/mnt/raw",
        extra_configs={
            "fs.azure.account.key.blobstoragesuperstore.blob.core.windows.net": "ACCESS_KEY"
        },
    )
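
  • After mounting, a quick listing confirms that the files copied by the pipeline are visible under the mount point:

    # List the mounted container to verify the copied files are present.
    display(dbutils.fs.ls("/mnt/raw"))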

5. Spark SQL Analytics:

  • Leverage Spark SQL on Databricks to analyze and derive insights from the transferred data.
  • Please refer to the SQL notebook for the Spark SQL queries.
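
  • As a minimal sketch of the pattern used in the notebook (the file name and column names below are assumptions), the mounted CSV can be registered as a temporary view and queried with Spark SQL:

    # Read the copied CSV from the mounted container (file name is a placeholder).
    df = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("/mnt/raw/superstore.csv"))

    # Register a temporary view so the data can be queried with Spark SQL.
    df.createOrReplaceTempView("superstore")

    # Example query: total sales by category (column names are assumptions).
    spark.sql("""
        SELECT Category, ROUND(SUM(Sales), 2) AS total_sales
        FROM superstore
        GROUP BY Category
        ORDER BY total_sales DESC
    """).show()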

Additional Resources:

Feel free to contribute, provide feedback, or adapt this project to suit your specific use cases! 🚀🔍 #Azure #DataEngineering #SparkSQL #ETL #Databricks
