This project is an automated ETL pipeline (Extract, Transform, Load) designed to extract commodity data (specifically Gold and Silver) from a free public API, transform the data, and load it into a structured format for analysis using AWS services. The pipeline runs on a weekly schedule, pulling fresh data using AWS Lambda functions, and storing both raw and transformed data in Amazon S3. AWS Glue is used to infer the schema, and Amazon Athena is used to query the data.
The ETL pipeline has three main phases: Extract, Transform, and Load.
- A Lambda function (`commodities_data_extraction`) pulls raw commodity data (Gold and Silver prices) from a free API and stores it in the S3 bucket under the folder `raw_data/to_process`.
- CloudWatch triggers this function to run on a weekly schedule.
- When new data is added to `raw_data/to_process`, an S3 event trigger fires, invoking the second Lambda function (`commodities_data_transform_and_load`).
- This function processes and transforms the raw data into a clean, structured format.
- The transformed data is stored in `transformed_data/` in S3.
- An AWS Glue Crawler is set up to infer the schema of the transformed data in S3.
- The schema is stored in the AWS Glue Data Catalog, which makes the data queryable through Amazon Athena.
- The pipeline is now ready for querying and analysis using SQL.
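As a rough illustration of the querying step, the sketch below builds the arguments for Athena's `StartQueryExecution` call. The database name, table name, and results location are hypothetical placeholders, not values from this repository; the real table name depends on what the Glue Crawler registers.

```python
def build_athena_request(database: str, table: str, output_location: str) -> dict:
    """Build keyword arguments for Athena's StartQueryExecution API call."""
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT 10',
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_athena_request(
    database="commodities_db",                   # hypothetical Glue database name
    table="transformed_data",                    # hypothetical table created by the crawler
    output_location="s3://my-athena-results/",   # hypothetical results bucket
)

# With boto3 installed and AWS credentials configured, the request could be sent with:
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
```

Keeping the request as a plain dictionary makes it easy to inspect before anything is sent to AWS.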
- `raw_data/to_process`: New raw data from the API lands here.
- `raw_data/processed`: Once processed, the raw data is moved to this folder.
- `transformed_data/`: Data that has been transformed and is ready for querying.
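S3 has no rename operation, so moving a raw file from `raw_data/to_process` to `raw_data/processed` is usually a copy followed by a delete. The sketch below shows one way this could look; the helper names are hypothetical, and `s3_client` is assumed to be a boto3 S3 client (or anything exposing `copy_object` and `delete_object`).

```python
RAW_PREFIX = "raw_data/to_process/"
DONE_PREFIX = "raw_data/processed/"


def processed_key(raw_key: str) -> str:
    """Map a key under raw_data/to_process/ to the same file name under raw_data/processed/."""
    if not raw_key.startswith(RAW_PREFIX):
        raise ValueError(f"unexpected key outside {RAW_PREFIX}: {raw_key}")
    return DONE_PREFIX + raw_key[len(RAW_PREFIX):]


def move_to_processed(s3_client, bucket: str, raw_key: str) -> str:
    """Copy the object to the processed folder, then delete the original."""
    new_key = processed_key(raw_key)
    s3_client.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": raw_key},
    )
    s3_client.delete_object(Bucket=bucket, Key=raw_key)
    return new_key
```

Deleting only after the copy succeeds means a failure midway leaves the file in `to_process` rather than losing it.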
- The `commodities_data_extraction` Lambda function is responsible for extracting raw data from the API.
- It writes the raw data to the `raw_data/to_process` folder in S3.
- Trigger: Runs on a schedule (weekly) using CloudWatch Events.
- The `commodities_data_transform_and_load` Lambda function transforms the raw data and loads the clean data into the `transformed_data` folder in S3.
- Trigger: Automatically invoked when new data is added to the `raw_data/to_process` folder (via S3 trigger).
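When S3 invokes a Lambda function, the event payload lists the affected objects under `Records`, with object keys URL-encoded. A minimal sketch of how the transform function might pick out the new raw files is shown below; the function name and the prefix filter are illustrative assumptions, not the repository's code.

```python
from urllib.parse import unquote_plus

RAW_PREFIX = "raw_data/to_process/"


def new_raw_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for objects under raw_data/to_process/ in an S3 event."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 event notifications (e.g. ':' becomes '%3A').
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(RAW_PREFIX):
            pairs.append((bucket, key))
    return pairs
```

Filtering on the prefix inside the handler is a cheap safeguard in case the trigger is ever configured more broadly than intended.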
- CloudWatch triggers the `commodities_data_extraction` Lambda function every week.
- This Lambda function fetches commodity data (Gold and Silver) from the API and stores it in the `raw_data/to_process` folder in the S3 bucket.
- The S3 event trigger detects new data in the `to_process` folder and invokes the `commodities_data_transform_and_load` Lambda function.
- The second Lambda function transforms the raw data and stores the cleaned version in the `transformed_data` folder in S3.
- AWS Glue Crawler runs periodically to update the schema in the Glue Data Catalog.
- Amazon Athena can then be used to run SQL queries on the transformed data for analytics and reporting.
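The weekly trigger in the first step of this workflow can be expressed as a scheduled rule. The sketch below builds the arguments for the CloudWatch Events (EventBridge) `PutRule` call using a `rate(7 days)` schedule expression; the rule name is a hypothetical placeholder, and the repository may configure the schedule through the console instead.

```python
def weekly_rule_request(rule_name: str) -> dict:
    """Build keyword arguments for a PutRule call that fires once every 7 days."""
    return {
        "Name": rule_name,
        "ScheduleExpression": "rate(7 days)",
        "State": "ENABLED",
    }


rule_request = weekly_rule_request("commodities-weekly-extraction")  # hypothetical rule name

# With boto3 installed and AWS credentials configured:
#   import boto3
#   boto3.client("events").put_rule(**rule_request)
# The rule still needs the Lambda function added as a target, and the function
# needs a permission allowing the rule to invoke it.
```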
- Get Your Free API Key: Visit GoldAPI and sign up to obtain your API key.
- Clone the Repository: Clone this project to your local machine.
- Set Up Your Environment: Create a `.env` file inside 'For local usage' and add your API key like this: `API_KEY=your_api_key`
- Install Dependencies: You'll need the pandas, requests, and python-dotenv libraries (the PyPI package that provides the `dotenv` module is named `python-dotenv`). Install them by running: `pip install pandas requests python-dotenv`. Optionally, do this inside a virtual environment.
- Create the following folders inside 'For local usage':
  - raw_data
    - to_process
    - processed
  - transformed_data
- Run the Scripts: Execute the scripts in the 'For local usage' folder to extract and transform data.
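If you would rather not create the local folders by hand, a short script can do it. This is a convenience sketch rather than part of the repository; the function name is hypothetical and the base path is whatever directory holds the local scripts.

```python
from pathlib import Path

# Folder layout expected by the local scripts, relative to the base directory.
LOCAL_DIRS = ("raw_data/to_process", "raw_data/processed", "transformed_data")


def create_local_dirs(base) -> list[Path]:
    """Create the local folder layout under `base`; existing folders are left untouched."""
    created = []
    for relative in LOCAL_DIRS:
        path = Path(base) / relative
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Example: create_local_dirs("For local usage")
```

Because `exist_ok=True` is set, the function is safe to run repeatedly.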
- You can copy the provided Lambda functions or modify them to suit your use case.
- For the extraction function, you'll need to include the requests library. A pre-packaged requests layer is included in this repository; simply rename the folder to 'python' before deploying.
- For the transformation function, you can leverage AWS's managed pandas layer to handle the data transformation.