# Data Engineering Practice Repository

> [!TIP]
> **Quick Overview:** This repository provides a hands-on environment for practicing data ingestion, transformation, orchestration, data modeling, and other data engineering essentials.
## Table of Contents

- Introduction
- General Guidance
- Projects
  - 1) COVID-19 Case Data Cleansing & Modeling
  - 2) E-Commerce Sales Cleansing & Modeling
  - 3) Real Estate Transactions Cleansing & Modeling
  - 4) Social Media Insights with DBT (Tracking Sentiment on Elections)
  - 5) Stock or Financial Data Pipeline with DBT
  - 6) US Weather Events (2016–2022) with DBT
  - 7) Spark-Based Log Processing with Parquet
  - 8) End-to-End Data Lake (Generic Implementation)
  - 9) Spark-Based Historical Tracking (Dimension Changes over Time)
  - 10) Basic Machine Learning Pipeline Integration
## Introduction

Welcome to the Data Engineering Practice Repository! Here, you’ll find a variety of hands-on projects designed to help data engineers practice and sharpen their skills. Each project focuses on key aspects of the data engineering lifecycle: ingestion, transformation, orchestration, data modeling, and more.
- Multiple Project Scenarios: Each with instructions, goals, data sources, and sample code.
- General Guidance & Best Practices: Covering testing, CI/CD, code quality, and environment setup.
- Room for Contribution: This repo is under active development—feel free to suggest improvements, add new datasets, or propose enhancements via pull requests.
- Explore the Projects: Dive into each folder or guide to see instructions on how to set up and run them.
- Check the Guidance: See below for general notes on testing, linting, dev containers, etc.
- Contribute: We welcome contributions! Open a pull request or create an issue for bugs, improvements, or additional documentation.
> [!NOTE]
> We’ll be updating this repository with more instructions and datasets regularly.
> If you want to help us enhance the repo with new data, tutorials, or code fixes, your contributions are greatly appreciated!
## General Guidance

### Unit Testing

> [!IMPORTANT]
> **Objective:** Ensure your Python scripts (ETL, data transformations, utilities) work as intended and remain stable under changes.

> [!TIP]
> **Best Practices:**
> - Write tests that cover both expected and edge cases.
> - Organize tests in a `tests/` directory, each test file named `test_*.py`.
> - Keep tests small and focused on one function or behavior.
> - Run tests frequently (locally, in CI) to catch regressions early (see the example below).
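For illustration, a minimal pytest-style unit test might look like the sketch below. The module `my_pipeline.transformations` and the `clean_dates` helper are hypothetical names, not part of this repo.

```python
# tests/test_transformations.py
# Hypothetical example: my_pipeline.transformations.clean_dates is an assumed helper.
import pandas as pd
import pytest

from my_pipeline.transformations import clean_dates


def test_clean_dates_standardizes_format():
    # Expected case: mixed date representations become YYYY-MM-DD strings.
    df = pd.DataFrame({"date": ["2021/01/05", "January 5, 2021"]})
    result = clean_dates(df, column="date")
    assert result["date"].tolist() == ["2021-01-05", "2021-01-05"]


def test_clean_dates_missing_column_raises():
    # Edge case: a missing column should fail loudly rather than silently pass.
    with pytest.raises(KeyError):
        clean_dates(pd.DataFrame(), column="date")
```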
### Integration Testing

> [!IMPORTANT]
> **Objective:** Validate interactions between your application/code and external services (e.g., Postgres, Spark, APIs) in a realistic environment.

> [!TIP]
> **Best Practices:**
> - Use a `docker-compose.yml` file to define containers for databases, data stores, or third-party services.
> - Spin up containers for the test environment, run your integration tests, then tear everything down.
> - Keep these tests separate from unit tests, as they may be slower or more resource-intensive (see the sketch below).
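As one possible sketch, assuming your `docker-compose.yml` exposes a Postgres service on `localhost:5432` with the example credentials shown here, an integration test could verify a simple load/read round trip:

```python
# tests/integration/test_postgres_roundtrip.py
# Assumes a Postgres container (from docker-compose.yml) is already running;
# the connection string below is an example, not a real credential.
import pandas as pd
from sqlalchemy import create_engine, text


def test_staging_table_roundtrip():
    engine = create_engine("postgresql://postgres:postgres@localhost:5432/practice_db")

    # Write a tiny sample frame to a staging table, then read it back.
    sample = pd.DataFrame({"id": [1, 2], "amount": [9.99, 14.50]})
    sample.to_sql("stg_sample", engine, if_exists="replace", index=False)

    with engine.connect() as conn:
        count = conn.execute(text("SELECT COUNT(*) FROM stg_sample")).scalar()

    assert count == 2
```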
### Linting & Formatting

> [!IMPORTANT]
> **Objective:** Maintain consistent coding style and prevent common mistakes or code smells.

> [!TIP]
> **Best Practices:**
> - Use a linter (e.g., Flake8, Pylint) to detect issues (undefined variables, unused imports).
> - Use a formatter (e.g., Black, Autopep8) to standardize indentation, spacing, and line lengths.
> - Always run linting and formatting, either locally before committing or via pre-commit hooks, to ensure code consistency across the team.
### Git Pre-commit Hooks

> [!IMPORTANT]
> **Objective:** Automate checks that run before code is committed, preventing errors or style violations from entering the repo.

> [!TIP]
> **Best Practices:**
> - Use a framework like pre-commit to configure hooks for linting, formatting, or security checks.
> - Hooks reject commits that fail lint rules or formatting.
> - This ensures code always meets minimum quality standards.
### CI/CD Pipelines

> [!IMPORTANT]
> **Objective:** Automate the build, test, and deployment process so code changes are verified and released quickly and confidently.

> [!TIP]
> **Best Practices:**
> - Configure pipelines (e.g., GitHub Actions, GitLab CI, Jenkins) to run unit tests, lint checks, and integration tests.
> - Fail the pipeline if any check or test fails.
> - Deploy only after passing all stages, ensuring robust releases.
### Mocking External Services

> [!IMPORTANT]
> **Objective:** In tests, replace (or “mock”) external services (APIs, databases) with controlled “fake” responses, keeping tests fast and isolated.

> [!TIP]
> **Best Practices:**
> - Use mocking libraries (e.g., `unittest.mock`, `pytest-mock`) to emulate external dependencies.
> - Verify your logic without making real network calls or changing real data.
> - Combine mocking tests (fast, isolated) with integration tests (real environment) for thorough coverage (see the example below).
### Dev Containers

> [!IMPORTANT]
> **Objective:** Standardize your development environment so every team member (including CI systems) uses the same dependencies, tools, and configurations.

> [!TIP]
> **Best Practices:**
> - Create a `.devcontainer/` folder with a `Dockerfile` or reference to a base image that includes Python, necessary libraries, and lint/test tools.
> - Configure VS Code (or another IDE) to use the dev container automatically, ensuring consistent local dev and easier onboarding.
> - Sync the dev environment with your CI environment so the same versions of Python, libraries, and system dependencies are used everywhere.
### Consistent Code Style from Day One

> [!IMPORTANT]
> **Objective:** From the moment you set up a project, ensure code style is enforced consistently to avoid “style drift” and merge conflicts.

> [!TIP]
> **Best Practices:**
> - Document your lint and format processes (e.g., “Run `black . && flake8 .` before committing”).
> - Integrate them into your dev container configuration, pre-commit hooks, and CI.
> - Use a standard config file (like `pyproject.toml` for Black) so the entire team (and CI) uses the same style rules.
### Summary

By following these general practices, junior developers can:

- **Improve reliability:** Automated tests (unit + integration) catch issues early.
- **Enhance maintainability:** Standardized linting/formatting keeps code readable.
- **Ensure consistency:** Git pre-commit hooks and dev containers align local and CI environments.
- **Optimize development flow:** Quick feedback loops (via mocking and CI/CD) let you iterate rapidly without unexpected production issues.

Incorporate these steps into each data engineering project, whether building ETL pipelines, analytics workflows, or machine learning models, to maintain high-quality, scalable, and resilient solutions.
## Projects

> [!NOTE]
> Each project below focuses on a specific data engineering scenario. Feel free to explore them in any order, adapt the data sources, or enhance the orchestration pipelines.
### 1) COVID-19 Case Data Cleansing & Modeling

#### Business Goal

- **Objective:** Provide daily and cumulative COVID-19 case, recovery, and death counts to assist medical facilities in resource planning (e.g., hospital bed allocation).
- **Key Insight:** By having a cleansed, modeled dataset, public health officials can easily see new daily cases and hospitalization trends.

#### Data & Dataset Description

- **Data Source:**
- **Data Format:** Multiple CSVs with daily counts by region (cases, deaths, recoveries).
- **Data Details:** Inconsistent country/state naming, negative corrections, and partial data for some dates.

#### Expectations

1. **Ingest**
   - Use Python or Airflow to download new daily CSVs (or pull from GitHub).
2. **Cleanse**
   - Convert date columns to a standard YYYY-MM-DD.
   - Standardize location names (e.g., “US” vs. “United States”).
   - Handle missing/negative values (see the sketch below).
3. **Model in Postgres**
   - `fact_covid_cases` table with date, region, metrics (cases, deaths, recoveries).
   - `dim_date`, `dim_location` for time-series and region-based analysis.
4. **Orchestrate**
   - An Airflow DAG that automates daily ingestion and loading into Postgres.
5. **Simple Dashboard**
   - Charts: daily new cases, cumulative metrics, 7-day moving average.
   - Use Case: hospital admins see trending cases to plan ICU capacity.
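A minimal pandas sketch of the cleansing step is shown below. The column names (`date`, `location`, `cases`, `deaths`, `recoveries`) and the alias map are assumptions; adjust them to the actual CSV layout.

```python
# cleanse_covid.py -- illustrative only; column names are assumptions about the source CSV
import pandas as pd

LOCATION_ALIASES = {"US": "United States", "UK": "United Kingdom"}


def cleanse(csv_path: str) -> pd.DataFrame:
    df = pd.read_csv(csv_path)

    # Standardize dates to YYYY-MM-DD and drop rows that cannot be parsed.
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df = df.dropna(subset=["date"])
    df["date"] = df["date"].dt.strftime("%Y-%m-%d")

    # Unify country/state naming.
    df["location"] = df["location"].str.strip().replace(LOCATION_ALIASES)

    # Treat missing metrics as 0 and clip negative corrections.
    for col in ["cases", "deaths", "recoveries"]:
        df[col] = df[col].fillna(0).clip(lower=0)

    return df
```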
### 2) E-Commerce Sales Cleansing & Modeling

#### Business Goal

- **Objective:** Enable an e-commerce store to track revenue, top products, and monthly growth accurately.
- **Key Insight:** Clean, reliable data helps marketing and finance forecast sales and optimize inventory.

#### Data & Dataset Description

- **Data Source:**
- **Data Format:** CSV with `InvoiceNo`, `StockCode`, `Description`, `Quantity`, `InvoiceDate`, `UnitPrice`, `CustomerID`.
- **Data Details:** Missing `CustomerID`s, negative quantities (returns), inconsistent date/time formats.

#### Expectations

1. **Ingest**
   - Download CSV from UCI/Kaggle, store locally or in cloud.
   - Load into Postgres using Python or an Airflow ingest task (see the sketch below).
2. **Cleanse**
   - Fix negative `Quantity` for returns or exclude them if needed.
   - Convert `InvoiceDate` to a standard timestamp.
   - Remove duplicates or missing CustomerIDs where appropriate.
3. **Model in Postgres**
   - `fact_sales`: invoice number, product key, date key, customer key, quantity, total amount.
   - `dim_product`, `dim_date`, `dim_customer` for analytics.
4. **Orchestrate**
   - Airflow pipeline for monthly or weekly ingestion and transformations.
5. **Simple Dashboard**
   - Charts: daily/weekly revenue, top products, growth rate.
   - Use Case: marketing sees product performance, finance monitors revenue trends.
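One way to combine the ingest and cleanse steps is sketched below, using the column names listed above; the Postgres connection string is a placeholder.

```python
# load_retail.py -- sketch of ingest + cleanse into a Postgres staging table
import pandas as pd
from sqlalchemy import create_engine


def load_sales(csv_path: str) -> None:
    df = pd.read_csv(csv_path)

    # Parse InvoiceDate and drop rows without a usable date or CustomerID.
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"], errors="coerce")
    df = df.dropna(subset=["InvoiceDate", "CustomerID"]).drop_duplicates()

    # Keep returns (negative Quantity) but flag them for downstream modeling.
    df["IsReturn"] = df["Quantity"] < 0

    # Placeholder connection string -- point this at your own Postgres instance.
    engine = create_engine("postgresql://postgres:postgres@localhost:5432/ecommerce")
    df.to_sql("stg_sales", engine, if_exists="replace", index=False)
```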
### 3) Real Estate Transactions Cleansing & Modeling

#### Business Goal

- **Objective:** Help realtors/agencies understand property price trends (average sale price, monthly volume) for better market insights.
- **Key Insight:** Standardized views reveal which neighborhoods are “hot” or undervalued.

#### Data & Dataset Description

- **Data Source:**
  - NYC Property Sales
  - Updated link: NYC New Website
- **Data Format:** CSV with address, sale price, sale date, building class, etc.
- **Data Details:** Potential outliers ($0 sales), partial addresses, inconsistent date formats.

#### Expectations

1. **Ingest**
   - Download monthly/annual real estate CSV, store in local/cloud.
   - Load into Postgres staging.
2. **Cleanse**
   - Remove invalid or zero sale prices (unless legitimate).
   - Standardize addresses, fix building class codes, unify sale dates.
3. **Model in Postgres**
   - `fact_property_sales`: sale date, sale price, property ID.
   - `dim_property` (address, building type), `dim_date`, `dim_location` (borough, ZIP).
4. **Orchestrate**
   - Airflow DAG for monthly refresh of property sales (see the DAG sketch below).
5. **Simple Dashboard**
   - Charts: average sale price by neighborhood, monthly sales volume.
   - Use Case: realtors identify market trends, plan listings/pricing.
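A minimal Airflow DAG for the monthly refresh might look like the sketch below (Airflow 2.x syntax; the callables imported from `pipelines.property_sales` are hypothetical placeholders):

```python
# dags/property_sales_dag.py -- minimal sketch; the imported callables are hypothetical
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from pipelines.property_sales import ingest_csv, cleanse_staging, build_models

with DAG(
    dag_id="property_sales_monthly",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_csv)
    cleanse = PythonOperator(task_id="cleanse", python_callable=cleanse_staging)
    model = PythonOperator(task_id="model", python_callable=build_models)

    # Ingest first, then cleanse the staging data, then rebuild the fact/dim tables.
    ingest >> cleanse >> model
```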
### 4) Social Media Insights with DBT (Tracking Sentiment on Elections)

#### Business Goal

- **Objective:** Provide public sentiment metrics on elections (e.g., US Elections) to inform political campaigns or media outlets.
- **Key Insight:** Understanding how sentiment shifts over time helps tailor messaging or coverage.

#### Data & Dataset Description

- **Data Source:**
- **Data Format:** CSV or JSON with tweet text, user handle, timestamp, possibly location.
- **Data Details:** Large volumes, duplicates, spam accounts, various timestamp formats.

#### Expectations

1. **Ingest**
   - Pull data from Kaggle, store raw in Postgres.
2. **Cleanse**
   - Remove duplicates, unify timestamps, filter out retweets or spam.
   - (Optional) run a sentiment analysis script to add a sentiment column (see the sketch below).
3. **Model in Postgres + DBT**
   - Staging: parse text fields, unify user IDs.
   - Marts: `fact_tweets` with sentiment scores, key hashtags, tweet counts.
4. **Orchestrate**
   - Airflow: daily ingestion and DBT transformations.
5. **Simple Dashboard**
   - Charts: sentiment vs. time, top hashtags, tweet volume by day.
   - Use Case: campaigns or media see trending sentiment on key election topics.
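For the optional sentiment step, one possibility (an assumption, since the project does not prescribe a library) is VADER via the `vaderSentiment` package:

```python
# add_sentiment.py -- optional enrichment step; VADER is just one possible scorer
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


def add_sentiment(df: pd.DataFrame, text_col: str = "tweet_text") -> pd.DataFrame:
    # "tweet_text" is an assumed column name -- rename to match your dataset.
    analyzer = SentimentIntensityAnalyzer()

    # The compound score ranges from -1 (most negative) to +1 (most positive).
    df["sentiment"] = df[text_col].fillna("").map(
        lambda text: analyzer.polarity_scores(text)["compound"]
    )
    return df
```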
### 5) Stock or Financial Data Pipeline with DBT

#### Business Goal

- **Objective:** Use Python (pandas, yfinance) and DBT to analyze daily stock price data for top 10 companies (e.g., AAPL, MSFT, AMZN, GOOGL, etc.), gaining insights into long-term trends, price volatility, and potential investment opportunities over time.
- **Key Insight:** Consolidating historical stock data and running transformations/tests in DBT allows traders, analysts, or finance teams to discover patterns, compare performance across companies, and create dashboards that highlight price movements and trading volume.

#### Data & Dataset Description

- **Data Source:**
  - Yahoo Finance (using the `yfinance` Python package).
- **Data Format:** CSV or direct DataFrame from yfinance, with columns like `Date`, `Open`, `High`, `Low`, `Close`, `Adj Close`, and `Volume`.
- **Data Details:** Daily trading days (no weekends/holidays), possible missing dates (holidays), stock splits, dividends, or corporate actions might need special consideration.

#### Expectations

1. **Ingest**
   - Write a Python script (or an Airflow task) using `yfinance` to download daily stock data for each of your top 10 companies (see the sketch below).
2. **Cleanse**
   - Standardize date formats (ensure `Date` is `YYYY-MM-DD`).
   - Handle or flag missing values.
3. **Model in Postgres + DBT**
   - Staging Models: unify all companies’ data, rename columns consistently.
   - Marts: Create a `fact_stock_prices` table (date, ticker, open, high, low, close, volume).
   - Implement DBT tests for data integrity.
4. **Orchestrate**
   - Use Airflow to schedule the ingestion and DBT transformations.
5. **Simple Dashboard**
   - Charts: compare closing prices of the 10 tickers over time, daily trading volume, moving averages.
   - Use Case: finance teams see which company outperforms others over certain timeframes.
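The ingestion step could be sketched as below; the ticker list and date range are examples only.

```python
# ingest_stocks.py -- sketch of the yfinance download step (tickers/dates are examples)
import pandas as pd
import yfinance as yf

TICKERS = ["AAPL", "MSFT", "AMZN", "GOOGL"]  # extend to your full top-10 list


def download_daily_prices(start: str = "2015-01-01") -> pd.DataFrame:
    frames = []
    for ticker in TICKERS:
        # Daily bars; auto_adjust=False keeps a separate "Adj Close" column.
        df = yf.download(ticker, start=start, interval="1d", auto_adjust=False)
        df = df.reset_index()
        df["Ticker"] = ticker
        frames.append(df)

    # Long format (one row per ticker per day) is convenient for loading into Postgres.
    return pd.concat(frames, ignore_index=True)
```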
### 6) US Weather Events (2016–2022) with DBT

#### Business Goal

- **Objective:** Build a data pipeline to ingest and analyze 8.6 million weather events across the US, enabling a deeper understanding of long-term weather patterns.
- **Key Insight:** By cleaning and modeling event data in a structured way, various stakeholders can detect trends (e.g., increased storm frequency), identify seasonal severity shifts, and allocate resources more effectively.

#### Data & Dataset Description

- **Data Source:**
- **Data Format:** CSV with columns like `EventId`, `Type` (Rain, Snow, etc.), `Severity`, `StartTime(UTC)`, etc.
- **Data Details:** 2,071 airport-based stations, event types like Rain, Snow, Fog, Storm, Hail, etc.

#### Expectations

1. **Ingest**
   - Download the large CSV (1.09 GB) from Kaggle (see the chunked-load sketch below).
2. **Cleanse**
   - Convert/standardize date/time fields.
   - Validate severity fields; remove or flag invalid data.
3. **Model in Postgres + DBT**
   - Staging Models: unify column names, parse timestamps.
   - Marts: `fact_weather_events`, `dim_station`, incremental DBT model if needed.
4. **Orchestrate**
   - Schedule a pipeline via Airflow for monthly or historical backfill.
5. **Simple Dashboard**
   - Charts: count of events by type over time, severity distribution by region, average precipitation by month/year.
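Because the file is around 1 GB, a chunked load keeps memory use bounded. The sketch below assumes a Postgres staging table and uses the column names listed above; the connection string is a placeholder.

```python
# ingest_weather.py -- chunked load of the large Kaggle CSV into a staging table
import pandas as pd
from sqlalchemy import create_engine


def load_weather_events(csv_path: str, chunksize: int = 100_000) -> None:
    # Placeholder connection string -- point this at your own Postgres instance.
    engine = create_engine("postgresql://postgres:postgres@localhost:5432/weather")

    for chunk in pd.read_csv(csv_path, chunksize=chunksize, parse_dates=["StartTime(UTC)"]):
        # Append each chunk so the whole file never has to sit in memory at once.
        chunk.to_sql("stg_weather_events", engine, if_exists="append", index=False)
```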
### 7) Spark-Based Log Processing with Parquet

#### Business Goal

- **Objective:** Analyze web server access logs at scale to identify traffic spikes, error patterns, and popular resources.
- **Key Insight:** DevOps and business stakeholders can leverage log insights to optimize site performance, detect security issues, and enhance user experience.

#### Data & Dataset Description

- **Data Source:**
  - Web Server Access Logs on Kaggle (~3.3 GB).
- **Data Format:** Nginx server access logs (IP, timestamp, HTTP method, endpoint, etc.).
- **Data Details:** Bot traffic, real user sessions, possible error codes.

#### Expectations

1. **Ingest**
   - Download compressed logs from Kaggle.
   - Store in local or HDFS/S3.
   - Use Spark to parse log lines (see the sketch below).
2. **Cleanse**
   - Extract columns: IP, timestamp, request path, status code, etc.
   - Filter malformed lines or large outliers.
3. **Write to Parquet**
   - Convert to structured schema, partition by date or status code.
4. **Orchestrate**
   - Airflow to automate daily/weekly runs.
5. **Simple Dashboard**
   - Charts: request volume over time, distribution of HTTP status codes, top endpoints.
   - Use Case: DevOps monitors traffic spikes or error surges.
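A PySpark sketch of the parse-and-write step is shown below. The regex matches a common Nginx/combined log layout and will likely need adjusting for the actual files; paths are examples.

```python
# parse_logs.py -- PySpark sketch: parse raw access logs and write partitioned Parquet
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract

spark = SparkSession.builder.appName("access-log-parsing").getOrCreate()

# Matches a common Nginx "combined" layout: IP, timestamp, "METHOD /path ...", status, size.
LOG_PATTERN = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)'

raw = spark.read.text("data/raw/access.log")  # example path

parsed = (
    raw.select(
        regexp_extract("value", LOG_PATTERN, 1).alias("ip"),
        regexp_extract("value", LOG_PATTERN, 2).alias("raw_timestamp"),
        regexp_extract("value", LOG_PATTERN, 3).alias("method"),
        regexp_extract("value", LOG_PATTERN, 4).alias("endpoint"),
        regexp_extract("value", LOG_PATTERN, 5).cast("int").alias("status"),
    )
    # Lines that do not match the pattern produce empty/null fields -- drop them.
    .filter((col("ip") != "") & col("status").isNotNull())
)

parsed.write.mode("overwrite").partitionBy("status").parquet("data/curated/access_logs")
```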
### 8) End-to-End Data Lake (Generic Implementation)

#### Business Goal

- **Objective:** Establish a scalable, cost-efficient repository for large datasets (e.g., NYC Taxi data) for ad-hoc analytics and historical trend exploration.
- **Key Insight:** Users can query historical data on demand, quickly producing insights without the overhead of a traditional data warehouse.

#### Data & Dataset Description

- **Data Source (Examples):**
  - NYC Taxi Trip Records
  - data.gov open datasets
- **Data Format:** CSV or Parquet, potentially gigabytes to terabytes.
- **Data Details:** Often monthly or quarterly releases; partitioning is crucial.

#### Expectations

1. **Ingest**
   - Store raw files in a landing zone (cloud bucket, on-prem).
2. **Cleanse & Curate**
   - Use Spark to read raw data, fix data types, handle outliers.
   - Convert to partitioned Parquet in a curated zone (see the sketch below).
3. **Orchestrate**
   - Airflow or similar workflow for scheduled tasks.
4. **(Optional) Model in a Relational DB**
   - For aggregated or frequently accessed summaries.
5. **Simple Dashboard & Business Queries**
   - Query Engine: Presto, Hive, Trino, Spark SQL, or cloud-native (Athena, BigQuery).
   - Use Case: city planners, taxi companies, and policy makers can quickly analyze usage trends.
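A curation sketch with PySpark is below. The paths and the NYC Taxi column names (`tpep_pickup_datetime`, `fare_amount`) are assumptions to be checked against the actual schema.

```python
# curate_trips.py -- PySpark sketch: landing-zone CSV -> partitioned Parquet curated zone
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, month, to_date, year

spark = SparkSession.builder.appName("taxi-curation").getOrCreate()

# Example paths -- swap in your own landing/curated locations (local, S3, HDFS, ...).
raw = spark.read.option("header", True).csv("s3a://my-lake/landing/yellow_tripdata_*.csv")

curated = (
    raw.withColumn("pickup_ts", col("tpep_pickup_datetime").cast("timestamp"))
    .withColumn("fare_amount", col("fare_amount").cast("double"))
    .filter(col("fare_amount") >= 0)  # drop obviously invalid fares
    .withColumn("pickup_date", to_date(col("pickup_ts")))
    .withColumn("year", year(col("pickup_date")))
    .withColumn("month", month(col("pickup_date")))
)

# Partitioning by year/month keeps ad-hoc queries over a time range cheap.
curated.write.mode("overwrite").partitionBy("year", "month").parquet(
    "s3a://my-lake/curated/yellow_trips"
)
```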
### 9) Spark-Based Historical Tracking (Dimension Changes over Time)

#### Business Goal

- **Objective:** Maintain full historical records of changing electronic product details (e.g., prices, availability) for accurate point-in-time analysis.
- **Key Insight:** By storing every change rather than overwriting records, teams can analyze how product prices/conditions evolve over time.

#### Data & Dataset Description

- **Dataset:**
- **Data Format:** CSV with multiple columns such as price, condition, dateSeen, merchant.
- **Data Details:** A single `id` might appear multiple times with different prices or conditions.

#### Expectations

1. **Ingest**
   - Periodically retrieve CSV from Kaggle.
2. **Cleanse**
   - Validate `id`, convert `prices.dateSeen` to timestamp.
3. **Write to Parquet with Historical Tracking**
   - Append new entries instead of overwriting for each product’s snapshot in time (see the sketch below).
4. **Orchestrate**
   - Airflow to automate CSV retrieval and Spark jobs.
5. **Simple Dashboard**
   - Charts: price trend over time, condition vs. price, merchant competition.
   - Use Case: e-commerce analysts see historical pricing to decide on promotions or competitor matching.
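The historical-tracking write can be as simple as appending each batch with a snapshot date, as sketched below (paths are examples; verify column handling against the Kaggle file).

```python
# track_price_history.py -- PySpark sketch: append snapshots instead of overwriting
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date

spark = SparkSession.builder.appName("product-price-history").getOrCreate()

snapshot = (
    spark.read.option("header", True)
    .csv("data/raw/electronics_products.csv")  # example path
    .withColumn("snapshot_date", current_date())  # tag every row with this batch's load date
)

# Append mode keeps every earlier snapshot, so a given product id can be traced over time.
(
    snapshot.write.mode("append")
    .partitionBy("snapshot_date")
    .parquet("data/curated/product_price_history")
)
```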
### 10) Basic Machine Learning Pipeline Integration

#### Business Goal

- **Objective:** Predict customer churn to help telecom providers improve retention strategies or cross-sell opportunities.
- **Key Insight:** Storing model outputs in Postgres lets business teams query high-risk customers and compare predicted outcomes to actual behavior.

#### Data & Dataset Description

- **Data Source:**
- **Data Format & Columns:** `customerID`, `gender`, `tenure`, `InternetService`, etc.
- **Data Details:** Missing values, categorical variables needing encoding, outliers in numeric columns.

#### Expectations

1. **Ingest**
   - Download CSV and store locally or in cloud.
   - Load into Postgres staging.
2. **Cleanse**
   - Handle missing data, encode categorical fields.
   - Detect/remove outliers.
3. **Model in Postgres**
   - Store cleaned dataset in `fact_churn_features`.
   - Train a classification model (e.g., logistic regression).
   - Write predictions back to a Postgres table (see the sketch below).
4. **Orchestrate**
   - Airflow to automate data prep, model training, and prediction generation.
5. **Simple Dashboard**
   - Charts: Churn rate by segment, predicted vs. actual churn.
   - Use Case: Marketing or retention teams filter high-risk customers for proactive promotions.
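A sketch of the train-and-write-back step is below, assuming the cleaned features already live in `fact_churn_features` and that the dataset has a `Churn` column with Yes/No labels (the connection string is a placeholder).

```python
# churn_model.py -- sketch of training a classifier and writing predictions back to Postgres
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sqlalchemy import create_engine

# Placeholder connection string -- point this at your own Postgres instance.
engine = create_engine("postgresql://postgres:postgres@localhost:5432/telco")

df = pd.read_sql("SELECT * FROM fact_churn_features", engine)

# One-hot encode categorical fields; assumes a "Churn" label column with Yes/No values.
X = pd.get_dummies(df.drop(columns=["customerID", "Churn"]))
y = (df["Churn"] == "Yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Holdout accuracy: {model.score(X_test, y_test):.3f}")

# Write per-customer churn probabilities back to Postgres for business queries.
predictions = pd.DataFrame(
    {
        "customerID": df.loc[X_test.index, "customerID"].values,
        "churn_probability": model.predict_proba(X_test)[:, 1],
    }
)
predictions.to_sql("churn_predictions", engine, if_exists="replace", index=False)
```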
> [!WARNING]
> Always verify licensing for any public dataset you use in production. Some have non-commercial clauses or require attribution.
> [!CAUTION]
> When working with personally identifiable information (PII) or sensitive data, follow GDPR, CCPA, or relevant privacy regulations. Failure to comply can result in legal and financial penalties.
Happy Data Engineering!
Feel free to open a PR or an issue with any questions or suggestions.