# Data Engineering Practice Repository

> [!TIP]
> **Quick Overview:** This repository provides a hands-on environment for practicing data ingestion, transformation, orchestration, data modeling, and other data engineering essentials.
## Table of Contents

- Introduction
- General Guidance
- Projects
  - 1) COVID-19 Case Data Cleansing & Modeling
  - 2) E-Commerce Sales Cleansing & Modeling
  - 3) Real Estate Transactions Cleansing & Modeling
  - 4) Social Media Insights with DBT (Tracking Sentiment on Elections)
  - 5) Stock or Financial Data Pipeline with DBT
  - 6) US Weather Events (2016–2022) with DBT
  - 7) Spark-Based Log Processing with Parquet
  - 8) End-to-End Data Lake (Generic Implementation)
  - 9) Spark-Based Historical Tracking (Dimension Changes over Time)
  - 10) Basic Machine Learning Pipeline Integration
## Introduction

Welcome to the Data Engineering Practice Repository! Here, you’ll find a variety of hands-on projects designed to help data engineers practice and sharpen their skills. Each project focuses on key aspects of the data engineering lifecycle: ingestion, transformation, orchestration, data modeling, and more.
- Multiple Project Scenarios: Each with instructions, goals, data sources, and sample code.
- General Guidance & Best Practices: Covering testing, CI/CD, code quality, and environment setup.
- Room for Contribution: This repo is under active development—feel free to suggest improvements, add new datasets, or propose enhancements via pull requests.
- Explore the Projects: Dive into each folder or guide to see instructions on how to set up and run them.
- Check the Guidance: See below for general notes on testing, linting, dev containers, etc.
- Contribute: We welcome contributions! Open a pull request or create an issue for bugs, improvements, or additional documentation.
> [!NOTE]
> We’ll be updating this repository with more instructions and datasets regularly.
> If you want to help us enhance the repo with new data, tutorials, or code fixes, your contributions are greatly appreciated!
## General Guidance

### Unit Testing

> [!IMPORTANT]
> **Objective:** Ensure your Python scripts (ETL, data transformations, utilities) work as intended and remain stable under changes.

> [!TIP]
> **Best Practices:**
> - Write tests that cover both expected and edge cases.
> - Organize tests in a `tests/` directory, each test file named `test_*.py`.
> - Keep tests small and focused on one function or behavior.
> - Run tests frequently (locally, in CI) to catch regressions early (see the example below).
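For illustration, a minimal pytest-style unit test might look like the sketch below. The module `my_pipeline.transformations` and the `clean_dates` helper are hypothetical names, not part of this repo.

```python
# tests/test_transformations.py
# Hypothetical example: my_pipeline.transformations.clean_dates is an assumed helper.
import pandas as pd
import pytest

from my_pipeline.transformations import clean_dates


def test_clean_dates_standardizes_format():
    # Expected case: mixed date representations become YYYY-MM-DD strings.
    df = pd.DataFrame({"date": ["2021/01/05", "January 5, 2021"]})
    result = clean_dates(df, column="date")
    assert result["date"].tolist() == ["2021-01-05", "2021-01-05"]


def test_clean_dates_missing_column_raises():
    # Edge case: a missing column should fail loudly rather than silently pass.
    with pytest.raises(KeyError):
        clean_dates(pd.DataFrame(), column="date")
```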
### Integration Testing

> [!IMPORTANT]
> **Objective:** Validate interactions between your application/code and external services (e.g., Postgres, Spark, APIs) in a realistic environment.

> [!TIP]
> **Best Practices:**
> - Use a `docker-compose.yml` file to define containers for databases, data stores, or third-party services.
> - Spin up containers for the test environment, run your integration tests, then tear everything down.
> - Keep these tests separate from unit tests, as they may be slower or more resource-intensive (see the sketch below).
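As one possible sketch, assuming your `docker-compose.yml` exposes a Postgres service on `localhost:5432` with the example credentials shown here, an integration test could verify a simple load/read round trip:

```python
# tests/integration/test_postgres_roundtrip.py
# Assumes a Postgres container (from docker-compose.yml) is already running;
# the connection string below is an example, not a real credential.
import pandas as pd
from sqlalchemy import create_engine, text


def test_staging_table_roundtrip():
    engine = create_engine("postgresql://postgres:postgres@localhost:5432/practice_db")

    # Write a tiny sample frame to a staging table, then read it back.
    sample = pd.DataFrame({"id": [1, 2], "amount": [9.99, 14.50]})
    sample.to_sql("stg_sample", engine, if_exists="replace", index=False)

    with engine.connect() as conn:
        count = conn.execute(text("SELECT COUNT(*) FROM stg_sample")).scalar()

    assert count == 2
```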
### Linting & Formatting

> [!IMPORTANT]
> **Objective:** Maintain consistent coding style and prevent common mistakes or code smells.

> [!TIP]
> **Best Practices:**
> - Use a linter (e.g., Flake8, Pylint) to detect issues (undefined variables, unused imports).
> - Use a formatter (e.g., Black, Autopep8) to standardize indentation, spacing, and line lengths.
> - Always run linting and formatting, either locally before committing or via pre-commit hooks, to ensure code consistency across the team.
### Git Pre-commit Hooks

> [!IMPORTANT]
> **Objective:** Automate checks that run before code is committed, preventing errors or style violations from entering the repo.

> [!TIP]
> **Best Practices:**
> - Use a framework like pre-commit to configure hooks for linting, formatting, or security checks.
> - Hooks reject commits that fail lint rules or formatting.
> - This ensures code always meets minimum quality standards.
### CI/CD Pipelines

> [!IMPORTANT]
> **Objective:** Automate the build, test, and deployment process so code changes are verified and released quickly and confidently.

> [!TIP]
> **Best Practices:**
> - Configure pipelines (e.g., GitHub Actions, GitLab CI, Jenkins) to run unit tests, lint checks, and integration tests.
> - Fail the pipeline if any check or test fails.
> - Deploy only after passing all stages, ensuring robust releases.
### Mocking External Services

> [!IMPORTANT]
> **Objective:** In tests, replace (or “mock”) external services (APIs, databases) with controlled “fake” responses, keeping tests fast and isolated.

> [!TIP]
> **Best Practices:**
> - Use mocking libraries (e.g., `unittest.mock`, `pytest-mock`) to emulate external dependencies.
> - Verify your logic without making real network calls or changing real data.
> - Combine mocking tests (fast, isolated) with integration tests (real environment) for thorough coverage (see the example below).
### Dev Containers

> [!IMPORTANT]
> **Objective:** Standardize your development environment so every team member (including CI systems) uses the same dependencies, tools, and configurations.

> [!TIP]
> **Best Practices:**
> - Create a `.devcontainer/` folder with a `Dockerfile` or reference to a base image that includes Python, necessary libraries, and lint/test tools.
> - Configure VS Code (or another IDE) to use the dev container automatically, ensuring consistent local dev and easier onboarding.
> - Sync the dev environment with your CI environment so the same versions of Python, libraries, and system dependencies are used everywhere.
### Consistent Code Style from Day One

> [!IMPORTANT]
> **Objective:** From the moment you set up a project, ensure code style is enforced consistently to avoid “style drift” and merge conflicts.

> [!TIP]
> **Best Practices:**
> - Document your lint and format processes (e.g., “Run `black . && flake8 .` before committing”).
> - Integrate them into your dev container configuration, pre-commit hooks, and CI.
> - Use a standard config file (like `pyproject.toml` for Black) so the entire team (and CI) uses the same style rules.
### Summary

By following these general practices, junior developers can:

- **Improve reliability:** Automated tests (unit + integration) catch issues early.
- **Enhance maintainability:** Standardized linting/formatting keeps code readable.
- **Ensure consistency:** Git pre-commit hooks and dev containers align local and CI environments.
- **Optimize development flow:** Quick feedback loops (via mocking and CI/CD) let you iterate rapidly without unexpected production issues.

Incorporate these steps into each data engineering project, whether building ETL pipelines, analytics workflows, or machine learning models, to maintain high-quality, scalable, and resilient solutions.
## Projects

> [!NOTE]
> Each project below focuses on a specific data engineering scenario. Feel free to explore them in any order, adapt the data sources, or enhance the orchestration pipelines.
### 1) COVID-19 Case Data Cleansing & Modeling

#### Business Goal

- **Objective:** Provide daily and cumulative COVID-19 case, recovery, and death counts to assist medical facilities in resource planning (e.g., hospital bed allocation).
- **Key Insight:** By having a cleansed, modeled dataset, public health officials can easily see new daily cases and hospitalization trends.

#### Data & Dataset Description

- **Data Source:**
- **Data Format:** Multiple CSVs with daily counts by region (cases, deaths, recoveries).
- **Data Details:** Inconsistent country/state naming, negative corrections, and partial data for some dates.

#### Expectations

1. **Ingest**
   - Use Python or Airflow to download new daily CSVs (or pull from GitHub).
2. **Cleanse**
   - Convert date columns to a standard YYYY-MM-DD.
   - Standardize location names (e.g., “US” vs. “United States”).
   - Handle missing/negative values (see the sketch below).
3. **Model in Postgres**
   - `fact_covid_cases` table with date, region, metrics (cases, deaths, recoveries).
   - `dim_date`, `dim_location` for time-series and region-based analysis.
4. **Orchestrate**
   - An Airflow DAG that automates daily ingestion and loading into Postgres.
5. **Simple Dashboard**
   - Charts: daily new cases, cumulative metrics, 7-day moving average.
   - Use Case: hospital admins see trending cases to plan ICU capacity.
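A minimal pandas sketch of the cleansing step is shown below. The column names (`date`, `location`, `cases`, `deaths`, `recoveries`) and the alias map are assumptions; adjust them to the actual CSV layout.

```python
# cleanse_covid.py -- illustrative only; column names are assumptions about the source CSV
import pandas as pd

LOCATION_ALIASES = {"US": "United States", "UK": "United Kingdom"}


def cleanse(csv_path: str) -> pd.DataFrame:
    df = pd.read_csv(csv_path)

    # Standardize dates to YYYY-MM-DD and drop rows that cannot be parsed.
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df = df.dropna(subset=["date"])
    df["date"] = df["date"].dt.strftime("%Y-%m-%d")

    # Unify country/state naming.
    df["location"] = df["location"].str.strip().replace(LOCATION_ALIASES)

    # Treat missing metrics as 0 and clip negative corrections.
    for col in ["cases", "deaths", "recoveries"]:
        df[col] = df[col].fillna(0).clip(lower=0)

    return df
```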
### 2) E-Commerce Sales Cleansing & Modeling

#### Business Goal

- **Objective:** Enable an e-commerce store to track revenue, top products, and monthly growth accurately.
- **Key Insight:** Clean, reliable data helps marketing and finance forecast sales and optimize inventory.

#### Data & Dataset Description

- **Data Source:**
- **Data Format:** CSV with `InvoiceNo`, `StockCode`, `Description`, `Quantity`, `InvoiceDate`, `UnitPrice`, `CustomerID`.
- **Data Details:** Missing `CustomerID`s, negative quantities (returns), inconsistent date/time formats.

#### Expectations

1. **Ingest**
   - Download CSV from UCI/Kaggle, store locally or in cloud.
   - Load into Postgres using Python or an Airflow ingest task (see the sketch below).
2. **Cleanse**
   - Fix negative `Quantity` for returns or exclude them if needed.
   - Convert `InvoiceDate` to a standard timestamp.
   - Remove duplicates or missing CustomerIDs where appropriate.
3. **Model in Postgres**
   - `fact_sales`: invoice number, product key, date key, customer key, quantity, total amount.
   - `dim_product`, `dim_date`, `dim_customer` for analytics.
4. **Orchestrate**
   - Airflow pipeline for monthly or weekly ingestion and transformations.
5. **Simple Dashboard**
   - Charts: daily/weekly revenue, top products, growth rate.
   - Use Case: marketing sees product performance, finance monitors revenue trends.
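One way to combine the ingest and cleanse steps is sketched below, using the column names listed above; the Postgres connection string is a placeholder.

```python
# load_retail.py -- sketch of ingest + cleanse into a Postgres staging table
import pandas as pd
from sqlalchemy import create_engine


def load_sales(csv_path: str) -> None:
    df = pd.read_csv(csv_path)

    # Parse InvoiceDate and drop rows without a usable date or CustomerID.
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"], errors="coerce")
    df = df.dropna(subset=["InvoiceDate", "CustomerID"]).drop_duplicates()

    # Keep returns (negative Quantity) but flag them for downstream modeling.
    df["IsReturn"] = df["Quantity"] < 0

    # Placeholder connection string -- point this at your own Postgres instance.
    engine = create_engine("postgresql://postgres:postgres@localhost:5432/ecommerce")
    df.to_sql("stg_sales", engine, if_exists="replace", index=False)
```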
### 3) Real Estate Transactions Cleansing & Modeling

#### Business Goal

- **Objective:** Help realtors/agencies understand property price trends (average sale price, monthly volume) for better market insights.
- **Key Insight:** Standardized views reveal which neighborhoods are “hot” or undervalued.

#### Data & Dataset Description

- **Data Source:**
  - NYC Property Sales
  - Updated link: NYC New Website
- **Data Format:** CSV with address, sale price, sale date, building class, etc.
- **Data Details:** Potential outliers ($0 sales), partial addresses, inconsistent date formats.

#### Expectations

1. **Ingest**
   - Download monthly/annual real estate CSV, store in local/cloud.
   - Load into Postgres staging.
2. **Cleanse**
   - Remove invalid or zero sale prices (unless legitimate).
   - Standardize addresses, fix building class codes, unify sale dates.
3. **Model in Postgres**
   - `fact_property_sales`: sale date, sale price, property ID.
   - `dim_property` (address, building type), `dim_date`, `dim_location` (borough, ZIP).
4. **Orchestrate**
   - Airflow DAG for monthly refresh of property sales (see the DAG sketch below).
5. **Simple Dashboard**
   - Charts: average sale price by neighborhood, monthly sales volume.
   - Use Case: realtors identify market trends, plan listings/pricing.
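A minimal Airflow DAG for the monthly refresh might look like the sketch below (Airflow 2.x syntax; the callables imported from `pipelines.property_sales` are hypothetical placeholders):

```python
# dags/property_sales_dag.py -- minimal sketch; the imported callables are hypothetical
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from pipelines.property_sales import ingest_csv, cleanse_staging, build_models

with DAG(
    dag_id="property_sales_monthly",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_csv)
    cleanse = PythonOperator(task_id="cleanse", python_callable=cleanse_staging)
    model = PythonOperator(task_id="model", python_callable=build_models)

    # Ingest first, then cleanse the staging data, then rebuild the fact/dim tables.
    ingest >> cleanse >> model
```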
### 4) Social Media Insights with DBT (Tracking Sentiment on Elections)

#### Business Goal

- **Objective:** Provide public sentiment metrics on elections (e.g., US Elections) to inform political campaigns or media outlets.
- **Key Insight:** Understanding how sentiment shifts over time helps tailor messaging or coverage.

#### Data & Dataset Description

- **Data Source:**
- **Data Format:** CSV or JSON with tweet text, user handle, timestamp, possibly location.
- **Data Details:** Large volumes, duplicates, spam accounts, various timestamp formats.

#### Expectations

1. **Ingest**
   - Pull data from Kaggle, store raw in Postgres.
2. **Cleanse**
   - Remove duplicates, unify timestamps, filter out retweets or spam.
   - (Optional) run a sentiment analysis script to add a sentiment column (see the sketch below).
3. **Model in Postgres + DBT**
   - Staging: parse text fields, unify user IDs.
   - Marts: `fact_tweets` with sentiment scores, key hashtags, tweet counts.
4. **Orchestrate**
   - Airflow: daily ingestion and DBT transformations.
5. **Simple Dashboard**
   - Charts: sentiment vs. time, top hashtags, tweet volume by day.
   - Use Case: campaigns or media see trending sentiment on key election topics.
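For the optional sentiment step, one possibility (an assumption, since the project does not prescribe a library) is VADER via the `vaderSentiment` package:

```python
# add_sentiment.py -- optional enrichment step; VADER is just one possible scorer
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


def add_sentiment(df: pd.DataFrame, text_col: str = "tweet_text") -> pd.DataFrame:
    # "tweet_text" is an assumed column name -- rename to match your dataset.
    analyzer = SentimentIntensityAnalyzer()

    # The compound score ranges from -1 (most negative) to +1 (most positive).
    df["sentiment"] = df[text_col].fillna("").map(
        lambda text: analyzer.polarity_scores(text)["compound"]
    )
    return df
```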
### 5) Stock or Financial Data Pipeline with DBT

#### Business Goal

- **Objective:** Use Python (pandas, yfinance) and DBT to analyze daily stock price data for top 10 companies (e.g., AAPL, MSFT, AMZN, GOOGL, etc.), gaining insights into long-term trends, price volatility, and potential investment opportunities over time.
- **Key Insight:** Consolidating historical stock data and running transformations/tests in DBT allows traders, analysts, or finance teams to discover patterns, compare performance across companies, and create dashboards that highlight price movements and trading volume.

#### Data & Dataset Description

- **Data Source:**
  - Yahoo Finance (using the `yfinance` Python package).
- **Data Format:** CSV or direct DataFrame from yfinance, with columns like `Date`, `Open`, `High`, `Low`, `Close`, `Adj Close`, and `Volume`.
- **Data Details:** Daily trading days (no weekends/holidays), possible missing dates (holidays), stock splits, dividends, or corporate actions might need special consideration.

#### Expectations

1. **Ingest**
   - Write a Python script (or an Airflow task) using `yfinance` to download daily stock data for each of your top 10 companies (see the sketch below).
2. **Cleanse**
   - Standardize date formats (ensure `Date` is `YYYY-MM-DD`).
   - Handle or flag missing values.
3. **Model in Postgres + DBT**
   - Staging Models: unify all companies’ data, rename columns consistently.
   - Marts: Create a `fact_stock_prices` table (date, ticker, open, high, low, close, volume).
   - Implement DBT tests for data integrity.
4. **Orchestrate**
   - Use Airflow to schedule the ingestion and DBT transformations.
5. **Simple Dashboard**
   - Charts: compare closing prices of the 10 tickers over time, daily trading volume, moving averages.
   - Use Case: finance teams see which company outperforms others over certain timeframes.
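The ingestion step could be sketched as below; the ticker list and date range are examples only.

```python
# ingest_stocks.py -- sketch of the yfinance download step (tickers/dates are examples)
import pandas as pd
import yfinance as yf

TICKERS = ["AAPL", "MSFT", "AMZN", "GOOGL"]  # extend to your full top-10 list


def download_daily_prices(start: str = "2015-01-01") -> pd.DataFrame:
    frames = []
    for ticker in TICKERS:
        # Daily bars; auto_adjust=False keeps a separate "Adj Close" column.
        df = yf.download(ticker, start=start, interval="1d", auto_adjust=False)
        df = df.reset_index()
        df["Ticker"] = ticker
        frames.append(df)

    # Long format (one row per ticker per day) is convenient for loading into Postgres.
    return pd.concat(frames, ignore_index=True)
```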
### 6) US Weather Events (2016–2022) with DBT

#### Business Goal

- **Objective:** Build a data pipeline to ingest and analyze 8.6 million weather events across the US, enabling a deeper understanding of long-term weather patterns.
- **Key Insight:** By cleaning and modeling event data in a structured way, various stakeholders can detect trends (e.g., increased storm frequency), identify seasonal severity shifts, and allocate resources more effectively.

#### Data & Dataset Description

- **Data Source:**
- **Data Format:** CSV with columns like `EventId`, `Type` (Rain, Snow, etc.), `Severity`, `StartTime(UTC)`, etc.
- **Data Details:** 2,071 airport-based stations, event types like Rain, Snow, Fog, Storm, Hail, etc.

#### Expectations

1. **Ingest**
   - Download the large CSV (1.09 GB) from Kaggle (see the chunked-load sketch below).
2. **Cleanse**
   - Convert/standardize date/time fields.
   - Validate severity fields; remove or flag invalid data.
3. **Model in Postgres + DBT**
   - Staging Models: unify column names, parse timestamps.
   - Marts: `fact_weather_events`, `dim_station`, incremental DBT model if needed.
4. **Orchestrate**
   - Schedule a pipeline via Airflow for monthly or historical backfill.
5. **Simple Dashboard**
   - Charts: count of events by type over time, severity distribution by region, average precipitation by month/year.
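Because the file is around 1 GB, a chunked load keeps memory use bounded. The sketch below assumes a Postgres staging table and uses the column names listed above; the connection string is a placeholder.

```python
# ingest_weather.py -- chunked load of the large Kaggle CSV into a staging table
import pandas as pd
from sqlalchemy import create_engine


def load_weather_events(csv_path: str, chunksize: int = 100_000) -> None:
    # Placeholder connection string -- point this at your own Postgres instance.
    engine = create_engine("postgresql://postgres:postgres@localhost:5432/weather")

    for chunk in pd.read_csv(csv_path, chunksize=chunksize, parse_dates=["StartTime(UTC)"]):
        # Append each chunk so the whole file never has to sit in memory at once.
        chunk.to_sql("stg_weather_events", engine, if_exists="append", index=False)
```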
### 7) Spark-Based Log Processing with Parquet

#### Business Goal

- **Objective:** Analyze web server access logs at scale to identify traffic spikes, error patterns, and popular resources.
- **Key Insight:** DevOps and business stakeholders can leverage log insights to optimize site performance, detect security issues, and enhance user experience.

#### Data & Dataset Description

- **Data Source:**
  - Web Server Access Logs on Kaggle (~3.3 GB).
- **Data Format:** Nginx server access logs (IP, timestamp, HTTP method, endpoint, etc.).
- **Data Details:** Bot traffic, real user sessions, possible error codes.

#### Expectations

1. **Ingest**
   - Download compressed logs from Kaggle.
   - Store in local or HDFS/S3.
   - Use Spark to parse log lines (see the sketch below).
2. **Cleanse**
   - Extract columns: IP, timestamp, request path, status code, etc.
   - Filter malformed lines or large outliers.
3. **Write to Parquet**
   - Convert to structured schema, partition by date or status code.
4. **Orchestrate**
   - Airflow to automate daily/weekly runs.
5. **Simple Dashboard**
   - Charts: request volume over time, distribution of HTTP status codes, top endpoints.
   - Use Case: DevOps monitors traffic spikes or error surges.
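A PySpark sketch of the parse-and-write step is shown below. The regex matches a common Nginx/combined log layout and will likely need adjusting for the actual files; paths are examples.

```python
# parse_logs.py -- PySpark sketch: parse raw access logs and write partitioned Parquet
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract

spark = SparkSession.builder.appName("access-log-parsing").getOrCreate()

# Matches a common Nginx "combined" layout: IP, timestamp, "METHOD /path ...", status, size.
LOG_PATTERN = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)'

raw = spark.read.text("data/raw/access.log")  # example path

parsed = (
    raw.select(
        regexp_extract("value", LOG_PATTERN, 1).alias("ip"),
        regexp_extract("value", LOG_PATTERN, 2).alias("raw_timestamp"),
        regexp_extract("value", LOG_PATTERN, 3).alias("method"),
        regexp_extract("value", LOG_PATTERN, 4).alias("endpoint"),
        regexp_extract("value", LOG_PATTERN, 5).cast("int").alias("status"),
    )
    # Lines that do not match the pattern produce empty/null fields -- drop them.
    .filter((col("ip") != "") & col("status").isNotNull())
)

parsed.write.mode("overwrite").partitionBy("status").parquet("data/curated/access_logs")
```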
### 8) End-to-End Data Lake (Generic Implementation)

#### Business Goal

- **Objective:** Establish a scalable, cost-efficient repository for large datasets (e.g., NYC Taxi data) for ad-hoc analytics and historical trend exploration.
- **Key Insight:** Users can query historical data on demand, quickly producing insights without the overhead of a traditional data warehouse.

#### Data & Dataset Description

- **Data Source (Examples):**
  - NYC Taxi Trip Records
  - data.gov open datasets
- **Data Format:** CSV or Parquet, potentially gigabytes to terabytes.
- **Data Details:** Often monthly or quarterly releases; partitioning is crucial.

#### Expectations

1. **Ingest**
   - Store raw files in a landing zone (cloud bucket, on-prem).
2. **Cleanse & Curate**
   - Use Spark to read raw data, fix data types, handle outliers.
   - Convert to partitioned Parquet in a curated zone (see the sketch below).
3. **Orchestrate**
   - Airflow or similar workflow for scheduled tasks.
4. **(Optional) Model in a Relational DB**
   - For aggregated or frequently accessed summaries.
5. **Simple Dashboard & Business Queries**
   - Query Engine: Presto, Hive, Trino, Spark SQL, or cloud-native (Athena, BigQuery).
   - Use Case: city planners, taxi companies, and policy makers can quickly analyze usage trends.
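A curation sketch with PySpark is below. The paths and the NYC Taxi column names (`tpep_pickup_datetime`, `fare_amount`) are assumptions to be checked against the actual schema.

```python
# curate_trips.py -- PySpark sketch: landing-zone CSV -> partitioned Parquet curated zone
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, month, to_date, year

spark = SparkSession.builder.appName("taxi-curation").getOrCreate()

# Example paths -- swap in your own landing/curated locations (local, S3, HDFS, ...).
raw = spark.read.option("header", True).csv("s3a://my-lake/landing/yellow_tripdata_*.csv")

curated = (
    raw.withColumn("pickup_ts", col("tpep_pickup_datetime").cast("timestamp"))
    .withColumn("fare_amount", col("fare_amount").cast("double"))
    .filter(col("fare_amount") >= 0)  # drop obviously invalid fares
    .withColumn("pickup_date", to_date(col("pickup_ts")))
    .withColumn("year", year(col("pickup_date")))
    .withColumn("month", month(col("pickup_date")))
)

# Partitioning by year/month keeps ad-hoc queries over a time range cheap.
curated.write.mode("overwrite").partitionBy("year", "month").parquet(
    "s3a://my-lake/curated/yellow_trips"
)
```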
### 9) Spark-Based Historical Tracking (Dimension Changes over Time)

#### Business Goal

- **Objective:** Maintain full historical records of changing electronic product details (e.g., prices, availability) for accurate point-in-time analysis.
- **Key Insight:** By storing every change rather than overwriting records, teams can analyze how product prices/conditions evolve over time.

#### Data & Dataset Description

- **Dataset:**
- **Data Format:** CSV with multiple columns such as price, condition, dateSeen, merchant.
- **Data Details:** A single `id` might appear multiple times with different prices or conditions.

#### Expectations

1. **Ingest**
   - Periodically retrieve CSV from Kaggle.
2. **Cleanse**
   - Validate `id`, convert `prices.dateSeen` to timestamp.
3. **Write to Parquet with Historical Tracking**
   - Append new entries instead of overwriting for each product’s snapshot in time (see the sketch below).
4. **Orchestrate**
   - Airflow to automate CSV retrieval and Spark jobs.
5. **Simple Dashboard**
   - Charts: price trend over time, condition vs. price, merchant competition.
   - Use Case: e-commerce analysts see historical pricing to decide on promotions or competitor matching.
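The historical-tracking write can be as simple as appending each batch with a snapshot date, as sketched below (paths are examples; verify column handling against the Kaggle file).

```python
# track_price_history.py -- PySpark sketch: append snapshots instead of overwriting
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date

spark = SparkSession.builder.appName("product-price-history").getOrCreate()

snapshot = (
    spark.read.option("header", True)
    .csv("data/raw/electronics_products.csv")  # example path
    .withColumn("snapshot_date", current_date())  # tag every row with this batch's load date
)

# Append mode keeps every earlier snapshot, so a given product id can be traced over time.
(
    snapshot.write.mode("append")
    .partitionBy("snapshot_date")
    .parquet("data/curated/product_price_history")
)
```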
### 10) Basic Machine Learning Pipeline Integration

#### Business Goal

- **Objective:** Predict customer churn to help telecom providers improve retention strategies or cross-sell opportunities.
- **Key Insight:** Storing model outputs in Postgres lets business teams query high-risk customers and compare predicted outcomes to actual behavior.

#### Data & Dataset Description

- **Data Source:**
- **Data Format & Columns:** `customerID`, `gender`, `tenure`, `InternetService`, etc.
- **Data Details:** Missing values, categorical variables needing encoding, outliers in numeric columns.

#### Expectations

1. **Ingest**
   - Download CSV and store locally or in cloud.
   - Load into Postgres staging.
2. **Cleanse**
   - Handle missing data, encode categorical fields.
   - Detect/remove outliers.
3. **Model in Postgres**
   - Store cleaned dataset in `fact_churn_features`.
   - Train a classification model (e.g., logistic regression).
   - Write predictions back to a Postgres table (see the sketch below).
4. **Orchestrate**
   - Airflow to automate data prep, model training, and prediction generation.
5. **Simple Dashboard**
   - Charts: Churn rate by segment, predicted vs. actual churn.
   - Use Case: Marketing or retention teams filter high-risk customers for proactive promotions.
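A sketch of the train-and-write-back step is below, assuming the cleaned features already live in `fact_churn_features` and that the dataset has a `Churn` column with Yes/No labels (the connection string is a placeholder).

```python
# churn_model.py -- sketch of training a classifier and writing predictions back to Postgres
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sqlalchemy import create_engine

# Placeholder connection string -- point this at your own Postgres instance.
engine = create_engine("postgresql://postgres:postgres@localhost:5432/telco")

df = pd.read_sql("SELECT * FROM fact_churn_features", engine)

# One-hot encode categorical fields; assumes a "Churn" label column with Yes/No values.
X = pd.get_dummies(df.drop(columns=["customerID", "Churn"]))
y = (df["Churn"] == "Yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Holdout accuracy: {model.score(X_test, y_test):.3f}")

# Write per-customer churn probabilities back to Postgres for business queries.
predictions = pd.DataFrame(
    {
        "customerID": df.loc[X_test.index, "customerID"].values,
        "churn_probability": model.predict_proba(X_test)[:, 1],
    }
)
predictions.to_sql("churn_predictions", engine, if_exists="replace", index=False)
```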
> [!WARNING]
> Always verify licensing for any public dataset you use in production. Some have non-commercial clauses or require attribution.
> [!CAUTION]
> When working with personally identifiable information (PII) or sensitive data, follow GDPR, CCPA, or relevant privacy regulations. Failure to comply can result in legal and financial penalties.
Happy Data Engineering!
Feel free to open a PR or an issue with any questions or suggestions.