CSV Data Preprocessing with Automation
Project Overview
This project automates the preprocessing of CSV files by filtering, cleaning, and storing the data efficiently. It processes multiple CSV files in a directory and saves the cleaned outputs with dynamically generated filenames based on the year present in the input filenames. The cleaned data is stored both as CSV files and in an SQLite database for structured querying.
🎯 Features
✅ Batch Processing: Processes all CSV files in a directory automatically.
✅ Data Filtering: Removes rows with missing 'Date' or 'Product' values.
✅ Missing Value Imputation: Fills missing 'Sales' and 'Revenue' values with the column median.
✅ Database Storage: Saves processed data into an SQLite database for structured retrieval.
✅ Dynamic Output Naming: Generates filenames as cleaned_<year>.csv, where <year> is taken from the input filename.
✅ Automatic Directory Handling: Creates the output directory if it doesn't exist.
✅ Logging System: Tracks each processing step and any errors for debugging.
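The filtering and imputation steps listed above can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual filter.py or imputation.py; the helper names (is_missing, filter_rows, impute_median) and the sample rows are made up for the example, and it assumes that a blank string counts as a missing value.

```python
import statistics

REQUIRED = ("Date", "Product")   # rows missing either column are dropped
NUMERIC = ("Sales", "Revenue")   # gaps here are filled with the column median


def is_missing(value):
    """Treat None and blank or whitespace-only strings as missing (an assumption)."""
    return value is None or str(value).strip() == ""


def filter_rows(rows):
    """Keep only rows that have every required column filled in."""
    return [r for r in rows if not any(is_missing(r.get(c)) for c in REQUIRED)]


def impute_median(rows):
    """Return copies of the rows with missing numeric values set to the column median."""
    cleaned = [dict(r) for r in rows]
    for col in NUMERIC:
        present = [float(r[col]) for r in cleaned if not is_missing(r.get(col))]
        if not present:
            continue  # no values to take a median from; leave the column untouched
        median = statistics.median(present)
        for r in cleaned:
            r[col] = median if is_missing(r.get(col)) else float(r[col])
    return cleaned


# Hypothetical sample data, for illustration only.
rows = [
    {"Date": "2023-01-05", "Product": "Laptop", "Sales": "10", "Revenue": "5000"},
    {"Date": "2023-01-06", "Product": "Mouse", "Sales": "", "Revenue": "300"},
    {"Date": "", "Product": "Keyboard", "Sales": "4", "Revenue": "200"},
    {"Date": "2023-01-07", "Product": "Monitor", "Sales": "2", "Revenue": ""},
]
cleaned = impute_median(filter_rows(rows))
```

Filtering runs before imputation here so that rows about to be discarded do not influence the medians; whether the project uses the same order is not stated.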
Logging System
All logs are stored in data_preprocessing.log, tracking:
🔹 INFO: Normal operations such as filtering and saving.
🔹 WARNING: Non-critical issues, such as missing values that are handled by imputation.
🔹 ERROR: Failures such as missing input files or database errors.
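A logger that writes these three levels to data_preprocessing.log could look roughly like the sketch below. It uses Python's standard logging module; the function name get_logger, the logger name, and the record format are assumptions for illustration and may not match the project's logger.py.

```python
import logging


def get_logger(log_path="data_preprocessing.log"):
    """Return a logger that appends timestamped records to log_path."""
    logger = logging.getLogger("data_preprocessing")  # hypothetical logger name
    logger.setLevel(logging.INFO)  # INFO and above: INFO, WARNING, ERROR
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A caller would then use log = get_logger() followed by log.info(...), log.warning(...), or log.error(...) at the matching points in the pipeline.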
Project Structure
Dat_Preprocessor
├── filter.py         # Filters data based on required columns
├── imputation.py     # Handles missing value imputation
├── database.py       # Manages SQLite database operations
├── logger.py         # Implements logging system
├── main.py           # CLI entry point for execution
├── requirements.txt  # Dependencies
└── README.md         # Project documentation
Installation & Usage
1️⃣ Setup Environment
python -m venv venv
source venv/bin/activate # On Windows use venv\Scripts\activate
pip install -r requirements.txt
2️⃣ Run the Script
python main.py files/ output_files/
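The command above takes an input directory and an output directory. The sketch below shows one way such an entry point might parse those two arguments, create the output directory, and derive each output filename from the year in the input filename. It is not the project's main.py: the function names are hypothetical, the output pattern cleaned_<year>.csv is an assumption, and the year is assumed to be a four-digit number starting with 19 or 20.

```python
import argparse
import re
from pathlib import Path


def output_name(input_path):
    """Build cleaned_<year>.csv from the first four-digit year in the filename."""
    match = re.search(r"(?<!\d)(19|20)\d{2}(?!\d)", Path(input_path).stem)
    if match is None:
        raise ValueError(f"no year found in filename: {input_path}")
    return f"cleaned_{match.group(0)}.csv"


def plan_outputs(input_dir, output_dir):
    """Create output_dir if needed and map each input CSV to its output path."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    sources = sorted(Path(input_dir).glob("*.csv"))
    return {src: out / output_name(src) for src in sources}


def parse_args(argv=None):
    """Parse the two positional arguments: input directory and output directory."""
    parser = argparse.ArgumentParser(
        description="Preprocess every CSV file in a directory."
    )
    parser.add_argument("input_dir", help="directory containing the raw CSV files")
    parser.add_argument("output_dir", help="directory for the cleaned CSV files")
    return parser.parse_args(argv)
```

Raising an error for a filename without a year is one possible choice; a real pipeline might instead log the problem and skip the file.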
3️⃣ Query Cleaned Data (Example SQLite Query)
SELECT * FROM sales_data WHERE Product = 'Laptop';
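The same query can be run from Python with the standard sqlite3 module. The sketch below is an illustration under assumptions: the table name sales_data comes from the query above, but the column types, the helper names, and the sample rows are invented for the example, and an in-memory database stands in for the project's real database file.

```python
import sqlite3


def save_rows(conn, rows):
    """Create sales_data if needed and insert the cleaned rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_data ("
        "Date TEXT, Product TEXT, Sales REAL, Revenue REAL)"  # assumed schema
    )
    conn.executemany(
        "INSERT INTO sales_data (Date, Product, Sales, Revenue) "
        "VALUES (:Date, :Product, :Sales, :Revenue)",
        rows,
    )
    conn.commit()


def rows_for_product(conn, product):
    """Parameterized form of the example query; avoids building SQL from strings."""
    cur = conn.execute("SELECT * FROM sales_data WHERE Product = ?", (product,))
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
save_rows(conn, [
    {"Date": "2023-01-05", "Product": "Laptop", "Sales": 10, "Revenue": 5000},
    {"Date": "2023-01-06", "Product": "Mouse", "Sales": 6, "Revenue": 300},
])
laptops = rows_for_product(conn, "Laptop")
```

Passing the product name as a bound parameter rather than formatting it into the SQL text keeps the query safe when the value comes from user input.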
🤔 Why Is CSV Preprocessing Important?
🔹 Data Cleaning & Transformation: Ensures structured and error-free data.
🔹 Automated Processing: Saves time by handling large datasets efficiently.
🔹 Better Analysis & Machine Learning: Provides clean data for insightful analysis.
🔹 Database Storage: Enables structured querying and retrieval of processed data.
This project provides an automated, scalable, and efficient solution for CSV data preprocessing, making data ready for analysis and machine learning tasks.