


SHossainP/Data-Refiner


Data-Refiner

πŸ“Œ CSV Data Preprocessing with Automation

Project Overview

This project automates the preprocessing of CSV files by filtering, cleaning, and storing the data efficiently. It processes multiple CSV files in a directory and saves the cleaned outputs with dynamically generated filenames based on the year present in the input filenames. The cleaned data is stored both as CSV files and in an SQLite database for structured querying.

🎯 Features

βœ… Batch Processing: Processes all CSV files in a directory automatically.

βœ… Data Filtering: Removes rows with missing 'Date' or 'Product' values.

βœ… Missing Value Imputation: Fills missing 'Sales' and 'Revenue' using column median.

βœ… Database Storage: Saves processed data into an SQLite database for structured retrieval.

βœ… Dynamic Output Naming: Generates filenames as cleaned_<year>.csv, using the year found in the input filename.

βœ… Automatic Directory Handling: Creates an output directory if it doesn’t exist.

βœ… Logging System: Tracks each processing step and errors for debugging.
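The filtering, imputation, and naming features above can be sketched as a single cleaning pass. The source files (filter.py, imputation.py) are not shown here, so the function below is an illustrative composite, not the project's actual code; the function name `clean_csv` and the `"Date"`, `"Product"`, `"Sales"`, and `"Revenue"` column names are taken from the feature descriptions.

```python
import re
from pathlib import Path

import pandas as pd


def clean_csv(input_path: str, output_dir: str) -> Path:
    """Illustrative cleaning pass: filter, impute, and save with a year-based name."""
    df = pd.read_csv(input_path)

    # Data filtering: drop rows with missing 'Date' or 'Product' values
    df = df.dropna(subset=["Date", "Product"])

    # Missing-value imputation: fill 'Sales' and 'Revenue' with the column median
    for col in ("Sales", "Revenue"):
        df[col] = df[col].fillna(df[col].median())

    # Dynamic output naming: pull a four-digit year from the input filename
    match = re.search(r"(\d{4})", Path(input_path).stem)
    year = match.group(1) if match else "unknown"

    # Automatic directory handling: create the output directory if needed
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"cleaned_{year}.csv"
    df.to_csv(out_path, index=False)
    return out_path
```

For example, `clean_csv("files/sales_2021.csv", "output_files/")` would write `output_files/cleaned_2021.csv`.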

πŸ“„ Logging System

All logs are stored in data_preprocessing.log, tracking:

πŸ”Ή INFO: Normal operations such as filtering and saving.
πŸ”Ή WARNING: Non-critical issues, such as missing values that were handled by imputation.
πŸ”Ή ERROR: Failures such as missing input files or database errors.
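Since logger.py itself is not shown, here is a minimal sketch of how such a file-based logger could be configured with Python's standard logging module; the function name `get_logger` and the sample messages are assumptions for illustration.

```python
import logging


def get_logger(name: str = "data_preprocessor") -> logging.Logger:
    """Configure a logger that writes to data_preprocessing.log (illustrative)."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        logger.setLevel(logging.INFO)
        handler = logging.FileHandler("data_preprocessing.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Typical calls mirroring the three levels above:
# logger.info("Saved cleaned file: output_files/cleaned_2021.csv")
# logger.warning("Imputed 5 missing 'Sales' values with the column median")
# logger.error("Input file not found: files/sales_2020.csv")
```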

πŸ“‚ Project Structure

πŸ“ Dat_Preprocessor

│── filter.py # Filters data based on required columns

│── imputation.py # Handles missing value imputation

│── database.py # Manages SQLite database operations

│── logger.py # Implements logging system

│── main.py # CLI entry point for execution

│── requirements.txt # Dependencies

│── README.md # Project documentation

πŸš€ Installation & Usage

1️⃣ Setup Environment

python -m venv venv
source venv/bin/activate   # On Windows use venv\Scripts\activate
pip install -r requirements.txt

2️⃣ Run the Script

python main.py files/ output_files/
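The command above passes an input directory and an output directory as positional arguments. Since main.py is not shown, the sketch below illustrates one plausible argparse-based entry point matching that usage; `build_parser` is a hypothetical name, and the cleaning call inside the loop is left as a placeholder.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """CLI matching `python main.py files/ output_files/` (illustrative)."""
    parser = argparse.ArgumentParser(description="Batch-clean CSV sales data.")
    parser.add_argument("input_dir", help="Directory containing raw CSV files")
    parser.add_argument("output_dir", help="Directory for cleaned output files")
    return parser


def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    # Batch processing: iterate over every CSV in the input directory
    for csv_path in sorted(Path(args.input_dir).glob("*.csv")):
        # the project's cleaning routine would be invoked here for each file
        print(f"Processing {csv_path} -> {args.output_dir}")
```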

3️⃣ Query Cleaned Data (Example SQLite Query)

SELECT * FROM sales_data WHERE Product = 'Laptop';
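The same query can also be run from Python with the standard sqlite3 module. The snippet below builds a small in-memory table mirroring the `sales_data` example (the project would connect to its own database file instead; the column set and sample rows here are assumptions):

```python
import sqlite3

# In-memory stand-in for the project's SQLite database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_data (Date TEXT, Product TEXT, Sales REAL, Revenue REAL)"
)
conn.executemany(
    "INSERT INTO sales_data VALUES (?, ?, ?, ?)",
    [
        ("2021-01-01", "Laptop", 3, 2400.0),
        ("2021-01-02", "Mouse", 10, 150.0),
    ],
)

# Parameterized version of the example query above
rows = conn.execute(
    "SELECT * FROM sales_data WHERE Product = ?", ("Laptop",)
).fetchall()
print(rows)  # only the 'Laptop' rows
```

Using a `?` placeholder instead of string formatting keeps the query safe if the product name ever comes from user input.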

πŸ€– Why Is CSV Preprocessing Important?

πŸ”Ή Data Cleaning & Transformation: Ensures structured, error-free data.
πŸ”Ή Automated Processing: Saves time by handling large datasets efficiently.
πŸ”Ή Better Analysis & Machine Learning: Provides clean data for insightful analysis.
πŸ”Ή Database Storage: Enables structured querying and retrieval of processed data.

This project provides an automated, scalable, and efficient solution for CSV data preprocessing, making data ready for analysis and machine learning tasks.
