This repository contains a collection of my latest data science and machine learning projects. Each project highlights specific techniques, tools, and technologies used to solve real-world problems and derive actionable insights.
- Customer Segmentation and Market Basket Analysis for E-commerce Retail
- Market Price Prediction
- Movie Genre Classification
- Predictive Modeling for Disease Diagnosis
- Credit Card Transactions Fraud Detection
- NLP Newsgroups Classification and Deployment
- Cab Industry Analysis: Data Exploration, Hypothesis Testing, and Strategic Recommendations
- New York Housing Market Analysis and Price Prediction
- Gym Members Calories Prediction with CatBoost
- Air Quality Prediction
- Description: Led a data-driven project focused on customer segmentation, sales trend analysis, and market basket analysis using a dataset from a UK-based online retailer. The project involved in-depth exploration of customer purchasing patterns, segmentation based on Recency, Frequency, Monetary (RFM) analysis, and discovery of product associations using the Apriori algorithm. The outcomes provided valuable insights for enhancing marketing strategies, product placement, and inventory management.
- Technologies Used: Python, pandas, Seaborn, Scikit-learn, NetworkX, mlxtend
- Techniques: RFM-T Segmentation, Market Basket Analysis, K-Means Clustering, Apriori Algorithm
- Key Impact:
- Identified customer segments for targeted marketing and retention strategies.
- Discovered high-confidence product association rules for effective cross-selling and product bundling.
- Provided actionable insights for marketing campaigns, product placement, and inventory management strategies.
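
The association metrics behind rules like these can be sketched in a few lines. This is a minimal illustration, not the project's code: it counts item pairs exhaustively instead of using Apriori's candidate pruning (which mlxtend provides), and the baskets, the `pair_rules` helper, and the 0.5 support threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets; the real project used a UK online retail dataset.
baskets = [
    {"tea", "mug"},
    {"tea", "mug", "spoon"},
    {"tea", "spoon"},
    {"mug"},
]

def pair_rules(baskets, min_support=0.5):
    """Return support, confidence, and lift for every frequent item pair."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # drop infrequent pairs
        # Each frequent pair yields two directed rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            lift = confidence / (item_counts[rhs] / n)
            rules.append({
                "lhs": lhs, "rhs": rhs,
                "support": support, "confidence": confidence, "lift": lift,
            })
    return rules

rules = pair_rules(baskets)
for rule in rules:
    print(rule)
```

A lift above 1 suggests the two items are bought together more often than chance would predict, which is the signal typically used for cross-selling and bundling decisions.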
- Description: Developed a robust time series forecasting model for market analysis, focusing on predicting the quantity and prices of commodities based on historical data. The project involved data preprocessing, exploratory data analysis, feature engineering, model selection, training, and evaluation. Several models were tested, including ARIMA, SARIMA, Prophet, and LSTM, with LSTM models showing significant promise, especially in price forecasting.
- Technologies Used: Python, Pandas, NumPy, ARIMA, SARIMA, Prophet, LSTM
- Key Impact:
- Achieved high accuracy in forecasting commodity prices using the LSTM model.
- Contributed to optimizing inventory management and pricing strategies.
- Provided actionable insights for market analysis.
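
Sequence models such as LSTMs are usually trained on fixed-length windows of past values. The sketch below shows that windowing step and a naive last-value baseline to compare any forecaster against. The `make_windows` helper, the window length, and the toy prices are assumptions for illustration, not the project's actual pipeline.

```python
import numpy as np

def make_windows(series, window):
    """Split a 1-D series into (past window, next value) training pairs."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

# Hypothetical daily prices.
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
X, y = make_windows(prices, window=3)

# Naive baseline: predict that the next price equals the last observed one.
naive_pred = X[:, -1]
mae = float(np.mean(np.abs(y - naive_pred)))

print(X.shape, y, mae)
```

A learned model is only worth deploying if its error on held-out windows is clearly below this baseline.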
- Description: Developed a comprehensive machine learning pipeline to classify movie genres based on descriptions using models such as Logistic Regression, SVM, Random Forest, and XGBoost. Explored feature extraction techniques, including TF-IDF, Word2Vec, and GloVe embeddings. The approach involved preprocessing, model training, evaluation, and deployment.
- Technologies Used: Python, Pandas, NumPy, Scikit-learn, XGBoost, Word2Vec, GloVe
- Key Impact:
- Achieved an accuracy of 0.58 with the SVM model using TF-IDF features.
- Compared TF-IDF, Word2Vec, and GloVe features across several classifiers for text classification.
- Provided a foundation for recommendation systems.
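
The TF-IDF plus SVM combination can be sketched as a scikit-learn pipeline. The four descriptions and two genre labels below are invented, and the default hyperparameters are an assumption. The real project used a much larger dataset and many more genres.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical movie descriptions and genres.
descriptions = [
    "A detective hunts a killer in the city",
    "Police investigate a murder mystery",
    "Astronauts explore a distant planet",
    "A robot crew travels through space",
]
genres = ["crime", "crime", "sci-fi", "sci-fi"]

# TF-IDF turns each description into a sparse weighted bag of words;
# a linear-kernel SVM then separates the genres in that space.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(descriptions, genres)

predictions = model.predict([
    "A detective investigates a murder",
    "Robot astronauts in space",
])
print(predictions)
```

Keeping the vectorizer and classifier in one pipeline means the same preprocessing is applied at training and prediction time.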
- Description: Built predictive models to classify individuals into diseased or non-diseased categories based on health attributes. The project aimed to assist healthcare professionals in early detection and personalized patient care.
- Technologies Used: Python, Pandas, Scikit-learn, XGBoost, SHAP
- Key Impact:
- Achieved 99.5% accuracy with the XGBoost model.
- Provided a reliable tool for early disease detection, enhancing patient outcomes.
- Description: Developed machine learning models to detect fraudulent credit card transactions. The project involved data preprocessing, feature engineering, and extensive exploratory data analysis (EDA).
- Technologies Used: Python, Scikit-learn, XGBoost, RandomForest, SMOTE
- Key Impact:
- Built a well-balanced fraud detection system with RandomForest and XGBoost models.
- Improved precision and recall for fraud detection.
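
Precision and recall matter here because fraud is rare, so accuracy alone can look good for a useless model. The sketch below uses made-up labels to show the effect. It does not reproduce the project's SMOTE resampling or its models.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 95 legitimate transactions (0) and 5 fraudulent ones (1).
y_true = [0] * 95 + [1] * 5

# A "model" that never flags fraud.
pred_never = [0] * 100
# A model that raises 2 false alarms and catches 4 of the 5 frauds.
pred_model = [1] * 2 + [0] * 93 + [1] * 4 + [0] * 1

def summarize(y_true, y_pred):
    """Collect accuracy, precision, and recall for the fraud class."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
    }

never = summarize(y_true, pred_never)
model = summarize(y_true, pred_model)
print(never)
print(model)
```

The never-flag model still reaches 95% accuracy while catching no fraud at all, which is why recall and precision on the fraud class are the numbers to watch.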
- Description: Developed a robust document classification system using the 20 Newsgroups dataset. The system classifies documents into categories, with applications in spam filtering and sentiment analysis.
- Technologies Used: Python, Scikit-learn, SpaCy, NLTK
- Key Impact:
- Achieved an F1-score of 0.83 and ROC-AUC score of 0.987.
- Successfully deployed the model for real-time classification.
- Description: Analyzed U.S. cab industry data to identify the most suitable company for investment. The project focused on customer usage patterns, market dynamics, and profitability trends.
- Technologies Used: Python, Pandas, Statsmodels
- Key Impact:
- Provided strategic recommendations for investment based on market dynamics.
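
A typical hypothesis test for this kind of comparison is whether average profit per ride differs between two companies. The sketch below uses Welch's t-test from SciPy rather than Statsmodels, and the profit figures and company names are invented. Which tests the project actually ran is not stated above.

```python
from scipy import stats

# Hypothetical profit per ride (USD) for two cab companies.
company_a = [12.0, 15.0, 14.0, 16.0, 13.0, 15.0]
company_b = [8.0, 9.0, 7.0, 10.0, 9.0, 8.0]

# Welch's t-test does not assume equal variances in the two groups.
result = stats.ttest_ind(company_a, company_b, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

alpha = 0.05  # conventional significance level, chosen here for illustration
reject_null = p_value < alpha
print(t_stat, p_value, reject_null)
```

Rejecting the null hypothesis here would support the claim that one company is more profitable per ride, subject to the usual caveats about sample size and independence.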
- Description: Developed a machine learning pipeline for predicting housing prices in New York. Included data collection, exploratory data analysis, model training, and deployment.
- Technologies Used: Python, XGBoost, Flask
- Key Impact:
- Achieved an R^2 score of 0.775 for housing price predictions.
- Delivered a functional web app for real-time price prediction.
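
For reference, R^2 measures the share of variance in the target that the model explains. A minimal sketch of the calculation follows, with made-up prices. `r_squared` is a hypothetical helper equivalent in intent to scikit-learn's `r2_score`.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical prices in units of $100k.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.5, 8.5]
score = r_squared(actual, predicted)
print(score)
```

A score of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean price.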
- Description: This project predicts the number of calories burned by gym members during exercise sessions based on health and activity features. The model was trained using CatBoost, achieving high accuracy. It was deployed as a web service via FastAPI, containerized with Docker for seamless deployment. The project emphasizes real-time predictions for personalized fitness planning and progress tracking.
- Technologies Used: Python, CatBoost, FastAPI, SHAP, Pandas, NumPy, Docker, Uvicorn, Optuna
- Techniques: RFE, Feature Engineering, Data Preprocessing, Hyperparameter Tuning (Optuna)
- Key Impact:
- Achieved a low RMSE of 8.13, indicating high prediction accuracy.
- Deployed a scalable web service for real-time calorie predictions.
- Enhanced personalized fitness tracking and provided actionable insights for gym members.
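
RMSE is expressed in the same unit as the target, so a value of 8.13 corresponds to a typical error of about 8 calories per session. A minimal sketch of the calculation follows. The `rmse` helper and the sample values are illustrative only.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical calories burned vs. predicted for two sessions.
actual = [100.0, 200.0]
predicted = [103.0, 196.0]
error = rmse(actual, predicted)
print(error)
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.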
- Description: Developed and deployed a machine learning-based system to predict air quality levels using a dataset of environmental and demographic metrics. The project included extensive data preprocessing, exploratory data analysis, model selection, and hyperparameter tuning. The final solution was deployed as a web service using FastAPI, Docker, and Kubernetes, with integrated monitoring via Prometheus and Grafana. The deployed application provides real-time air quality predictions, enabling actionable insights for governments, industries, and individuals to mitigate the effects of air pollution.
- Technologies Used: Python, pandas, Seaborn, Scikit-learn, CatBoost, XGBoost, LightGBM, FastAPI, Docker, Kubernetes, Prometheus, Grafana, Render
- Techniques: Class Imbalance Handling, Weighted Metrics (Weighted F1-Score), Feature Engineering, Optuna Hyperparameter Tuning, Containerization, Cloud Deployment, Monitoring
- Key Impact:
- Achieved a high Weighted F1-Score of 0.9578 using the CatBoost model, demonstrating its effectiveness in handling imbalanced datasets and predicting critical air quality levels.
- Identified key environmental factors like Carbon Monoxide (CO) and proximity to industrial areas as major contributors to poor air quality.
- Successfully deployed the application in a production environment, offering an interactive API for real-time air quality predictions.
- Integrated monitoring tools (Prometheus and Grafana) for tracking service performance and usage metrics, ensuring reliability and transparency.
- Provided actionable insights to stakeholders for improving public health and environmental policies.
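
The weighted F1-score averages per-class F1 values in proportion to how often each class occurs, whereas macro F1 treats every class equally. The sketch below contrasts the two on invented labels. The three class codes and the predictions are assumptions, not the project's data.

```python
from sklearn.metrics import f1_score

# Hypothetical air quality classes: 0 = good, 1 = moderate, 2 = hazardous.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]

# Weighted: per-class F1 averaged by class frequency in y_true.
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
# Macro: per-class F1 averaged with equal weight per class.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(weighted_f1, macro_f1)
```

In this toy case the rare hazardous class is never predicted, and the weighted score hides much of that failure. Reporting macro F1 or per-class recall alongside the weighted score is one way to keep rare but critical classes visible.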