Before cleaning vs. after cleaning (visual comparison).
This repository provides a comprehensive, step-by-step tutorial on handling missing data within the Titanic dataset using Python's Pandas library. Missing data is a prevalent challenge in real-world datasets, and mastering its treatment is a fundamental skill for any aspiring Data Scientist or AI Developer.
This tutorial guides you through identifying, visualizing, and strategically imputing or dropping missing values, ensuring your data is clean and prepared for robust analysis or machine learning model training.
- Load & Inspect Data: Learn to load CSV files and get a first glance at dataset structure and data types.
- Missing Data Detection: Identify missing values by count and percentage in each column.
- Visualizing Missing Patterns: Utilize Seaborn (and optionally missingno) to graphically understand missing data distribution and relationships (a short missingno sketch follows this list).
- Strategic Handling: Apply various techniques, including:
  - Dropping rows with missing values.
  - Dropping columns with excessive missing values (e.g., Cabin).
  - Imputing numerical missing values (e.g., Age with the median).
  - Imputing categorical missing values (e.g., Embarked with the mode).
- Code-Along Tutorial: Follow clear, explained code snippets in a Jupyter Notebook.
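If you want the optional missingno views, here is a minimal sketch. It assumes missingno is installed and train.csv is in the working directory; it uses the library's standard matrix and bar plots and is not code taken from the notebook:

```python
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")

msno.matrix(df)   # nullity matrix: white gaps mark missing entries
msno.bar(df)      # bar chart: non-missing count per column
plt.show()
```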
This project requires the following Python libraries. You can install them using pip:
pip install pandas numpy matplotlib seaborn missingno # missingno is optional but recommended
This tutorial guides you through the following core steps in the titanic_missing_data_tutorial.ipynb Jupyter Notebook (a condensed code sketch of the workflow follows the list):
- Environment Setup: Import pandas, numpy, matplotlib, and seaborn.
- Data Loading: Read train.csv into a Pandas DataFrame.
- Initial Inspection:
  - View first few rows (df.head()).
  - Get summary info, non-null counts, and data types (df.info()).
  - Calculate number and percentage of missing values (df.isnull().sum()).
- Missing Data Visualization:
  - Use seaborn.heatmap to visualize missing data patterns.
  - (Optional) Utilize missingno for specialized nullity matrices and bar charts.
- Handling Missing Values:
  - Dropping Rows: Remove rows with any missing values (use cautiously to avoid excessive data loss).
  - Dropping Columns: Remove columns with a high percentage of missing values (e.g., Cabin).
  - Imputing Numerical Data: Fill missing numerical values (e.g., Age) with the median (robust for skewed data) or mean (for symmetrical data).
  - Imputing Categorical Data: Fill missing categorical values (e.g., Embarked) with the mode (most frequent category).
- Final Verification: Confirm all missing values are addressed using df.isnull().sum().
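Here is a condensed sketch of these steps. It is illustrative rather than copied from the notebook: column names match the Kaggle train.csv, but the exact imputation choices and plot styling are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data Loading
df = pd.read_csv("train.csv")

# Initial Inspection
print(df.head())
df.info()
missing = pd.DataFrame({
    "count": df.isnull().sum(),
    "percent": (df.isnull().mean() * 100).round(2),
})
print(missing.sort_values("percent", ascending=False))

# Missing Data Visualization: light cells mark missing entries
sns.heatmap(df.isnull(), cbar=False)
plt.title("Missing values before cleaning")
plt.show()

# Handling Missing Values
df = df.drop(columns=["Cabin"])                    # large share of missing values -> drop the column
df["Age"] = df["Age"].fillna(df["Age"].median())   # numerical -> median (robust to skew)
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])  # categorical -> mode

# Final Verification: every count should now be 0
print(df.isnull().sum())
```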
Clone the repository:
git clone https://github.com/iremcira/Python-Pandas-Titanic-Missing-Data-Tutorial.git
cd Python-Pandas-Titanic-Missing-Data-Tutorial
Download the dataset: Get train.csv from the Kaggle Titanic competition and place it in the project root.
Install dependencies:
pip install -r requirements.txt
Open Jupyter Notebook:
jupyter notebook
Then, open titanic_missing_data_tutorial.ipynb.
After completing this tutorial, you can continue with:
- Feature Engineering: Create new features (e.g., FamilySize, Title).
- Outlier Handling: Identify and treat outliers in numerical columns.
- Data Transformation: Apply scaling or logarithmic transformations.
- Categorical Encoding: Convert categorical features to numerical (One-Hot, Label Encoding).
- Build an ML Model: Use the cleaned data to train a simple classification model (e.g., Logistic Regression) to predict survival (a short sketch follows this list).
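As a rough illustration of the last two items, here is a minimal follow-on sketch. It assumes the cleaned df from the tutorial and scikit-learn installed (not listed in the requirements above); the feature selection and model settings are illustrative choices, not the tutorial's.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Feature Engineering: family size = siblings/spouses + parents/children + self
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Categorical Encoding: one-hot encode Sex and Embarked
features = pd.get_dummies(
    df[["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize"]],
    columns=["Sex", "Embarked"],
    drop_first=True,
)
target = df["Survived"]

# Build an ML Model: simple Logistic Regression baseline
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```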
This tutorial complements DeepCode Ally AI: Python Quick Reference Guide for AI Developers, your go-to resource for Python essentials in AI/ML.
Contributions are welcome! Feel free to open issues or submit pull requests.
This project is licensed under the MIT License.
- GitHub: iremcira
- LinkedIn: İremnur Çıra
- Email: iremcira123@gmail.com