This project involves an Exploratory Data Analysis (EDA) of the Titanic dataset to understand the factors that influenced survival rates among passengers. The analysis covers various aspects, including demographics, ticket class, and cabin locations.
The Titanic dataset used in this project is sourced from Kaggle. The dataset includes information about passengers on the Titanic, such as their age, gender, ticket class, and survival status.
- Loaded the dataset using pandas and explored its basic structure with the `info()` and `head()` methods.
- Identified columns with missing values and decided on appropriate handling strategies.
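
The loading and inspection step might look roughly like the sketch below. The inline CSV holds a few made-up rows standing in for the Kaggle file so the snippet runs on its own; the column names follow the usual Kaggle Titanic layout, and the file name `train.csv` mentioned in the comment is an assumption, not something the notebook confirms.

```python
import io

import pandas as pd

# Made-up rows standing in for the Kaggle CSV (assumption). In the notebook
# this would be something like pd.read_csv("train.csv"); file name assumed.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex,Age,Cabin,Embarked
1,0,3,male,22,,S
2,1,1,female,38,C85,C
3,1,3,female,,,S
4,0,1,male,54,E46,
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

df.info()         # column dtypes and non-null counts
print(df.head())  # first few rows

# Count missing values per column, largest first
missing = df.isna().sum().sort_values(ascending=False)
print(missing[missing > 0])
```

Sorting the missing-value counts makes it easier to see which columns need a handling strategy first.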
- Handled missing values in the 'Age' column by filling them with the median age.
- Handled missing values in the 'Embarked' column by filling them with the most common port of embarkation.
- Documented any additional steps taken to clean the data.
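
A minimal sketch of the two imputation steps above, assuming the `Age` and `Embarked` column names from the dataset. The helper name `clean_titanic` and the demo rows are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def clean_titanic(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Age and Embarked gaps filled (hypothetical helper)."""
    out = df.copy()
    # Median is less sensitive to a few very old or very young passengers than the mean
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # mode() can return several values on a tie; take the first one
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode().iloc[0])
    return out


# Made-up rows for illustration only
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 30.0],
    "Embarked": ["S", "C", None, "S"],
})
cleaned = clean_titanic(demo)
print(cleaned)
```

Working on a copy leaves the raw data untouched, so the original missing-value counts can still be reported later.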
- Visualized the gender distribution and passenger class distribution using countplots.
- Extracted insights into the composition of passengers based on gender and class.
- Analyzed survival rates based on gender using a countplot.
- Noted significant findings and trends, such as the higher survival rate among women.
- Plotted countplots of survival by passenger class and a heatmap of survival by cabin.
- Explored insights into the relationship between ticket class, cabin location, and survival.
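
The survival plots described above could be sketched as follows. The column names `Sex`, `Pclass`, `Survived`, and `Cabin` follow the usual Kaggle layout. How the notebook builds its cabin heatmap is not stated here, so grouping cabins by their first letter (the deck) is an assumption, as are the function name `plot_survival` and the demo rows.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_survival(df: pd.DataFrame):
    """Draw survival countplots by gender and class plus a deck heatmap."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    sns.countplot(data=df, x="Sex", hue="Survived", ax=axes[0])
    axes[0].set_title("Survival by gender")

    sns.countplot(data=df, x="Pclass", hue="Survived", ax=axes[1])
    axes[1].set_title("Survival by passenger class")

    # Assumption: use the first letter of Cabin as the deck; missing -> "Unknown"
    deck = df["Cabin"].str[0].fillna("Unknown")
    counts = pd.crosstab(deck, df["Survived"])
    sns.heatmap(counts, annot=True, fmt="d", cmap="Blues", ax=axes[2])
    axes[2].set_title("Passenger counts by deck and survival")

    fig.tight_layout()
    return fig


# Made-up rows for illustration only
demo = pd.DataFrame({
    "Sex": ["female", "male", "male", "female", "male"],
    "Pclass": [1, 3, 3, 2, 1],
    "Survived": [1, 0, 0, 1, 1],
    "Cabin": ["C85", None, None, "E46", "C123"],
})
fig = plot_survival(demo)
```

Countplots show raw counts rather than rates, so class sizes should be kept in mind when comparing bars.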
- Documented the steps taken in the Jupyter Notebook and provided explanations for each code section.
- Included insights and findings discovered during the analysis.
- Added visualizations to support key points.
- Created a README file to guide users on how to reproduce the analysis.
The EDA provided valuable insights into the factors influencing survival rates on the Titanic. Gender and ticket class emerged as significant factors, with women having higher survival rates and passengers in higher classes having better chances of survival. The analysis also considered cabin locations, although the high number of missing values in the 'Cabin' column limited the depth of exploration in this aspect.
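
Survival rates like those summarized above can be computed directly with a group-by. The sketch below uses a handful of made-up rows, so its numbers do not reflect the real dataset; `survival_rate` is a hypothetical helper name.

```python
import pandas as pd


def survival_rate(df: pd.DataFrame, by: str) -> pd.Series:
    """Share of survivors in each group of column `by`."""
    # Survived is 0/1, so the mean of each group is its survival rate
    return df.groupby(by)["Survived"].mean()


# Made-up rows for illustration only
demo = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 1, 0, 0, 1],
})
by_sex = survival_rate(demo, "Sex")
by_class = survival_rate(demo, "Pclass")
print(by_sex)
print(by_class)
```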
- Further exploration of the 'Cabin' variable, perhaps grouping cabins based on their locations.
- Analysis of age groups to identify specific demographics with higher or lower survival rates.
- Consideration of interactions between different variables and their combined impact on survival.
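
One possible starting point for these follow-ups is sketched below. The deck extraction, the age bin edges and labels, and the helper name `add_groups` are all assumptions chosen for illustration; none of them come from the notebook.

```python
import numpy as np
import pandas as pd


def add_groups(df: pd.DataFrame) -> pd.DataFrame:
    """Add Deck and AgeGroup columns (hypothetical helper)."""
    out = df.copy()
    # Assumption: the first letter of Cabin identifies the deck
    out["Deck"] = out["Cabin"].str[0].fillna("Unknown")
    # Assumed bin edges; right=False makes each bin include its lower edge
    out["AgeGroup"] = pd.cut(
        out["Age"],
        bins=[0, 12, 18, 40, 60, np.inf],
        labels=["child", "teen", "adult", "middle-aged", "senior"],
        right=False,
    )
    return out


# Made-up rows for illustration only
demo = pd.DataFrame({
    "Sex": ["female", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3],
    "Age": [8.0, 30.0, 65.0, 17.0],
    "Cabin": ["C85", None, "E46", None],
    "Survived": [1, 0, 0, 1],
})
grouped = add_groups(demo)

# Interaction between two variables: survival rate by gender and class
interaction = grouped.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)
print(grouped[["Deck", "AgeGroup"]])
print(interaction)
```

A pivot table like this is a simple way to look at two variables at once before moving to more formal modeling.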
- Python 3.x
- Libraries: pandas, numpy, seaborn, matplotlib
- Clone the repository:
git clone https://github.com/your-username/Titanic-EDA-Analysis.git
- Open and run the Jupyter Notebook:
titanic_analysis.ipynb
- Explore the code, visualizations, and documentation to understand the analysis.
Contributions are welcome! Feel free to open issues or submit pull requests.
This project is licensed under the MIT License.