Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration 2 hours in an oral glucose tolerance test
Blood Pressure: Diastolic blood pressure (mm Hg)
Skin Thickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
Diabetes Pedigree Function: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1); 0 means non-diabetic and 1 means diabetic
# Importing the required packages here
import numpy as np
import pandas as pd
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt
import missingno as msno
%matplotlib inline
df = pd.read_csv('diabetes.csv')
df.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
1. Please perform a basic exploration of the data and explain the missing values, the number of rows and columns, and the data types in statistical terms.
No. of columns : 9
No. of rows : 768
Data type of each column : int64 = 6, float64 = 2, category = 1
Pregnancies int64
Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
Outcome category
Missing Values of columns :
Glucose 5
BloodPressure 35
SkinThickness 227
Insulin 374
BMI 11
The dataset contains three types of columns:
int64 = 6
float64 = 2
category = 1
The Outcome column has two values:
0 = False (non-diabetic)
1 = True (diabetic)
Five columns contain missing values: Insulin, SkinThickness, BloodPressure, BMI and Glucose.
The Insulin column has the highest share of missing values, about 48.7% (374 of 768 rows).
df.shape
(768, 9)
df.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')
len(df.columns)
9
len(df)
768
df.dtypes
Pregnancies int64
Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
Outcome int64
dtype: object
df = df.astype({"Outcome":'category'})
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 763 non-null float64
2 BloodPressure 733 non-null float64
3 SkinThickness 541 non-null float64
4 Insulin 394 non-null float64
5 BMI 757 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null float64
8 Outcome 768 non-null category
dtypes: category(1), float64(7), int64(1)
memory usage: 49.0 KB
df.isnull()
df.isnull().sum()
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
cols =['Glucose','BloodPressure','SkinThickness','Insulin','BMI','Age']
A value of zero is physiologically implausible for the columns listed above, so zeros in these columns are treated as missing and replaced with NaN.
df[cols] = df[cols].replace({'0':np.nan, 0:np.nan})
df.isnull().sum()
Pregnancies 0
Glucose 5
BloodPressure 35
SkinThickness 227
Insulin 374
BMI 11
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
percent_missing = df.isnull().sum() * 100 / len(df)
percent_missing.plot.bar(figsize=(20, 10))
plt.title('Percentage_Missing_Values_Per_Column')
Text(0.5, 1.0, 'Percentage_Missing_Values_Per_Column')
msno.bar(df) # missing values across the whole dataframe
<AxesSubplot:>
The histogram and KDE plot show that the Glucose distribution is roughly symmetrical.
The mean is therefore a reasonable measure of central tendency for the Glucose column.
The mean of the Glucose column is 121.69.
x = df['Glucose']
plt.hist(x)
plt.title("Glucose_Column distribution", fontsize= 15)
plt.show()
sns.kdeplot(x, fill=True)  # `fill` replaces the deprecated `shade` argument
plt.title("Glucose_Column distribution", fontsize= 15)
Text(0.5, 1.0, 'Glucose_Column distribution')
print ("Mean Values in the Distribution of Glucose Column")
df['Glucose'].mean()
Mean Values in the Distribution of Glucose Column
121.6867627785059
The Outcome column is categorical (binary: true or false), so the mode is the appropriate measure of central tendency.
The mode of the Outcome column is 0 (False).
This means the majority of people in the dataset are non-diabetic.
print ("Mode of Outcome Column")
df['Outcome'].mode()
Mode of Outcome Column
0 False
Name: Outcome, dtype: category
Categories (2, object): [False, True]
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set_theme(style="whitegrid")
sns.boxplot(data=df,palette="Set3",linewidth=2.5,orient="h",showfliers=False)
<AxesSubplot:>
df.agg('max')
Pregnancies 17
Glucose 199.0
BloodPressure 122.0
SkinThickness 99.0
Insulin 846.0
BMI 67.1
DiabetesPedigreeFunction 2.42
Age 81.0
Outcome True
dtype: object
df.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')
sns.set(rc={'figure.figsize':(4,2)})
sns.boxplot(x='BloodPressure',data =df,linewidth=0.5,orient="h")
<AxesSubplot:xlabel='BloodPressure'>
sns.boxplot(x='Glucose',data =df,linewidth=0.5,orient="h")
<AxesSubplot:xlabel='Glucose'>
sns.boxplot(x='SkinThickness',data =df,linewidth=0.5,orient="h")
<AxesSubplot:xlabel='SkinThickness'>
sns.boxplot(x='Insulin',data =df,linewidth=0.5,orient="h")
<AxesSubplot:xlabel='Insulin'>
sns.boxplot(x='BMI',data =df,linewidth=0.5,orient="h")
<AxesSubplot:xlabel='BMI'>
sns.boxplot(x='DiabetesPedigreeFunction',data =df,linewidth=0.5,orient="h")
<AxesSubplot:xlabel='DiabetesPedigreeFunction'>
sns.boxplot(x='Age',data =df,linewidth=0.5,orient="h",color=".50")
<AxesSubplot:xlabel='Age'>
sns.boxplot(x='Pregnancies',data =df,linewidth=0.5,orient="h",color=".25")
<AxesSubplot:xlabel='Pregnancies'>
sns.lmplot(x="Age", y="Glucose",data=df,fit_reg=True)
<seaborn.axisgrid.FacetGrid at 0x1ce3bf09700>
sns.relplot(x="Age", y="Glucose", kind="line",height=7, data=df)
<seaborn.axisgrid.FacetGrid at 0x1ce3a7dd400>
sns.relplot(x="Age", y="Glucose",hue="Outcome", kind="line",height=10, data=df)
<seaborn.axisgrid.FacetGrid at 0x1ce3c142070>
plt.title(" Distribution of Outcome Variable", fontsize= 15)
df['Outcome'].value_counts().plot.bar(figsize=(5, 2))
<AxesSubplot:title={'center':' Distribution of Outcome Variable'}>
6. Please examine the distribution of the numerical data and explain which variables are normally distributed and which seem to be skewed. Please also state the direction of the skewness.
Approximately normally distributed:
BloodPressure
Glucose (with a slight right skew)
Right skewed:
BMI
SkinThickness
Pregnancies
Age
DiabetesPedigreeFunction
Insulin
df.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')
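vals =['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin','BMI', 'DiabetesPedigreeFunction', 'Age']
# Assumed intent: reshape to long format (one row per column/value pair) before plotting
dfm = df[vals].melt(var_name='cols', value_name='vals')
sns.catplot(x='cols', y='vals', data=dfm, kind='point')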
print(df.hist(bins=100))
[[<AxesSubplot:title={'center':'Pregnancies'}>
<AxesSubplot:title={'center':'Glucose'}>
<AxesSubplot:title={'center':'BloodPressure'}>]
[<AxesSubplot:title={'center':'SkinThickness'}>
<AxesSubplot:title={'center':'Insulin'}>
<AxesSubplot:title={'center':'BMI'}>]
[<AxesSubplot:title={'center':'DiabetesPedigreeFunction'}>
<AxesSubplot:title={'center':'Age'}> <AxesSubplot:>]]
7. Please calculate the skewness values and divide the variables into symmetrical, moderately skewed and highly skewed.
skewValue = df.skew(axis=0, numeric_only=True, skipna=True)
skewValue
Pregnancies 0.901674
Glucose 0.530989
BloodPressure 0.134153
SkinThickness 0.690619
Insulin 2.166464
BMI 0.593970
DiabetesPedigreeFunction 1.919911
Age 1.129597
dtype: float64
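for col, skew in skewValue.items():
    level = "symmetric" if abs(skew) < 0.5 else "moderately skewed" if abs(skew) <= 1 else "highly skewed"
    print(col, "is", level)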
Symmetric :
BloodPressure
Moderately Skewed :
Glucose
BMI
SkinThickness
Pregnancies
Highly Skewed :
Age
DiabetesPedigreeFunction
Insulin
Observation:
· If the absolute value of the skew is less than 0.5, the distribution is fairly symmetric.
· If the absolute value of the skew is between 0.5 and 1, the distribution is moderately skewed.
· If the absolute value of the skew is greater than 1, the distribution is highly skewed.
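The thresholds above can be applied mechanically to reproduce the three groups. The following is a minimal sketch that reuses the skewness values printed earlier; the helper name `classify_skew` is made up for illustration, and assigning the exact boundary values 0.5 and 1 to the moderate group is an assumption, since the rule does not say where they belong.

```python
# Skewness values copied from the df.skew() output shown earlier.
skew_values = {
    "Pregnancies": 0.901674,
    "Glucose": 0.530989,
    "BloodPressure": 0.134153,
    "SkinThickness": 0.690619,
    "Insulin": 2.166464,
    "BMI": 0.593970,
    "DiabetesPedigreeFunction": 1.919911,
    "Age": 1.129597,
}

def classify_skew(skew):
    """Hypothetical helper: bucket one skewness value by its absolute size."""
    magnitude = abs(skew)
    if magnitude < 0.5:
        return "symmetric"
    if magnitude <= 1:  # assumption: boundary values count as moderate
        return "moderately skewed"
    return "highly skewed"

# Collect the column names under each label.
groups = {"symmetric": [], "moderately skewed": [], "highly skewed": []}
for column, skew in skew_values.items():
    groups[classify_skew(skew)].append(column)

for label, columns in groups.items():
    print(f"{label}: {', '.join(columns)}")
```

In the notebook itself, `skewValue.to_dict()` could be passed in place of the hard-coded dictionary.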
8. Please create an appropriate plot to examine the outliers of these variables. Please name the variables which have outliers.
Names of the variables which contain outliers:
* Pregnancies
* BloodPressure
* SkinThickness
* Insulin
* BMI
* DiabetesPedigreeFunction
* Age
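A boxplot marks as outliers the points beyond its whiskers, which by default extend 1.5 × IQR past the quartiles. The list above could be cross-checked numerically with the same rule. This is a minimal sketch: the function name `iqr_outliers` and the sample values are made up for illustration, and dropping missing values before computing the quartiles is an assumption about how the NaNs should be handled.

```python
import numpy as np

def iqr_outliers(values, whisker=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    data = np.asarray(values, dtype=float)
    data = data[~np.isnan(data)]              # ignore missing entries
    q1, q3 = np.percentile(data, [25, 75])
    spread = q3 - q1                          # interquartile range
    lower = q1 - whisker * spread
    upper = q3 + whisker * spread
    return data[(data < lower) | (data > upper)]

# Illustrative sample only (not taken from diabetes.csv).
sample = [21, 22, 25, 29, 33, 41, np.nan, 81]
print(iqr_outliers(sample))
```

With the notebook's DataFrame, something like `{c: len(iqr_outliers(df[c])) for c in vals}` would give an outlier count per numeric column.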
fig, ax = plt.subplots(figsize=(24,24), nrows=3, ncols=3)
sns.boxplot(data=df, y="Pregnancies", ax=ax[0,0])
sns.boxplot(data=df, y="Glucose", ax=ax[0,1])
sns.boxplot(data=df, y="BloodPressure", ax=ax[0,2])
sns.boxplot(data=df, y="SkinThickness", ax=ax[1,0])
sns.boxplot(data=df, y="Insulin", ax=ax[1,1])
sns.boxplot(data=df, y="Age", ax=ax[2,1])
sns.boxplot(data=df, y="BMI", ax=ax[2,0])
sns.boxplot(data=df, y="DiabetesPedigreeFunction", ax=ax[1,2])
<AxesSubplot:ylabel='DiabetesPedigreeFunction'>
stats = ['max', 'min', 'var', 'std', 'median']
outlier_cols = ['Pregnancies', 'Insulin', 'SkinThickness', 'DiabetesPedigreeFunction', 'BMI', 'Age']
df5 = df[outlier_cols].agg(stats)
df5
| | Pregnancies | Insulin | SkinThickness | DiabetesPedigreeFunction | BMI | Age |
|---|---|---|---|---|---|---|
| max | 17.000000 | 846.000000 | 99.000000 | 2.420000 | 67.100000 | 81.000000 |
| min | 0.000000 | 14.000000 | 7.000000 | 0.078000 | 18.200000 | 21.000000 |
| var | 11.354056 | 14107.703775 | 109.767160 | 0.109779 | 47.955463 | 138.303046 |
| std | 3.369578 | 118.775855 | 10.476982 | 0.331329 | 6.924988 | 11.760232 |
| median | 3.000000 | 125.000000 | 29.000000 | 0.372500 | 32.300000 | 29.000000 |
def find_iqr(x):
    return np.subtract(*np.percentile(x, [75, 25]))
df[['Pregnancies','SkinThickness', 'Insulin','BMI', 'DiabetesPedigreeFunction', 'Age']].apply(find_iqr)
Pregnancies 5.0000
SkinThickness NaN
Insulin NaN
BMI NaN
DiabetesPedigreeFunction 0.3825
Age 17.0000
dtype: float64
from scipy.stats import iqr
iqr(df['Insulin'])
nan
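The NaN results above occur because `np.percentile` and `scipy.stats.iqr` (with its default `nan_policy='propagate'`) return NaN as soon as the input contains a missing value, and SkinThickness, Insulin and BMI contain NaN after the zero replacement. A minimal sketch of a NaN-aware variant follows, assuming missing values should simply be ignored; the name `find_iqr_ignoring_nan` and the sample series are illustrative only.

```python
import numpy as np
import pandas as pd

def find_iqr_ignoring_nan(x):
    """Interquartile range computed over the non-missing values only."""
    q75, q25 = np.nanpercentile(x, [75, 25])
    return q75 - q25

# Illustrative values only (not taken from diabetes.csv).
sample = pd.Series([10.0, np.nan, 20.0, 30.0, np.nan, 40.0, 50.0])

print(np.subtract(*np.percentile(sample, [75, 25])))  # NaN propagates
print(find_iqr_ignoring_nan(sample))                  # NaN entries are skipped
print(sample.quantile(0.75) - sample.quantile(0.25))  # pandas skips NaN by default
```

In the notebook this could be applied with `df[outlier_columns].apply(find_iqr_ignoring_nan)`, where `outlier_columns` stands for the column list used above; `iqr(df['Insulin'], nan_policy='omit')` is an equivalent option with SciPy.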