GitHub - goswamimohit/PIMA-INDIAN-Diabetic: Exploratory Data Analysis

Columns of the dataset:

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration 2 hours in an oral glucose tolerance test
Blood Pressure: Diastolic blood pressure (mm Hg)
Skin Thickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2) 
Diabetes Pedigree Function: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1); 0 means non-diabetic and 1 means diabetic

# Importing the required packages here

import numpy as np
import pandas as pd
import seaborn as sns

from datetime import datetime
import matplotlib.pyplot as plt

import missingno as msno
%matplotlib inline

df = pd.read_csv('diabetes.csv')
df.head()


	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

1. Please do the basic exploration of the data and explain missing values, number of rows and columns, and data types in statistical terms.

No. of rows : 768

No. of columns : 9

Data type of each column : int64 = 6, float64 = 2, category = 1

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       category

Missing Values of columns :

Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11

The dataset contains three types of columns:

int64 = 6
float64 = 2
category = 1

The Outcome column takes two values:

 0 = False (non-diabetic)

 1 = True (diabetic)

Five columns contain missing values: Insulin, SkinThickness, BloodPressure, BMI and Glucose.

The Insulin column has the highest share of missing values, about 48.7% (374 of 768 rows).

df.shape

(768, 9)

df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

len(df.columns)

len(df)

df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

df = df.astype({"Outcome":'category'})

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Pregnancies               768 non-null    int64   
 1   Glucose                   763 non-null    float64 
 2   BloodPressure             733 non-null    float64 
 3   SkinThickness             541 non-null    float64 
 4   Insulin                   394 non-null    float64 
 5   BMI                       757 non-null    float64 
 6   DiabetesPedigreeFunction  768 non-null    float64 
 7   Age                       768 non-null    float64 
 8   Outcome                   768 non-null    category
dtypes: category(1), float64(7), int64(1)
memory usage: 49.0 KB

df.isnull()

df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

cols =['Glucose','BloodPressure','SkinThickness','Insulin','BMI','Age']

A value of zero is not physiologically plausible for the columns listed above, so we replace zeros in those columns with NaN.

df[cols] = df[cols].replace({'0':np.nan, 0:np.nan})

df.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

percent_missing = df.isnull().sum() * 100 / len(df)

percent_missing.plot.bar(figsize=(20, 10))
plt.title('Percentage_Missing_Values_Per_Column')

Text(0.5, 1.0, 'Percentage_Missing_Values_Per_Column')

msno.bar(df) # missing values of the whole dataframe

<AxesSubplot:>
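
The zero-to-NaN replacement and the per-column percentage can be combined into one small helper. This is a minimal sketch, not part of the original notebook: the function name `missing_summary` and the toy frame are made up for illustration, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(frame, zero_as_missing):
    """Return per-column missing counts and percentages.

    Zeros in the columns listed in `zero_as_missing` are treated as missing.
    """
    cleaned = frame.copy()
    cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
    counts = cleaned.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts * 100 / len(cleaned)).round(2),
    })
    # Columns with the most missing values come first
    return summary.sort_values("missing", ascending=False)

# Tiny made-up frame (not the real data) to show the behaviour
toy = pd.DataFrame({
    "Pregnancies": [0, 1, 2, 0],   # zero is a valid value here
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 0],
})
summary = missing_summary(toy, ["Glucose", "Insulin"])
print(summary)
```

Passing the list of columns explicitly keeps legitimate zeros (for example in Pregnancies) from being counted as missing.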

Analysis:

2. Calculate appropriate measures of central tendency for the Glucose and Outcome columns only.

The histogram and KDE plot show that the Glucose distribution is roughly symmetrical.

We can therefore use the mean as the measure of central tendency for the Glucose column.

The mean value of the Glucose column is 121.69.

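x = df['Glucose'].dropna()  # drop NaN so the histogram bins are well defined
plt.hist(x)
plt.title("Glucose_Column distribution", fontsize= 15)
plt.show()

sns.kdeplot(x,fill=True)  # `fill` replaces the deprecated `shade` argument
plt.title("Glucose_Column distribution", fontsize= 15)

Text(0.5, 1.0, 'Glucose_Column distribution')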

print ("Mean Values in the Distribution of Glucose Column")
df['Glucose'].mean()

Mean Values in the Distribution of Glucose Column

121.6867627785059

The Outcome column is categorical with two values, True or False, so we use the mode as its measure of central tendency.

The mode of the Outcome column is 0 (False).

This means the majority of people in the dataset are non-diabetic.

print ("Mode of Outcome Column")
df['Outcome'].mode()

Mode of Outcome Column

0    False
Name: Outcome, dtype: category
Categories (2, object): [False, True]
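
One quick way to check whether the mean is a reasonable summary is to compare it with the median: the closer they are, the more symmetric the column. The sketch below uses a small made-up frame rather than the real data, and shows the mode for a two-valued column alongside it.

```python
import pandas as pd

# Made-up values for illustration; one Glucose reading is missing
toy = pd.DataFrame({
    "Glucose": [85.0, 89.0, 137.0, 148.0, 183.0, None],
    "Outcome": [0, 0, 1, 1, 0, 0],
})

glucose_mean = toy["Glucose"].mean()      # NaN is skipped by default
glucose_median = toy["Glucose"].median()  # middle value of the non-missing data
outcome_mode = toy["Outcome"].mode()[0]   # most frequent class

print("mean:", glucose_mean, "median:", glucose_median, "mode:", outcome_mode)
```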

3. Please provide five-point data summaries for the required columns.

sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set_theme(style="whitegrid")

sns.boxplot(data=df,palette="Set3",linewidth=2.5,orient="h",showfliers=False)

<AxesSubplot:>

df.agg('max')

Pregnancies                    17
Glucose                     199.0
BloodPressure               122.0
SkinThickness                99.0
Insulin                     846.0
BMI                          67.1
DiabetesPedigreeFunction     2.42
Age                          81.0
Outcome                      True
dtype: object

df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

sns.set(rc={'figure.figsize':(4,2)})

sns.boxplot(x='BloodPressure',data =df,linewidth=0.5,orient="h")

<AxesSubplot:xlabel='BloodPressure'>

sns.boxplot(x='Glucose',data =df,linewidth=0.5,orient="h")

<AxesSubplot:xlabel='Glucose'>

sns.boxplot(x='SkinThickness',data =df,linewidth=0.5,orient="h")

<AxesSubplot:xlabel='SkinThickness'>

sns.boxplot(x='Insulin',data =df,linewidth=0.5,orient="h")

<AxesSubplot:xlabel='Insulin'>

sns.boxplot(x='BMI',data =df,linewidth=0.5,orient="h")

<AxesSubplot:xlabel='BMI'>

sns.boxplot(x='DiabetesPedigreeFunction',data =df,linewidth=0.5,orient="h")

<AxesSubplot:xlabel='DiabetesPedigreeFunction'>

sns.boxplot(x='Age',data =df,linewidth=0.5,orient="h",color=".50")

<AxesSubplot:xlabel='Age'>

sns.boxplot(x='Pregnancies',data =df,linewidth=0.5,orient="h",color=".25")

<AxesSubplot:xlabel='Pregnancies'>
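
The box plots show the five-number summary graphically; the same five values (minimum, first quartile, median, third quartile, maximum) can also be tabulated. This is a minimal sketch on made-up numbers, and the helper name `five_number_summary` is an assumption rather than something from the notebook.

```python
import pandas as pd

def five_number_summary(frame):
    """Min, Q1, median, Q3 and max for each numeric column (NaN ignored)."""
    numeric = frame.select_dtypes("number")
    summary = numeric.quantile([0, 0.25, 0.5, 0.75, 1.0])
    summary.index = ["min", "Q1", "median", "Q3", "max"]
    return summary

# Made-up values for illustration only
toy = pd.DataFrame({
    "Glucose": [85, 89, 137, 148, 183],
    "Age": [21, 31, 32, 33, 50],
})
summary = five_number_summary(toy)
print(summary)
```

On the notebook's dataframe the equivalent call would be `five_number_summary(df)`; `df.describe()` reports the same quantiles together with the count, mean and standard deviation.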

4. Please create an appropriate plot to examine the relationship between Age and Glucose.

sns.lmplot(x="Age", y="Glucose",data=df,fit_reg=True)

<seaborn.axisgrid.FacetGrid at 0x1ce3bf09700>

sns.relplot(x="Age", y="Glucose", kind="line",height=7, data=df)

<seaborn.axisgrid.FacetGrid at 0x1ce3a7dd400>

sns.relplot(x="Age", y="Glucose",hue="Outcome", kind="line",height=10, data=df)

<seaborn.axisgrid.FacetGrid at 0x1ce3c142070>

sns.relplot(x="Age", y="Glucose", hue="Outcome", kind="scatter", data=df);
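
A plot shows the shape of the relationship; a correlation coefficient puts a number on its strength. The sketch below computes Pearson's r on a made-up frame (not the real data), relying on `Series.corr`, which drops rows where either value is missing.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; the last Glucose reading is missing
toy = pd.DataFrame({
    "Age": [21, 31, 32, 33, 50, 45],
    "Glucose": [89.0, 85.0, 183.0, 137.0, 148.0, np.nan],
})

# Pearson correlation; incomplete pairs are excluded automatically
r = toy["Age"].corr(toy["Glucose"])
print(f"Pearson r between Age and Glucose (toy data): {r:.3f}")
```

Values near 0 suggest a weak linear relationship, values near 1 or -1 a strong one.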

5. Please create an appropriate plot to see the distribution of the Outcome variable.

plt.title(" Distribution of Outcome Variable", fontsize= 15)
df['Outcome'].value_counts().plot.bar(figsize=(5, 2))

<AxesSubplot:title={'center':' Distribution of Outcome Variable'}>
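
Alongside the bar chart, the class balance can be expressed as proportions. A minimal sketch on made-up labels (not the real Outcome column):

```python
import pandas as pd

# Made-up labels for illustration: 0 = non-diabetic, 1 = diabetic
outcome = pd.Series([0, 0, 1, 0, 1, 0, 0, 1], name="Outcome")

# normalize=True returns fractions instead of raw counts
proportions = outcome.value_counts(normalize=True)
print(proportions)
```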

6. Please examine the distribution of the numerical data and explain which variables are normally distributed and which seem to be skewed. Please also state the direction of skewness.

Normally Distributed:

BloodPressure
Glucose (approximately normal, with a slight right skew)

Right Skewed :

BMI
SkinThickness
Pregnancies
Age
DiabetesPedigreeFunction
Insulin

df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

vals =['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin','BMI', 'DiabetesPedigreeFunction', 'Age']

dfm = df[vals].melt(var_name='cols', value_name='vals')  # long format: one row per (column, value)
sns.catplot(x='cols', y='vals', data=dfm, kind='point')

print(df.hist(bins=100))

[[<AxesSubplot:title={'center':'Pregnancies'}>
  <AxesSubplot:title={'center':'Glucose'}>
  <AxesSubplot:title={'center':'BloodPressure'}>]
 [<AxesSubplot:title={'center':'SkinThickness'}>
  <AxesSubplot:title={'center':'Insulin'}>
  <AxesSubplot:title={'center':'BMI'}>]
 [<AxesSubplot:title={'center':'DiabetesPedigreeFunction'}>
  <AxesSubplot:title={'center':'Age'}> <AxesSubplot:>]]

7. Please calculate the skewness value and divide variables into symmetrical, moderately skewed and highly skewed.

skewValue = df.skew(axis=0,numeric_only=True,skipna=True)

skewValue

Pregnancies                 0.901674
Glucose                     0.530989
BloodPressure               0.134153
SkinThickness               0.690619
Insulin                     2.166464
BMI                         0.593970
DiabetesPedigreeFunction    1.919911
Age                         1.129597
dtype: float64

for name, skew in skewValue.items():
    if abs(skew) < 0.5:
        print(name, "is fairly symmetric")

Symmetric :

BloodPressure

Moderately Skewed :

Glucose
BMI
SkinThickness
Pregnancies

Highly Skewed :

Age
DiabetesPedigreeFunction
Insulin

Observation:

· If the absolute value of skew is less than 0.5, the distribution is fairly symmetric.

· If the absolute value of skew is between 0.5 and 1, the distribution is moderately skewed.

· If the absolute value of skew is greater than 1, the distribution is highly skewed.
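
The three rules above can be applied mechanically. The sketch below feeds in the skewness values printed earlier in this section; the function name `classify_skew` is made up for illustration.

```python
import pandas as pd

def classify_skew(skew_values):
    """Label each skewness value using the 0.5 and 1 cut-offs on |skew|."""
    def label(value):
        magnitude = abs(value)
        if magnitude < 0.5:
            return "symmetric"
        if magnitude <= 1:
            return "moderately skewed"
        return "highly skewed"
    return skew_values.apply(label)

# Skewness values copied from the df.skew() output above
skew_values = pd.Series({
    "Pregnancies": 0.901674,
    "Glucose": 0.530989,
    "BloodPressure": 0.134153,
    "SkinThickness": 0.690619,
    "Insulin": 2.166464,
    "BMI": 0.593970,
    "DiabetesPedigreeFunction": 1.919911,
    "Age": 1.129597,
})

labels = classify_skew(skew_values)
for group, names in labels.groupby(labels).groups.items():
    print(group, list(names))
```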

8. Please create an appropriate plot to examine the outliers of these variables. Please name the variables which have outliers.

Names of the variables which contain outliers:

* Pregnancies

* BloodPressure

* SkinThickness

* Insulin

* BMI

* DiabetesPedigreeFunction

* Age

fig, ax = plt.subplots(figsize=(24,24), nrows=3, ncols=3)
sns.boxplot(data=df, y="Pregnancies", ax=ax[0,0])
sns.boxplot(data=df, y="Glucose", ax=ax[0,1])
sns.boxplot(data=df, y="BloodPressure", ax=ax[0,2])
sns.boxplot(data=df, y="SkinThickness", ax=ax[1,0])
sns.boxplot(data=df, y="Insulin", ax=ax[1,1])
sns.boxplot(data=df, y="Age", ax=ax[2,1])
sns.boxplot(data=df, y="BMI", ax=ax[2,0])
sns.boxplot(data=df, y="DiabetesPedigreeFunction", ax=ax[1,2])

<AxesSubplot:ylabel='DiabetesPedigreeFunction'>
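
Box plots conventionally draw points as outliers when they fall more than 1.5 times the IQR beyond the quartiles. The same rule can be applied numerically to count outliers per column. This is a minimal sketch on made-up values; the helper name `count_iqr_outliers` and the factor `k` are assumptions for illustration.

```python
import pandas as pd

def count_iqr_outliers(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons with NaN are False, so missing values are never counted
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Made-up values for illustration only
toy = pd.DataFrame({
    "Insulin": [15, 20, 25, 30, 800],        # 800 lies far above the upper fence
    "BloodPressure": [60, 66, 72, 74, 80],   # no value beyond the fences
})
counts = count_iqr_outliers(toy)
print(counts)
```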

9. What should be the measures of central tendency and dispersion for skewed data?

For skewed data, the median is the appropriate measure of central tendency and the interquartile range (IQR) the appropriate measure of dispersion, because both are robust to extreme values.

skewed_cols = ['Pregnancies', 'Insulin', 'SkinThickness', 'DiabetesPedigreeFunction', 'BMI', 'Age']
# Same statistics for every skewed column; the string 'median' skips NaN like the other aggregations
df5 = df[skewed_cols].agg(['max', 'min', 'var', 'std', 'median'])

df5


	Pregnancies	Insulin	SkinThickness	DiabetesPedigreeFunction	BMI	Age
max	17.000000	846.000000	99.000000	2.420000	67.100000	81.000000
min	0.000000	14.000000	7.000000	0.078000	18.200000	21.000000
var	11.354056	14107.703775	109.767160	0.109779	47.955463	138.303046
std	3.369578	118.775855	10.476982	0.331329	6.924988	11.760232
median	3.000000	125.000000	29.000000	0.372500	32.300000	29.000000

def find_iqr(x):
  return np.subtract(*np.percentile(x, [75, 25]))
df[['Pregnancies','SkinThickness', 'Insulin','BMI', 'DiabetesPedigreeFunction', 'Age']].apply(find_iqr)

Pregnancies                  5.0000
SkinThickness                   NaN
Insulin                         NaN
BMI                             NaN
DiabetesPedigreeFunction     0.3825
Age                         17.0000
dtype: float64

from scipy.stats import iqr

iqr(df['Insulin'])

nan
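
The NaN results above come from `np.percentile`, which does not skip missing values, so any column containing NaN yields NaN. A NaN-aware variant could look like the sketch below; it uses `np.nanpercentile` on a made-up frame, and the name `find_iqr_skipna` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def find_iqr_skipna(x):
    """Interquartile range that ignores NaN entries."""
    q75, q25 = np.nanpercentile(x, [75, 25])
    return q75 - q25

# Made-up values for illustration; Insulin has two missing readings
toy = pd.DataFrame({
    "Insulin": [94.0, np.nan, 168.0, 88.0, np.nan, 543.0],
    "Age": [21.0, 33.0, 50.0, 31.0, 32.0, 21.0],
})

naive = np.subtract(*np.percentile(toy["Insulin"], [75, 25]))  # NaN propagates
robust = toy.apply(find_iqr_skipna)                            # NaN ignored
print("naive:", naive)
print(robust)
```

`scipy.stats.iqr` accepts `nan_policy='omit'` for the same purpose, and pandas' own `quantile` skips NaN by default.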
