---
title : Exploratory analysis in Python using Pandas
description : We start with the first step of data analysis - exploratory data analysis.

--- type:NormalExercise lang:python xp:100 skills:2 key:af2f6f90f3

## Case study - Who is eligible for a loan?

### Introduction - Analytics Vidhya (AV) DataHack

At Analytics Vidhya, we are building a knowledge platform for data science professionals across the globe. Among other things, we host several hackathons for our community on our DataHack platform. The case study for today's problem is one of the practice problems on our platform. You can check out the practice problem here.

### The case study - Dream Housing Finance

Dream Housing Finance company deals in all kinds of home loans. It has a presence across urban, semi-urban and rural areas. Customers first apply for a home loan, after which the company validates the customer's eligibility. The company wants to automate the loan eligibility process (in real time) based on the customer details provided while filling out the online application form.

Let's start by loading the training and test sets into your Python environment. You will use the training set to build your model and the test set to validate it. Both files are stored on the web as CSV files; their URLs are already available as character strings in the sample code.

You can load this data with the pandas.read_csv() function, which reads a CSV file into a Python DataFrame. In simple words, a DataFrame can be thought of as the equivalent of a spreadsheet or a SQL table.
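
As a quick sanity check (a minimal sketch, not part of the graded exercise), you can confirm that read_csv() indeed returns a DataFrame and inspect its dimensions and column types:

# Minimal sketch: verify what read_csv() returns
import pandas as pd

df = pd.read_csv("https://s3-ap-southeast-1.amazonaws.com/av-datahack-datacamp/train.csv")
print(type(df))   # <class 'pandas.core.frame.DataFrame'>
print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # the data type pandas inferred for each column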

*** =instructions

  • train.head(n) returns the top n observations of the train DataFrame. Use it to print the top 5 observations of train.
  • len(DataFrame) returns the total number of observations. Store the number of observations in the train data in the variable train_length.
  • DataFrame.columns returns the column headings of the data set. Store the number of columns of the test data set in the variable test_col.

*** =hint

  • Use len(dataframe) to get the total number of observations
  • Use len(dataframe.columns) to get the total number of columns

*** =pre_exercise_code


# The pre-exercise code initializes the user's workspace before the exercise starts.

# Import library pandas
import pandas as pd

# Import train file
train = pd.read_csv("https://s3-ap-southeast-1.amazonaws.com/av-datahack-datacamp/train.csv")

# Import test file
test = pd.read_csv("https://s3-ap-southeast-1.amazonaws.com/av-datahack-datacamp/test.csv")

*** =sample_code


# import library pandas
import pandas as pd

# Import training data as train
train = pd.read_csv("https://s3-ap-southeast-1.amazonaws.com/av-datahack-datacamp/train.csv")

# Import testing data as test
test = pd.read_csv("https://s3-ap-southeast-1.amazonaws.com/av-datahack-datacamp/test.csv")

# Print the top 5 observations of the train data set
print(train.____())

# Store the total number of observations in the training data set
train_length = len(_____)

# Store the total number of columns in the testing data set
test_col = len(test._____)

*** =solution


import pandas as pd

# Import training data as train
train = pd.read_csv("https://s3-ap-southeast-1.amazonaws.com/av-datahack-datacamp/train.csv")

# Import testing data as test
test = pd.read_csv("https://s3-ap-southeast-1.amazonaws.com/av-datahack-datacamp/test.csv")

# Print the top 5 observations of the train data set
print(train.head(5))

# Store the total number of observations in the training data set
train_length = len(train)

# Store the total number of columns in the testing data set
test_col = len(test.columns)

*** =sct

# The sct section defines the Submission Correctness Tests (SCTs) used to
# evaluate the student's response. All functions used here are defined in the
# pythonwhat Python package. Documentation can also be found at github.com/datacamp/pythonwhat/wiki

# Test for evaluating top 5 heading of dataframe
test_function("print", incorrect_msg = "Don't forget to print the first 5 observations of `train`!")

# Test for total observation in training dataset
test_object("train_length", incorrect_msg = "Don't forget to store the length of `train` in train_length")

# Test for total columns in testing dataset
test_object("test_col", incorrect_msg = "Don't forget to store the number of columns of `test` in test_col")

success_msg("Great work! Let us look at the data more closely")

--- type:NormalExercise lang:python xp:100 skills:2 key:36c3190b26

## Understanding the Data

You can look at a summary of the numerical fields by using dataframe.describe(). It provides the count, mean, standard deviation (std), min, quartiles and max in its output.

dataframe.describe()

For non-numeric values (e.g. Property_Area, Credit_History etc.), we can look at the frequency distribution instead. A frequency table can be printed with either of the following commands:

df['column_name'].value_counts()
OR
df.column_name.value_counts()
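
To make these two commands concrete, here is a minimal sketch on a small, made-up DataFrame (the toy column names below are hypothetical and not taken from the loan data):

# Hypothetical toy data, only to illustrate describe() and value_counts()
import pandas as pd

toy = pd.DataFrame({"income": [4500, 3000, 6000, 3000],
                    "area": ["Urban", "Rural", "Urban", "Urban"]})

print(toy.describe())              # count, mean, std, min, quartiles and max of 'income'
print(toy["area"].value_counts())  # 'Urban' appears 3 times, 'Rural' once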

*** =instructions

  • Use dataframe.describe() to understand the distribution of numerical variables
  • Look at the unique values of non-numeric variables using df['column_name'].value_counts()

*** =hint

  • Store the output of train.describe() in a variable df
  • Use train.Property_Area.value_counts() to look at the frequency distribution

*** =pre_exercise_code


# The pre-exercise code initializes the user's workspace before the exercise starts.

# Import library pandas
import pandas as pd

# Import training file
train = pd.read_csv("https://s3-ap-southeast-1.amazonaws.com/av-datahack-datacamp/train.csv")

# Import testing file
test = pd.read_csv("https://s3-ap-southeast-1.amazonaws.com/av-datahack-datacamp/test.csv")

*** =sample_code


# Training and testing data sets are loaded in the train and test DataFrames respectively

# Look at the summary of numerical variables for the train data set
df = train.________()
print(df)

# Print the unique values and their frequency for the variable Property_Area
df1 = train.Property_Area.________()
print(df1)

*** =solution


# Look at the summary of numerical variables for the train data set
df = train.describe()
print(df)

# Print the unique values and their frequency for the variable Property_Area
df1 = train.Property_Area.value_counts()
print(df1)

*** =sct

# The sct section defines the Submission Correctness Tests (SCTs) used to
# evaluate the student's response. All functions used here are defined in the
# pythonwhat Python package. Documentation can also be found at github.com/datacamp/pythonwhat/wiki

# Test for describe
test_function("train.describe", not_called_msg = "Did you call the right function with train dataset to see numerical summary?")
# Test for value_counts
test_function("train.Property_Area.value_counts", not_called_msg = "Did you call the right function with train dataset to see frequency table of 'Property_Area'?")

success_msg("Great work!")

--- type:NormalExercise lang:python xp:100 skills:2, 4 key:85c5d3a079

## Understanding the distribution of numerical variables

Now that we are familiar with the basic data characteristics, let us study the distribution of the numerical variables, beginning with ApplicantIncome.

We can plot the histogram of ApplicantIncome using the following command:

train['ApplicantIncome'].hist(bins=50)
Or
train.ApplicantIncome.hist(bins=50)

Next, we can also look at box plots to understand the distributions. A box plot for ApplicantIncome can be plotted with:

train.boxplot(column='ApplicantIncome')
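
If you run these commands outside this course environment, the plots may not show up on their own; a minimal sketch, assuming matplotlib is installed, would be:

# Minimal sketch: display the plots in a plain Python session (assumes matplotlib is available)
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("https://s3-ap-southeast-1.amazonaws.com/av-datahack-datacamp/train.csv")

# Histogram of ApplicantIncome with 50 bins
train['ApplicantIncome'].hist(bins=50)
plt.show()

# Box plot of ApplicantIncome
train.boxplot(column='ApplicantIncome')
plt.show()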

*** =instructions

  • Use hist() to plot a histogram
  • Use by=categorical_variable with the box plot to look at the distribution by category, for example:
train.boxplot(column='ApplicantIncome', by='Gender')

*** =hint

  • Use dataframe.columnname1.hist() to plot a histogram
  • Use dataframe.boxplot(column='columnname2', by='columnname3') to get box plots split by the categories of a categorical variable

*** =pre_exercise_code


# The pre-exercise code initializes the user's workspace before the exercise starts.

# Import library pandas
import pandas as pd

# Import training file
train = pd.read_csv("https://s3-ap-southeast-1.amazonaws.com/av-datahack-datacamp/train.csv")

# Import testing file
test = pd.read_csv("https://s3-ap-southeast-1.amazonaws.com/av-datahack-datacamp/test.csv")

*** =sample_code


# Training and testing data sets are loaded in the train and test DataFrames respectively

# Plot a histogram for the variable LoanAmount
train.LoanAmount._____

# Plot a box plot for the variable LoanAmount by the variable Gender in the training data set
train._______(column='LoanAmount', by='Gender')

*** =solution



# Training and testing data sets are assumed to be loaded in the train and test DataFrames respectively

# Plot a histogram for the variable LoanAmount
train.LoanAmount.hist()

# Plot a box plot for the variable LoanAmount by the variable Gender in the training data set
train.boxplot(column='LoanAmount', by='Gender')

*** =sct

# The sct section defines the Submission Correctness Tests (SCTs) used to
# evaluate the student's response. All functions used here are defined in the
# pythonwhat Python package. Documentation can also be found at github.com/datacamp/pythonwhat/wiki

# Test for evaluating histogram
test_function("train.LoanAmount.hist", not_called_msg = "Did you call the right function to plot histogram?")

# Test for evaluating box plot
test_function("train.boxplot", not_called_msg = "Did you call the right function for boxplot?")

success_msg("Great work!")

--- type:NormalExercise lang:python xp:100 skills:2, 4 key:708e937aea

## Understanding the distribution of categorical variables

We have looked at the distributions of ApplicantIncome and LoanAmount; now it's time to look at the categorical variables in more detail. For instance, let's see whether Gender affects the loan status or not. This can be tested using cross-tabulation, as shown below:

pd.crosstab(train['Gender'], train['Loan_Status'], margins=True)

Next, looking at proportions can be more intuitive for drawing quick insights. We can do this using the apply function. You can read more about the crosstab and apply functions here.


def percentageConvert(ser):
  # Divide every entry in the row by the row total (the 'All' margin, i.e. the last entry)
  return ser / float(ser.iloc[-1])

pd.crosstab(train["Gender"], train["Loan_Status"], margins=True).apply(percentageConvert, axis=1)
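
As an aside, recent pandas versions can compute the same row-wise proportions directly through crosstab's normalize argument, avoiding the explicit apply (sketched here without the margins, so each row sums to 1):

# Alternative sketch: let crosstab normalize each row (requires a pandas version that supports normalize)
pd.crosstab(train["Gender"], train["Loan_Status"], normalize='index')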

*** =instructions

  • Use value_counts() with train['Loan_Status'] to look at the frequency distribution
  • Use crosstab with Loan_Status and Credit_History to perform bi-variate analysis

*** =hint

  • train['Loan_Status'].value_counts() returns the frequency of each category of a categorical variable

*** =pre_exercise_code


# The pre-exercise code initializes the user's workspace before the exercise starts.

# Import library pandas
import pandas as pd

# Import training file
train = pd.read_csv("https://s3-ap-southeast-1.amazonaws.com/av-datahack-datacamp/train.csv")

# Import testing file
test = pd.read_csv("https://s3-ap-southeast-1.amazonaws.com/av-datahack-datacamp/test.csv")

*** =sample_code


# Training and testing data sets are loaded in the train and test DataFrames respectively

# Approved loans in absolute numbers
loan_approval = train['Loan_Status'].________()['Y']

# Two-way comparison: Credit_History and Loan_Status
twowaytable = pd.________(train["Credit_History"], train["Loan_Status"], margins=True)



*** =solution


# Training and testing data sets are assumed to be loaded in the train and test DataFrames respectively

# Approved loans in absolute numbers
loan_approval = train['Loan_Status'].value_counts()['Y']

# Two-way comparison: Credit_History and Loan_Status
twowaytable = pd.crosstab(train["Credit_History"], train["Loan_Status"], margins=True)

*** =sct

# The sct section defines the Submission Correctness Tests (SCTs) used to
# evaluate the student's response. All functions used here are defined in the
# pythonwhat Python package. Documentation can also be found at github.com/datacamp/pythonwhat/wiki

# Test for Approved Loan in absolute numbers
test_object("loan_approval", incorrect_msg='Did you look at the frequency distribution?',undefined_msg='Did you look at the frequency distribution?')


# Test for two-way comparison Credit_History and Loan_Status
test_object("twowaytable", incorrect_msg='Did you use the right function to generate two way table?', undefined_msg='Did you use the right function to generate two way table?')


success_msg("Great work!")