Data Preparation.md

Tidy Data

  • Tidy datasets have a specific structure that makes them easy to manipulate, model, and visualize.
  • Data normalization: each variable is a column, each observation is a row, and each type of observational unit is a table.
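As a rough illustration of the tidy structure, the sketch below reshapes a small wide table into one row per observation with `DataFrame.melt`. The table, its column names (`patient`, `visit_1`, `visit_2`), and the values are hypothetical and are not taken from the course data.

```python
import pandas as pd

# Hypothetical wide table: one row per patient, one column per visit.
wide = pd.DataFrame({
    "patient": ["a", "b"],
    "visit_1": [5.1, 6.0],
    "visit_2": [5.4, 6.2],
})

# Melt so that each row is one observation (patient, visit)
# and each variable (patient, visit, glucose) is its own column.
tidy = wide.melt(id_vars="patient", var_name="visit", value_name="glucose")
print(tidy)
```

The wide form stores the variable "visit" in the column headers; the melted form moves it into a column, which is usually what grouping and plotting functions expect.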

Getting Data

  • Data Rule #1 - The closer the data is to what you are predicting, the better.
  • Data Rule #2 - The data will never be in the format you need; use pandas to reshape it.
  • github.com/JerryKurata/MachineLearningWithPython

Loading, Cleaning, and Inspecting Data

Import Libs

  • import pandas as pd ### pandas is a dataframe library
  • import matplotlib.pyplot as plt ### matplotlib.pyplot plots data
  • import numpy as np ### numpy provides N-dimensional object support

Do plotting inline (in a Jupyter notebook) instead of in a separate window

  • %matplotlib inline

Load and review data

  • df = pd.read_csv("./data/pima-data.csv")
  • df.shape ### structure: number of rows and columns
  • df.head(5) ### first 5 rows of the data
  • df.tail(5) ### last 5 rows of the data
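The load-and-review steps above can be sketched as follows. To keep the example self-contained it reads from an in-memory string rather than `./data/pima-data.csv`; the column names and values are made up for illustration and may not match the real file.

```python
import io
import pandas as pd

# Stand-in for ./data/pima-data.csv; these columns and values are hypothetical.
csv_text = """num_preg,glucose_conc,diabetes
1,85,False
8,183,True
0,137,True
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (rows, columns)
print(df.head(2))  # first rows
print(df.tail(2))  # last rows
print(df.dtypes)   # inferred type of each column
```

Checking `df.dtypes` right after loading is a cheap way to spot columns that were read as text when numbers were expected.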

A feature can be a single column or a combination of multiple columns.

Columns to eliminate

  • Not used
  • No values
  • Duplicates
  • Correlated columns
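A minimal sketch of the first three elimination cases, assuming a small hypothetical frame (the column names `id`, `empty`, `a`, and `a_copy` are invented for the example). Correlated columns need a correlation check and are not handled here.

```python
import pandas as pd

# Hypothetical frame with an unused, an empty, and a duplicated column.
df = pd.DataFrame({
    "id": [1, 2, 3],              # assumed not used by the model
    "empty": [None, None, None],  # no values
    "a": [10, 20, 30],
    "a_copy": [10, 20, 30],       # exact duplicate of "a"
})

# Not used: drop by name.
df = df.drop(columns=["id"])

# No values: drop columns where every entry is missing.
df = df.dropna(axis="columns", how="all")

# Duplicates: transpose so columns become rows, then keep the first of each.
df = df.loc[:, ~df.T.duplicated()]

print(df.columns.tolist())
```

Transposing copies the data, so the duplicate check shown here is only practical for frames of modest size.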

Pandas and matplotlib

  • df.isnull().values.any() ### check for null values
  • Use matplotlib to plot the correlation matrix and find correlated columns
  • df.corr() ### pairwise correlation between columns
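The null check and correlation steps can be combined into a short sketch like the one below. The frame is hypothetical (`thickness` is constructed as `skin` in different units so the two are perfectly correlated), and the 0.99 cutoff is an assumption, not a value from the notes.

```python
import pandas as pd

# Hypothetical frame: "thickness" is "skin" rescaled, so they correlate perfectly.
df = pd.DataFrame({
    "skin": [1.0, 2.0, 3.0, 4.0],
    "thickness": [25.4, 50.8, 76.2, 101.6],
    "age": [50, 31, 32, 21],
})

# True if any cell in the frame is null.
has_nulls = df.isnull().values.any()

# Pairwise Pearson correlation between numeric columns.
corr = df.corr()

# For each column, look only at the columns before it; mark it for removal
# if it is (almost) perfectly correlated with one of them.
threshold = 0.99  # assumed cutoff
to_drop = [
    col for i, col in enumerate(corr.columns)
    if (corr.iloc[:i][col].abs() > threshold).any()
]
df_reduced = df.drop(columns=to_drop)

print(has_nulls, to_drop)
```

To inspect the matrix visually, `plt.matshow(corr)` from matplotlib is one option; strongly correlated pairs show up as off-diagonal cells with the same color as the diagonal.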

Molding data

  • Adjusting data types
  • Creating new columns if required
  • Encoding categories as enums or as 1s and 0s
  • Data Rule #3 - Accurately predicting rare events is difficult
  • Data Rule #4 - Track how you manipulate the data
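As a sketch of the molding step, the example below converts a boolean column to 1s and 0s and then checks how often the positive class occurs, which relates to Data Rule #3. The `diabetes` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical target column stored as booleans.
df = pd.DataFrame({"diabetes": [True, False, False, True, False]})

# Adjust the data type: map True/False to 1/0.
diabetes_map = {True: 1, False: 0}
df["diabetes"] = df["diabetes"].map(diabetes_map)

# Check the class balance; a very small positive share signals a rare event.
num_true = int((df["diabetes"] == 1).sum())
pct_true = num_true / len(df) * 100
print(f"positive cases: {num_true} ({pct_true:.1f}%)")
```

Keeping each transformation as an explicit, named step in a notebook or script is one simple way to follow Data Rule #4.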