- Tidy datasets are easy to manipulate, model, and visualize, and they have a specific structure.
- Data Normalization: each variable is a column, each observation is a row, and each type of observational unit is a table.
- Data Rule #1 - The closer the data is to what you are predicting, the better!
- Data Rule #2 - The data will never be in the format you need - use pandas to reshape it.
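
As a rough illustration of Data Rule #2 and the tidy structure described above, the sketch below reshapes a small "wide" table into one row per observation with `melt`. The table, its column names (`patient`, `visit_1`, `visit_2`, `glucose`), and its values are invented for the example and are not from the course data.

```python
import pandas as pd

# Hypothetical "wide" table: one row per patient, one column per visit.
wide = pd.DataFrame({
    "patient": ["a", "b"],
    "visit_1": [5.1, 6.0],
    "visit_2": [5.4, 6.2],
})

# melt() turns the visit columns into rows, so each row is one observation
# (one patient at one visit) and each variable gets its own column.
tidy = wide.melt(id_vars="patient", var_name="visit", value_name="glucose")

print(tidy)
```

After the reshape, `visit` and `glucose` are ordinary columns, so filtering, grouping, and plotting no longer depend on how many visit columns the original file had.
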
- github.com/JerryKurata/MachineLearningWithPython
- import pandas as pd ### pandas is a dataframe library
- import matplotlib.pyplot as plt ### matplotlib.pyplot plots data
- import numpy as np ### numpy provides N-dimensional object support
- %matplotlib inline ### Jupyter magic: show plots inline in the notebook
- df = pd.read_csv("./data/pima-data.csv") ### load the CSV into a dataframe
- df.shape ### structure: number of rows and columns
- df.head(5) ### start of data
- df.tail(5) ### tail of data
- Not used
- No values
- Duplicates
- Correlated columns
- df.isnull().values.any() ### check for null values
- use matplotlib to plot correlations and find correlated columns
- df.corr() ### pairwise correlation between columns
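
The null check and correlation step can also be done numerically instead of by reading a plot. The sketch below is one possible way, not the course's method: the dataframe is made up, the `skin` column is built to duplicate `thickness` in different units, and the 0.95 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Made-up frame; "skin" is "thickness" in different units, so the two
# columns are perfectly correlated by construction.
df = pd.DataFrame({
    "age": [21, 35, 50, 42, 29],
    "thickness": [10.0, 20.0, 15.0, 30.0, 25.0],
    "glucose": [85, 140, 110, 160, 95],
})
df["skin"] = df["thickness"] * 0.0394

# True if any cell in the frame is null.
has_nulls = bool(df.isnull().values.any())

corr = df.corr()
# Keep only the upper triangle so each pair appears once and the
# diagonal (a column against itself) is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # assumed cut-off for "too correlated"
redundant = [col for col in upper.columns if (upper[col].abs() > threshold).any()]

# Drop the later column of each highly correlated pair.
df_reduced = df.drop(columns=redundant)
print(has_nulls, redundant)
```

Which column of a correlated pair to keep is a judgment call; this sketch simply drops the one that appears later in the frame.
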
- adjusting data types
- creating new columns if required
- encoding categories as enums or as 1s and 0s
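
A minimal sketch of these molding steps, assuming a hypothetical boolean `diabetes` column and a hypothetical numeric `age` column (names and values are illustrative only):

```python
import pandas as pd

# Illustrative frame; column names and values are assumptions.
df = pd.DataFrame({
    "age": ["21", "35", "50", "42"],          # numbers stored as text
    "diabetes": [True, False, False, True],   # boolean outcome
})

# Adjust a data type: text to integer.
df["age"] = df["age"].astype(int)

# Create a new column if required (here a simple derived flag).
df["over_40"] = df["age"] > 40

# Encode True/False as 1/0 with an explicit mapping.
diabetes_map = {True: 1, False: 0}
df["diabetes"] = df["diabetes"].map(diabetes_map)

print(df.dtypes)
```

An explicit mapping dictionary keeps the encoding visible in the notebook, which also helps with tracking how the data was changed.
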
- Data Rule #3 - Accurately predicting rare events is difficult
- Data Rule #4 - Track how you manipulate the data
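
In connection with Data Rule #3, one quick check before modeling is how rare the outcome being predicted actually is. The sketch below assumes a hypothetical 1/0 `diabetes` column with made-up values.

```python
import pandas as pd

# Made-up outcome column: 1 = event occurred, 0 = it did not.
df = pd.DataFrame({"diabetes": [1, 0, 0, 1, 0, 0, 0, 1]})

# Share of each class; normalize=True returns fractions instead of counts.
class_share = df["diabetes"].value_counts(normalize=True)

print(class_share)
```

If the positive class turns out to be only a tiny fraction of the rows, that is a sign the event is rare and will be harder to predict accurately.
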