# Data-Processing-in-Machine-Learning

Basics of preprocessing a machine learning dataset.

The objective of this project is to preprocess the data before feeding it to any machine learning model. Spyder with Python 3.6 was used.

### Code contains information about reading the dataset and applying basic methods to it

Import the important libraries:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

The dataset is loaded and split into independent (X) and dependent (Y) columns:

```python
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values  # all rows, every column except the last
x = pd.DataFrame(X)
Y = dataset.iloc[:, 3].values    # all rows, only the last column
y = pd.DataFrame(Y)
```

`Imputer` fills the empty fields in the data (note: `Imputer` was removed from scikit-learn in version 0.22; newer versions provide `SimpleImputer` in `sklearn.impute` instead):

```python
from sklearn.preprocessing import Imputer

imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])          # learn the mean of each column
X[:, 1:3] = imputer.transform(X[:, 1:3])  # replace NaNs with those means
```
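On current scikit-learn, the same mean imputation can be sketched with `SimpleImputer`. The toy array below is a made-up stand-in for the numeric columns `X[:, 1:3]` of Data.csv:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (age, salary) with missing entries
X_num = np.array([[44.0, 72000.0],
                  [27.0, 48000.0],
                  [np.nan, 54000.0],
                  [38.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_num = imputer.fit_transform(X_num)  # NaNs replaced by each column's mean
```

The API is the same fit/transform pattern as the old `Imputer`; only the import path and the `missing_values` marker (`np.nan` instead of the string `'NaN'`) change.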

`LabelEncoder` converts categorical strings into integers, and `OneHotEncoder` (dummy encoding) expands those integers into binary columns so the model does not read a false ordering into them:

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

label_x = LabelEncoder()
X[:, 0] = label_x.fit_transform(X[:, 0])      # strings -> integer codes
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()  # integer codes -> dummy columns
label_y = LabelEncoder()
Y = label_y.fit_transform(Y)                  # encode the target labels
```
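The `categorical_features` argument was removed in modern scikit-learn; the idiomatic replacement is a `ColumnTransformer` that one-hot encodes the categorical column and passes the rest through. A minimal sketch on a made-up three-row matrix (country string in column 0, two numeric columns after it):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for X: country, age, salary
X = np.array([['France', 44.0, 72000.0],
              ['Spain', 27.0, 48000.0],
              ['Germany', 30.0, 54000.0]], dtype=object)

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), [0])],  # one-hot encode column 0 only
    remainder='passthrough')             # keep the numeric columns unchanged
X = ct.fit_transform(X)
# 3 country categories -> 3 dummy columns, plus the 2 numeric columns
```

Note that `OneHotEncoder` now accepts strings directly, so the preliminary `LabelEncoder` step on `X[:, 0]` is no longer needed (it is still the right tool for encoding the target `Y`).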

Splitting the data into train and test sets.

The `sklearn.cross_validation` module is deprecated and was replaced by `sklearn.model_selection` in newer scikit-learn versions, so `from sklearn.cross_validation import train_test_split` may raise a warning or an error. To avoid this, replace it with `from sklearn.model_selection import train_test_split`. A `test_size` of 0.25 to 0.3 (0.4 at most) is typical:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
```

Feature scaling. Many models rely on Euclidean distance, so features should be on comparable scales. Two common approaches:

- Standardisation: x_stand = (x - mean(x)) / standard_deviation(x)
- Normalisation: x_norm = (x - min(x)) / (max(x) - min(x))

```python
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit on the training set only
X_test = sc.transform(X_test)        # reuse the training mean and std on the test set
```
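The two formulas above can be checked directly with NumPy on a small made-up vector:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Standardisation: x_stand = (x - mean(x)) / std(x)
# -> result has zero mean and unit standard deviation
x_stand = (x - x.mean()) / x.std()

# Min-max normalisation: x_norm = (x - min(x)) / (max(x) - min(x))
# -> result is rescaled into the interval [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())
```

This is exactly what `StandardScaler` computes per feature column; `MinMaxScaler` is the scikit-learn counterpart of the normalisation formula.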
