Data-Processing-in-Machine-Learning

Basics of processing the Machine Learning Dataset

The Objective of this project was to process the data before applying to any machine learning model and Spyder Python 3.6 is used

###Code contains information about reading Dataset and applying basics methods to it

Import important libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Dataset is loaded with segregating dependent and independent columns

dataset= pd.read_csv('Data.csv')
X= dataset.iloc[:, :-1].values  [ Considering all rows and columns except last column ]
x = pd.DataFrame(X)
Y= dataset.iloc[:,3].values [Considering all rows and only last column ] 
y = pd.DataFrame(Y)

Imputer is for filling the empty fields in the data

from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN',strategy='mean', axis= 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

LabelEncoder is for converting categorical string into int and OneHotEncoder [Dummy encoding] for nullifying the int value

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_x = LabelEncoder()
X[:, 0] = label_x.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
label_y = LabelEncoder()
Y = label_y.fit_transform(Y)

Splitting the data into train and test data

from sklearn.cross_validation import train_test_split However the "cross_validation" name is now deprecated and was replaced by "model_selection" inside the new anaconda versions. Therefore you might get a warning or even an error if you run this line of code above.

To avoid this, you just need to replace: from sklearn.cross_validation import train_test_split by from sklearn.model_selection import train_test_split test_size should be 0.25 to 0.3 or 0.4 max

from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train , Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

Feature scaling Basically works on euclidean distance
standardisation and normalization xstand=x-mean(x)/standard deviation(x) xnorm=x-min(x)/max(x)-min(x)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_train)

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
Data.csv		Data.csv
README.md		README.md
code_data.py		code_data.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data-Processing-in-Machine-Learning

About

Releases

Packages

Languages

lokeshsam55/Data-Processing-in-Machine-Learning

Folders and files

Latest commit

History

Repository files navigation

Data-Processing-in-Machine-Learning

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages