Abstract Predicting diabetes on the Pima Indians Diabetes Dataset is a binary classification problem: given the features recorded for a patient, decide whether the patient has the disease or not. We apply several methods of data cleaning, feature extraction, and feature engineering, together with several learning algorithms, to predict the onset of diabetes from the diagnostic measurements in the dataset.
Keywords machine learning; Pima Indians Diabetes dataset; binary classification; features; feature extraction; feature engineering; support vector machine; MLP; neural networks; decision tree; linear regression; heat map; pairplot; violin plot; feature importance.
Database - Pima Indians Diabetes Dataset The Pima Indians Diabetes dataset has 9 attributes in total. All patients in the records are female. In order, the attributes are: (1) the number of pregnancies; (2) the plasma glucose concentration at 2 hours in an oral glucose tolerance test; (3) the diastolic blood pressure (mm Hg); (4) the triceps skin fold thickness (mm); (5) the 2-hour serum insulin (mu U/ml); (6) the body mass index (weight in kg / (height in m)^2); (7) the diabetes pedigree function; (8) the age (years); and (9) the class variable, which is 0 for no diabetes and 1 for its presence.
We began with summary statistics of the dataset. These told us little beyond the size of the dataset (768 records) and the maximum values of Age and Pregnancies; nothing else was of much direct use for prediction. We also counted the records that tested positive and negative for diabetes: 268 and 500 respectively. We then computed the mean BMI per class. Patients with the disease have a mean BMI of 35.14, which lies in the obese range (BMI of 30 or above). It is also interesting that the mean BMI of patients without the disease is about 30, which is the threshold for obesity. For the second attribute, Glucose (plasma glucose concentration), patients with the disease have a mean value of 141.25, a level that indicates a pre-diabetic state of hyperglycaemia, which is associated with insulin resistance and increased risk of cardiovascular pathology.
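The per-class counts and means described above can be computed with a few lines of code. The following is a minimal sketch only: the rows are made-up toy values, not records from the dataset, and the field names `glucose`, `bmi`, and `outcome` as well as the helper `class_summary` are assumed names introduced here for illustration.

```python
from collections import defaultdict
from statistics import mean

# Toy rows for illustration only; these are NOT records from the dataset.
rows = [
    {"glucose": 150.0, "bmi": 36.0, "outcome": 1},
    {"glucose": 130.0, "bmi": 34.0, "outcome": 1},
    {"glucose": 100.0, "bmi": 28.0, "outcome": 0},
    {"glucose": 110.0, "bmi": 30.0, "outcome": 0},
    {"glucose": 120.0, "bmi": 32.0, "outcome": 0},
]


def class_summary(records, features=("glucose", "bmi"), label="outcome"):
    """Return {class: {"count": n, "mean_<feature>": value, ...}}."""
    # Group the records by their class label (0 or 1).
    grouped = defaultdict(list)
    for record in records:
        grouped[record[label]].append(record)

    # For each class, store the record count and the mean of every feature.
    summary = {}
    for cls, members in grouped.items():
        stats = {"count": len(members)}
        for feature in features:
            stats[f"mean_{feature}"] = mean(r[feature] for r in members)
        summary[cls] = stats
    return summary


summary = class_summary(rows)
for cls in sorted(summary):
    print(cls, summary[cls])
```

Applied to the full dataset instead of the toy rows, the same grouping would yield the class counts and the per-class mean BMI and glucose values reported in this section.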
Methodology
A. A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that contains only conditional control statements. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal, and they are also a popular tool in machine learning. The decision tree algorithm proceeds as follows:
• Choose the attribute/feature that best splits the set and make it the root.
• Partition the set into subsets so that the records in each subset share the same value of that attribute.
• Repeat the above steps on each subset until reaching the leaf nodes of the tree, where no further division can take place.
B. In statistics, linear regression is a linear approach for modelling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. (This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.)
C. A multilayer perceptron (MLP) is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. An MLP is trained with a supervised learning technique called backpropagation. Its multiple layers and nonlinear activation distinguish an MLP from a linear perceptron and allow it to distinguish data that is not linearly separable.
D. In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyse data used for classification and regression analysis.
Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
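The "side of the gap" rule described above can be made concrete for the linear case. The sketch below assumes a separating hyperplane w·x + b = 0 has already been found; the weights `w` and bias `b` are hand-picked hypothetical values, not parameters learned from the diabetes dataset, and the function names are introduced here for illustration. It shows only the prediction step and the width of the gap (2 / ||w|| for a linear SVM), not the training procedure.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Assign class 1 on the non-negative side of the hyperplane, class 0 otherwise."""
    return 1 if decision_value(w, b, x) >= 0 else 0


def margin_width(w):
    """Distance between the two margin hyperplanes w.x + b = +1 and -1: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical parameters chosen by hand for illustration (not learned from data).
w = [3.0, 4.0]
b = -5.0

print(predict(w, b, [2.0, 1.0]))  # point on the positive side
print(predict(w, b, [0.0, 0.0]))  # point on the negative side
print(margin_width(w))
```

A training algorithm searches for the `w` and `b` that make this gap as wide as possible while keeping the training examples of the two classes on opposite sides.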
Conclusion We applied several algorithms together with extensive feature manipulation and extraction. The best accuracy, 80.5%, was obtained with the SVM. Exploratory data analysis alone, without any complex algorithm, also yielded a good deal of information about the dataset and led to several of our conclusions. Random forests and other ensemble learning methods could probably achieve a better result. Our result was also very close to the best result found, which shows that with the right parameters an SVM can be a good and practical choice for classifying medical data.