Identify which Enron employees are likely to have committed fraud using Machine Learning

Introduction

We want to build a classifier that can identify persons of interest in the Enron scandal [enronWiki] from the confidential company data that was made available to the public during the federal investigation. All data and code are available for download from the git repository [gitRepo].

Data

We use

  • text data from the Enron corpus [enronCorpus], a publicly available data set of email messages sent or received by 150 senior managers of the Enron Corporation, and
  • detailed financial data for top executives.

The Enron trial has been widely covered by newspaper articles [enronUsaToday] and other news sources [enronScore] [enronPBS] that indicate a long list of people who

  • were indicted (14 from Enron, 4 from Merrill Lynch),
  • settled without admitting guilt (total 5, 2 from Enron), or
  • testified for the government in exchange for immunity (8 from Enron).

Any Enron employee from the above list is a person of interest (POI). A hand-generated list of POIs, combined with the email and financial data, was given to us in pickle format as the starting point of our detective work.

After loading the pickle data into a Python dictionary, an early exploration of the data reveals (code available in poi_id.py):

  • data available for 146 people
  • 21 features are collected on each person
  • 18 people in the data set are marked as POIs

The data set has relatively few examples, only 146, and contains only 18 of the 30 or so known POIs. Besides being fairly small, the data set also has many missing entries, denoted as 'NaN'.
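A minimal sketch of this first pass, assuming the pickle file name used below (adjust to the file actually shipped with the repository):

```python
import pickle

# Load the combined email + financial data set into a dictionary keyed by person.
# The file name below is an assumption; use the pickle provided with the project.
with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

print(len(data_dict))                                    # 146 people
print(len(next(iter(data_dict.values()))))               # 21 features per person
print(sum(1 for p in data_dict.values() if p["poi"]))    # 18 POIs
```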

We wanted to check whether POIs have fewer or more NaNs than non-POIs. So for each feature, we compared the percentage of missing entries among POIs and non-POIs and looked for large differences (a sketch of this check follows below). The following features showed a large difference in the percentage of missing entries:

  • expenses (POIs all known, non-POIs 40% unknown)
  • email_address (POIs all known, non-POIs 27% unknown)
  • total_stock_value (POIs all known, non-POIs 39% unknown)
  • total_payments (POIs all known, non-POIs 16% unknown)
  • salary (POIs 5.55% unknown, non-POIs 15% unknown)
  • bonus (POIs 11% unknown, non-POIs 48% unknown)

It is not surprising that more financial information is available on the 18 POIs than on the 128 non-POIs. However, we should be careful about using features that have a large percentage of unknown entries.
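A rough sketch of the missing-entry comparison over the dictionary loaded above (the feature names shown are a subset, for illustration):

```python
# Percentage of 'NaN' entries per feature, split by POI / non-POI.
pois    = [p for p in data_dict.values() if p["poi"]]
nonpois = [p for p in data_dict.values() if not p["poi"]]

def pct_missing(group, feature):
    return 100.0 * sum(1 for p in group if p[feature] == "NaN") / len(group)

for feature in ("expenses", "total_stock_value", "bonus", "salary"):
    print(feature, pct_missing(pois, feature), pct_missing(nonpois, feature))
```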

Besides emails, we also want to explore how much money people were making. A scatter plot of salaries and bonuses, shown below, indicates an outlier. A closer look at the financial document table reveals that the outlier data point comes from a row called "TOTAL" which, instead of representing a real person, simply adds up all other rows. Such an outlier should clearly be ignored. After dropping this example from the data, we can still see some outliers that correspond to extraordinary income of some individuals (e.g. Ken Lay and Jeff Skilling). Instead of rejecting these outliers, we need to pay closer attention to them for clues to POI patterns.
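A small sketch of dropping the spreadsheet artifact and re-plotting, assuming matplotlib is available:

```python
import matplotlib.pyplot as plt

# "TOTAL" is a spreadsheet summary row, not a person; drop it before plotting.
data_dict.pop("TOTAL", None)

points = [(p["salary"], p["bonus"]) for p in data_dict.values()
          if p["salary"] != "NaN" and p["bonus"] != "NaN"]
plt.scatter(*zip(*points))
plt.xlabel("salary")
plt.ylabel("bonus")
plt.show()
```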

Features

Instead of creating a POI identifier using all 21 features of the original data set as is, we first performed careful feature selection to

  • select certain features,
  • create some new features, and
  • properly scale the features.

This section describes the feature selection process.

Even though we don't really know what causes a person to be a POI, Enron's tale of corporate greed is widely known. Hence we can intuitively include the key features associated with financial gain, e.g. salary, bonus, and exercised_stock_options. This intuition was also validated during our outlier investigation, where we noted that people with extremely high salary and exercised_stock_options are often POIs. However, for feature selection we would like to follow an exhaustive, robust, and scientific methodology that does not rely on intuition alone. In particular, we want to arrive at a feature set that is as simple as possible, but no simpler:

  1. add new features to expose hidden patterns (make it no simpler)
  2. select the best features by getting rid of stuff that does not help (make it as simple as possible)

We start by adding new features to draw out as many patterns in the data as possible. From scatter plots we can see that POIs themselves are often the senders (resp. recipients) of the largest number of emails to (resp. from) other POIs. The only exception is LAVORATO JOHN J, a senior Enron executive who received the largest number of emails from POIs but is not himself a POI. Therefore we add a new feature called "with_poi", which sums the two features "from_poi_to_this_person" and "from_this_person_to_poi" (a sketch follows the table below). This combined feature, which is expected to have stronger discriminating power, improves the precision and recall of the Decision Tree and AdaBoost classifiers, even though it does not improve the performance of the KNN classifier that we eventually selected. Here is a table showing the effect of the new "with_poi" feature on the precision and recall of several different classifiers.

Algorithm              Original Features       With "with_poi" Feature
                       Precision   Recall      Precision   Recall
K-Nearest Neighbors    0.74951     0.38450     0.74948     0.36050
Decision Tree          0.37287     0.29550     0.42115     0.33250
Gaussian Naive Bayes   0.48281     0.30900     0.42308     0.33550
Random Forest          0.54555     0.26050     0.45289     0.21150
AdaBoost               0.37516     0.30200     0.41307     0.30650
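As referenced above, here is a minimal sketch of how the combined feature could be computed for each record (illustrative, not necessarily the exact code in poi_id.py):

```python
# Add 'with_poi' = emails received from POIs + emails sent to POIs.
for person in data_dict.values():
    received = person["from_poi_to_this_person"]
    sent = person["from_this_person_to_poi"]
    if received == "NaN" or sent == "NaN":
        person["with_poi"] = "NaN"
    else:
        person["with_poi"] = received + sent
```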

After adding "with_poi", and removing a text feature called "email_address", we end up with effectively 20 features. How do we know which features indicate a "poi" ? We start by measuring the importance of each individual feature with the Decision Tree classifier. In this context, the importance of a feature defined as the reduction of the impurity criterion brought by that feature. Features with higher importance are likely to have higher discriminating power. This is shown in the following table.

Feature                      Importance
exercised_stock_options      0.200
expenses                     0.174
bonus                        0.162
restricted_stock             0.120
total_payments               0.113
from_messages                0.087
from_this_person_to_poi      0.065
shared_receipt_with_poi      0.048
other                        0.030
to_messages                  0.000
with_poi                     0.000
deferral_payments            0.000
total_stock_value            0.000
from_poi_to_this_person      0.000
deferred_income              0.000
long_term_incentive          0.000
salary                       0.000
loan_advances                0.000
restricted_stock_deferred    0.000
director_fees                0.000
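A sketch of how the importances can be read off a fitted tree; the `features`, `labels`, and `feature_names` arrays are assumed to have been prepared from the dictionary beforehand:

```python
from sklearn.tree import DecisionTreeClassifier

# `features`, `labels`, and `feature_names` are assumed to be prepared elsewhere.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(features, labels)

# Print features in decreasing order of impurity-reduction importance.
for name, score in sorted(zip(feature_names, clf.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name:28s} {score:.3f}")
```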

Next we select the minimum set of features that gives the most classifying power. Of the many feature selection techniques available, we chose univariate feature selection, which selects the best features based on the ANOVA F-value of the samples. The SelectKBest utility from sklearn was used for different values of k, and each selected set of features was tested with two different classifiers, KNN and Decision Tree. Before using the KNN classifier, we applied MinMax scaling to the features. The results from each classifier also depend on parameter tuning: we used the default KNeighborsClassifier() from sklearn, and a hand-tuned Decision Tree classifier with parameters min_samples_split=5, min_samples_leaf=2. The results are summarized in the following table, with a code sketch of the procedure after it. We also hand-picked a set of features based on intuition and brute-force search, reported in the last row of the table. For each classifier, look for the best combination of precision and recall values.

Selector    KNN                               Decision Tree
            Precision   Recall    F1          Precision   Recall    F1
2 Best      0.695       0.204     0.315       0.369       0.206     0.265
3 Best      0.797       0.277     0.411       0.371       0.273     0.315
4 Best      0.435       0.042     0.076       0.352       0.261     0.300
5 Best      0.437       0.098     0.159       0.273       0.184     0.220
6 Best      0.249       0.029     0.051       0.295       0.203     0.241
7 Best      0.143       0.015     0.027       0.298       0.223     0.255
8 Best      0.158       0.017     0.030       0.304       0.225     0.259
9 Best      0.002       0.001     0.001       0.299       0.248     0.271
10 Best     0.002       0.001     0.001       0.305       0.248     0.274
11 Best     0.206       0.052     0.083       0.289       0.247     0.266
12 Best     0.125       0.013     0.024       0.290       0.245     0.266
13 Best     0.000       0.000     0.000       0.274       0.239     0.255
14 Best     0.000       0.000     0.000       0.232       0.179     0.202
Manual      0.010       0.002     0.003       0.426       0.337     0.376
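A sketch of one cell of this experiment (k = 3 with the KNN classifier), assuming `features` and `labels` are prepared as before:

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Scale, pick the k best features by ANOVA F-value, then classify with KNN.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectKBest(f_classif, k=3)),
    ("clf", KNeighborsClassifier()),
])
pipeline.fit(features, labels)
```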

For the KNN classifier, SelectKBest with k=3 produced the best F1 score (the harmonic mean of precision and recall). These 3 features are 'bonus', 'total_stock_value', and 'exercised_stock_options'. However, for the hand-tuned Decision Tree classifier, the manually selected features 'salary', 'bonus', 'exercised_stock_options', and 'with_poi' outperformed the best features selected by the SelectKBest utility.

As observed above, feature selection schemes perform differently on different classifiers. Even on the same classifier, the performance of a scheme may vary with different parameter tunings. Additionally, a different feature selection utility, e.g. chi-square or Recursive Feature Elimination (RFE), may lead to a different set of features being selected. In the interest of time, these experiments are left as future work.

Feature scaling is important for classification algorithms that trade off one feature against another. As discussed in the next section, we found that the K-Nearest Neighbors classifier gave the best precision and recall numbers. KNN's Euclidean distance metric is affected by feature scaling, so we performed MinMax scaling before fitting the KNN classifier. This makes a noticeable difference in KNN's performance numbers.

Algorithm

Now we discuss the process of picking an algorithm: which algorithms were tried, and which algorithm was eventually chosen.

The K-Nearest Neighbors (KNN) classifier outperformed all other classifiers by a wide margin. It achieved a precision of 0.72, almost twice that of any other classifier we attempted.

The second best performer was the Decision Tree classifier. Performance numbers for both classifiers are given in the following table. The new feature "with_poi", described above, helped improve the precision and recall scores of the Decision Tree classifier.

Several other classifiers, including NaiveBayes, AdaBoost, and RandomForest, were also attempted.

The following table summarizes the performance numbers of all the classifiers. For each kind, the default classifier provided by the sklearn package was used. The parameters of the two best classifiers, KNN and Decision Tree, were later tuned for improved performance, as described in the next section.

Classifier       Accuracy   Precision   Recall   F1-score   F2-score
kNN              0.878      0.797       0.277    0.411      0.318
Decision Tree    0.800      0.355       0.363    0.359      0.362
Naive Bayes      0.831      0.430       0.292    0.348      0.312
Random Forest    0.839      0.450       0.214    0.290      0.239
AdaBoost         0.826      0.414       0.307    0.353      0.324

Parameter Tuning

Parameters are arguments passed when a classifier is created, before fitting. Parameters can make a huge difference in the decision boundary that the classification algorithm arrives at. Sometimes parameter tuning can cause overfitting, and choosing the right parameters while avoiding overfitting is an art.

GridSearchCV was used to systematically tune the parameters of the two best-performing classifiers, KNN and Decision Tree. The results are summarized in the table below, followed by a sketch of the grid-search step. In addition to the grid search through possible parameters, the table also shows the results of hand-tuning the KNN and Decision Tree classifiers based on our understanding of the classification algorithms. It is interesting to note that the hand-tuned Decision Tree slightly outperforms the grid-searched Decision Tree.

The following observations were made while hand-tuning the KNN classifier:

  1. Parameter p=1 (use Manhattan distance instead of Euclidean) did actually help, and improved both the precision and recall scores. Hence we used p=1 in our final classifier.

  2. Parameter leaf_size was varied from 5 to 100 (default=30), but it had no significant effect.

For the Decision Tree classifier we tried the following parameter settings:

  1. splitter="random" (default="best") degraded the performance.
  2. min_samples_split=5 (default=2) improved the precision but degraded recall.
  3. default parameter values gave the best balance between precision and recall.

Tuned Classifier              Accuracy   Precision   Recall   F1-score   F2-score   Tuned params
kNN (grid-tuned)              0.893      0.868       0.362    0.511      0.410      {'n_neighbors': 5, 'leaf_size': 30, 'p': 1}
kNN (hand-tuned)              0.821      0.166       0.041    0.066      0.048      {'n_neighbors': 5, 'leaf_size': 30, 'p': 1}
Decision Tree (hand-tuned)    0.827      0.423       0.334    0.373      0.348      {'min_samples_split': 3, 'splitter': 'best', 'min_samples_leaf': 2}
Decision Tree (grid-tuned)    0.828      0.422       0.327    0.368      0.342      {'min_samples_split': 3, 'splitter': 'best', 'min_samples_leaf': 2}
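A sketch of the grid-search step for KNN; the grid values shown are illustrative, and `scaled_features` is assumed to be the MinMax-scaled feature array:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search a small grid of KNN parameters, scoring by F1 to balance precision and recall.
param_grid = {"n_neighbors": [3, 5, 7], "p": [1, 2], "leaf_size": [10, 30, 50]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="f1", cv=5)
grid.fit(scaled_features, labels)
print(grid.best_params_, grid.best_score_)
```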

Validation Strategy

Validation gives an estimate of performance on an independent data set and provides a counterbalance to overfitting. Without validation, it is hard to know when to stop tuning the classifier on the training data.

It is common practice to split data into training and test sets, and to use the test data for validation. This presents a dilemma because setting aside test data can adversely affect how well the classifier can be trained. Splitting the limited amount of data that we have, covering only 146 Enron employees, would make it hard to either train or test any classifier. Hence we use an alternative technique, cross-validation, to use the limited data for both training and testing.

A particular cross-validation method called StratifiedShuffleSplit was used instead of a simple train/test split. StratifiedShuffleSplit creates random training and test sets multiple times, and the results are averaged over all the splits.

The class distribution of POIs and non-POIs is heavily skewed at 18 to 128. In such cases, StratifiedShuffleSplit tries to maintain the same non-POI:POI ratio in every training and test set that it creates from the larger data set. Instead of the accuracy score, we used precision, recall, and F1 scores to mitigate the effect of the skewed class distribution when validating the results.
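A sketch of this validation loop, assuming `features` and `labels` are NumPy arrays and `clf` is the classifier under test (the number of splits is illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import precision_score, recall_score

# Each split preserves the ~18:128 POI/non-POI ratio; scores are averaged over splits.
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
precisions, recalls = [], []
for train_idx, test_idx in sss.split(features, labels):
    clf.fit(features[train_idx], labels[train_idx])
    pred = clf.predict(features[test_idx])
    precisions.append(precision_score(labels[test_idx], pred, zero_division=0))
    recalls.append(recall_score(labels[test_idx], pred, zero_division=0))
print(np.mean(precisions), np.mean(recalls))
```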

Metrics

The KNN classifier with parameters p=1, n_neighbors=5, leaf_size=30 achieved a precision score of 0.868, a recall score of 0.362, and an F1-score (the harmonic mean of precision and recall) of 0.511. It also achieved an accuracy score of 0.89, even though accuracy is not an ideal metric here given the class imbalance.

The precision score of 0.868 implies that once the classifier has flagged an employee as a POI, there is an 87% chance that the employee really is a POI. On the other hand, the recall score of 0.362 implies that, given an actual POI, there is a 36% chance that the classifier will flag them.
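In confusion-matrix terms, the two scores can be read as follows (the counts below are made up purely to illustrate the definitions):

```python
# Hypothetical counts, for illustration only.
tp, fp, fn = 30, 5, 53   # true positives, false positives, false negatives

precision = tp / (tp + fp)  # of everyone flagged as a POI, the fraction who truly are
recall    = tp / (tp + fn)  # of all true POIs, the fraction the classifier flags
f1        = 2 * precision * recall / (precision + recall)
```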

References

[gitRepo] Detecting Enron Fraud

[enronWiki] Enron Scandal

[enronUsaToday] A look at those involved in the Enron Scandal

[enronCorpus] Enron Email Corpus hosted by CMU

[enronNeighbors] Nearest Neighbors

[enronPBS] Enron: The Smartest Guys in the Room, PBS Independent Lens

[enronScore] Enron scorecard
