Data and the age of technology have benefitted society a great deal, but these advancements have not come without downsides. One of them is identity fraud. Fraudsters, scammers, and hackers can access information with unprecedented ease, and unsuspecting victims often fall for methods such as phone calls, emails, or even fake job postings. In an effort to help prevent identity theft, we use a dataset acquired from Kaggle to build a machine learning model that predicts fraudulent vs. non-fraudulent job postings based on several parameters.
To view an executive presentation of our project, please visit this link
To view a Tableau dashboard of our project, please visit this link
In data analysis, there are several important steps to follow to ensure a good result, which allows us to break our project down into the following segments:
- Initial Data Analysis
- Data Preprocessing
- Feature Engineering
- Train/test split of the data for machine learning
- Validate
- Final result
Much of the data will be summarized with figures; refer to the cleaning file for more detail.
The data imported from the CSV was a 17,880 x 18 dataframe containing various data types, as noted below:
Some columns also had a staggering number of null values, with 'salary_range' sitting at 15,012 nulls.
Further exploration also revealed no notable correlation between the numeric columns, as shown in the heatmap.
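A minimal sketch of how these initial checks can be reproduced, assuming the Kaggle CSV has been loaded into a dataframe named dataset_df (the file name below is a placeholder):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

dataset_df = pd.read_csv("fake_job_postings.csv")  # placeholder file name

# Count nulls per column to spot sparse fields such as salary_range
print(dataset_df.isnull().sum().sort_values(ascending=False))

# Correlation heatmap over the numeric columns only
sns.heatmap(dataset_df.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")
plt.show()
```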
The job_id, description, and requirements columns were dropped for these summaries, which will be explained in later sections.
The overall fraud rate is relatively low, at roughly 5% of postings (95.2% are legitimate), and it varies greatly between posting attributes. As we can see from the figures below, fraud counts are highest for jobs with the following attributes: entry level, high school level education, full-time, oil & energy industry, and administrative/engineering functions.
Aside from just the counts, we should also consider the percentages, which gave the following highest fraud rates: Administrative at 18.9%, Oil & Energy at 38.0%, Full-time at 8.5%, and High School education at 2.0%.
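These percentages can be computed by grouping on each categorical column and averaging the fraudulent flag; a sketch assuming dataset_df and the standard column names from this dataset:

```python
# Per-category fraud rate (%) for the attributes discussed above
for col in ["function", "industry", "employment_type", "required_education"]:
    rate = dataset_df.groupby(col)["fraudulent"].mean().mul(100).sort_values(ascending=False)
    print(f"\nFraud rate (%) by {col}:")
    print(rate.head())
```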
We first dropped three columns, job_id, salary_range, and title, because they had either too many nulls or all unique values. We were still left with a large number of nulls, but since the dataset is skewed with only ~5% fraudulent entries we did not want to remove rows, so the nulls were filled with 'not specified'. The location column was then split so that only the country was kept, for consistency. Further modifications included removing punctuation and various special characters and then lower-casing all text using the following code:
# Strip punctuation/special characters and collapse repeated whitespace in the text columns
for col in clean_cols:
    dataset_df[col] = dataset_df[col].replace(r'[^a-zA-Z0-9\s]', '', regex=True)
    dataset_df[col] = dataset_df[col].replace(r'\s{2,}', ' ', regex=True)

# Lower-case every string (object) column
string_cols = list(dataset_df.select_dtypes(include='object'))
for col in string_cols:
    dataset_df[col] = dataset_df[col].str.lower()
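The country extraction mentioned above can be done with a simple string split; a sketch assuming the location values follow the dataset's "COUNTRY, STATE, CITY" pattern:

```python
# Keep only the country code from 'location'
# (assumes values look like "us, ny, new york"; missing values become 'not specified')
dataset_df['Country'] = (
    dataset_df['location']
    .fillna('not specified')
    .str.split(',')
    .str[0]
    .str.strip()
)
```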
The data was subdivided into nominal columns ('department', 'industry', 'function', 'Country') and ordinal columns ('employment_type', 'required_experience', 'required_education'). Target encoding was used on the nominal columns to convert their categorical values into numbers related to the mean of the fraudulent target.
# Target-encode the nominal columns (assumes the category_encoders package)
from category_encoders import TargetEncoder

Targetenc = TargetEncoder()
for col in nom_cols:
    # Replace each category with a value derived from the mean of 'fraudulent' for that category
    values = Targetenc.fit_transform(X=dataset_df[col], y=dataset_df['fraudulent'])
    dataset_df[col] = values[col]
Label encoding was applied to the ordinal columns in order to convert their categories into integers for the machine learning model.
# Label-encode the ordinal columns
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for col in ord_cols:
    dataset_df[col] = le.fit_transform(dataset_df[col])
After encoding, the resulting dataset was made up primarily of numerical data that could be fed into the machine learning model.
For the four columns containing long free-text strings (requirements, description, company, and benefits), we used NLTK and scikit-learn.
First, the stop words were removed from these columns using a lambda function, and then stemming and lemmatization were applied. Finally, the columns were combined into a single column and tokenized.
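A sketch of this preprocessing, assuming NLTK's English stopword list, PorterStemmer, and WordNetLemmatizer; the text_cols list and the combined 'text' column name are illustrative, not the project's exact identifiers:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

text_cols = ['requirements', 'description', 'company_profile', 'benefits']  # illustrative names

for col in text_cols:
    # Drop stop words, then stem and lemmatize each remaining word
    dataset_df[col] = dataset_df[col].fillna('').apply(
        lambda text: ' '.join(
            lemmatizer.lemmatize(stemmer.stem(w))
            for w in text.split() if w not in stop_words
        )
    )

# Combine the text columns into one and tokenize it
dataset_df['text'] = dataset_df[text_cols].agg(' '.join, axis=1)
dataset_df['tokens'] = dataset_df['text'].apply(word_tokenize)
```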
Once these columns were tokenized, term frequency-inverse document frequency (TF-IDF) was used to determine the relevance of each word to each job posting and to the dataset as a whole.
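A sketch of the TF-IDF step with scikit-learn, assuming the combined 'text' column from above; max_features is an illustrative cap, not a value from the project:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Weight each word by how often it appears in a posting vs. how common it is across all postings
vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = vectorizer.fit_transform(dataset_df['text'])

print(tfidf_matrix.shape)  # (n_postings, n_terms)
```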
Initialization of the ML model followed standard procedure: the data was split into training and test sets, with the test set making up 10% of the overall dataset.
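A sketch of the split, assuming X holds the encoded/TF-IDF features and y the fraudulent labels; the stratify and random_state arguments are our additions for reproducibility:

```python
from sklearn.model_selection import train_test_split

# 90/10 train/test split; stratify keeps the ~5% fraud ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)
```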
The machine learning model chosen is LightGBM, a gradient-boosting framework that uses tree-based learning algorithms. With 250 iterations and a learning rate of 0.08, training takes ~18 seconds and achieves an accuracy of ~98% on the test dataset.
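A sketch of this training step using the lightgbm scikit-learn interface; n_estimators and learning_rate mirror the values mentioned above, and the remaining defaults are assumptions:

```python
import lightgbm as lgb
from sklearn.metrics import accuracy_score

# 250 boosting iterations at a 0.08 learning rate, as described above
model = lgb.LGBMClassifier(n_estimators=250, learning_rate=0.08)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.3f}")
```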
The LightGBM model has the following benefits:
- faster training speed with higher accuracy compared to other models
- lower memory usage
- better compatibility with large datasets
The LightGBM model has the following limitations:
- sensitive to overfitting due to producing more complex trees compared to other models
- prone to overfitting on small datasets, making LightGBM a poor fit for them
Looking deeper into our ML model revealed that the industry feature was the most important, and it remained the most important even after switching from LightGBM to XGBoost. Interestingly, changing the learning rate did not appreciably change the overall accuracy: trying learning rates between 0.05 and 1, altering the number of iterations, and increasing or decreasing the test and training set sizes all yielded a relatively stable accuracy rating.
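A sketch of how the feature importances can be inspected from the fitted model; feature_names is assumed to hold the column names in the same order as the training matrix:

```python
import pandas as pd

# Rank features by the importance scores LightGBM assigns during training
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```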
Taking a look at our confusion matrix, non-fraudulent jobs were predicted extremely accurately, with less than 1% of non-fraudulent postings predicted as fraud. However, 37.6% of fraudulent job posts were predicted as non-fraudulent.
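A sketch of how this matrix can be produced and normalized per true class, using the predictions from the training sketch above:

```python
from sklearn.metrics import confusion_matrix

# Rows are the true classes (non-fraud, fraud); normalizing by row gives per-class rates
cm = confusion_matrix(y_test, y_pred, normalize='true')
print(cm)
```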
Overall, the machine learning model for detecting job fraud proved to be a success. We achieved an accuracy rating of over 98% with the LightGBM model, and even with fluctuations in the learning rate the model proved quite resilient. We were also able to determine that certain sectors of the employment field are more susceptible to fraudulent postings, such as jobs with lower requirements (education/experience) and administrative/engineering positions.
While the project was a success, the limited availability of posting data in CSV format limited the amount of data this machine learning model could use for training. Future projects should look to scrape more data from the web or use a larger data source to improve the model, as the sample size of fraudulent jobs is quite low. Further improvements could enable implementation of this model in various job boards to help prevent identity theft.