Our goal in this project is to examine this dataset of job postings and predict salaries for a new set of postings. This involves building a model on the training data and using it to predict the salaries for the test dataset.
A practical use case is an HR department of a large company, or a consulting outfit, that needs real-time solutions in order to make effective employment offers to potential hires.
It is also useful for understanding current realities in the job market and how businesses can leverage them to secure high-quality talent while keeping hiring costs low.
The primary tool used for this project is Python 3, along with an extensive array of libraries and packages for data manipulation and the development of predictive modeling algorithms.
This project is broken down into sections reflecting our twofold goal:
- Load and Explore Data
- Develop and train a suitable prediction model
With Python, we created "Data" and "Plots" classes and objects to aid us in extracting and manipulating the given data. We also created "FeatEng" and "ModelEvaluation" classes to help with creating new features and selecting the best model for predicting employee salaries.
The datasets provided contain 7 features which can help us determine the salaries of various job roles. Salary is thus identified as the target of our prediction.
- jobID - unique for each job entry; eventually excluded from building the model
- companyID - unique for the 63 companies represented in the data
- jobType - various job roles and levels within the companies
- degree - level of educational qualification of employees in those job descriptions
- major - subject of study in relation to the educational qualification
- industry - sector of the economy the company belongs to, e.g. Oil, Health
- yearsExperience - years of experience of the employee in each data entry
- milesFromMetropolis - distance of the employee's residence from the nearest large urban center where the office is located
- salary - our prediction target
This involves looking at the Data Summaries and Visualizations in order to:
- Examine the Data
- Discover patterns and relationships between the features
- Identify the types of data
- Clean up the Data
Below is an overview of the Train Dataframe (after merging the 'train_features' and 'train_target' files):
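The merge step can be sketched as follows. This is a minimal illustration with tiny inline frames standing in for the real files; the join key name ("jobID") and column names follow the feature list above, but the exact file layout is an assumption.

```python
import pandas as pd

# Stand-ins for the train_features and train_target files (hypothetical values)
features = pd.DataFrame({
    "jobID": ["JOB1", "JOB2"],
    "jobType": ["CEO", "JANITOR"],
    "yearsExperience": [10, 3],
})
target = pd.DataFrame({"jobID": ["JOB1", "JOB2"], "salary": [180, 45]})

# Inner join keeps only postings present in both files
train_df = features.merge(target, on="jobID", how="inner")
print(train_df.columns.tolist())
```

An inner join is a safe default here: any posting missing its salary (or vice versa) is silently dropped rather than producing NaN targets.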
We can see from the distribution and boxplots below that yearsExperience and milesFromMetropolis are evenly distributed, while salary has a distribution that is close to normal. The salary boxplot also shows outliers, which require attention; these are dealt with later in the project.
From the boxplots showing the categorical features (as seen below), we see that:
- The Senior Job types obviously earn the highest salaries (Strong positive correlation)
- Advanced Degrees tend to attract higher salaries
- Engineering, Business, Math and Computer Science are on the top end of the salary continuum
- The Oil and Finance Industries are the highest paying in the job market represented by this data
The trends spelt out above are corroborated by the heatmap (correlation matrix), which shows jobType as the feature most strongly correlated with salary. Degree and major have the strongest positive relationship with each other, understandably:
After inspecting the salary distribution boxplot, we can see the existence of outliers; the lower outliers sit exactly at zero. We used the IQR rule to identify these outliers and then made our determination as follows:

- The entries above the upper bound (220.5) appear legitimate. Most of those roles are senior roles, so the values are realistic. We will leave them in the train dataset.
- The entries with zero salary appear faulty, as those positions are apparently not volunteer positions. We will remove them from the training set.
We'll accomplish this by defining a "clean_data" method when creating the feature engineering class.
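A minimal sketch of that cleaning step is shown below. The function name follows the report's "clean_data", but the signature and return values are assumptions; the real implementation lives inside the feature engineering class.

```python
import pandas as pd

def clean_data(df, target="salary"):
    """Drop faulty zero-salary rows and report the IQR upper fence.

    Per the analysis above: upper outliers look legitimate (senior roles)
    and are kept; zero salaries are treated as data errors and removed.
    """
    q1, q3 = df[target].quantile([0.25, 0.75])
    upper_bound = q3 + 1.5 * (q3 - q1)  # IQR rule for the upper fence
    cleaned = df[df[target] > 0].reset_index(drop=True)
    return cleaned, upper_bound

# Tiny demo frame (hypothetical values)
demo = pd.DataFrame({"salary": [0, 50, 100, 150, 200]})
cleaned, upper = clean_data(demo)
print(len(cleaned))  # the zero-salary row is removed
```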
We transformed the categorical features by applying label encoding and then grouping the data by them. The group statistics were selected as new features: group_mean (mean salary), group_median (median salary), group_min (minimum salary), group_max (maximum salary), and group_mad (mean absolute deviation).
We also removed the zero-salary entries with the clean_data method.
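The group-statistics idea can be sketched as below. The helper name `add_group_stats` and the demo values are hypothetical; the five statistics match the ones listed above, with the mean absolute deviation computed manually (pandas 2.x no longer ships a built-in `mad`).

```python
import pandas as pd

def add_group_stats(df, group_cols, target="salary"):
    """Attach per-group salary statistics as new numeric features."""
    grouped = df.groupby(group_cols)[target]
    stats = grouped.agg(group_mean="mean", group_median="median",
                        group_min="min", group_max="max").reset_index()
    # Mean absolute deviation around each group's mean
    mad = grouped.apply(lambda s: (s - s.mean()).abs().mean())
    stats["group_mad"] = mad.values
    return df.merge(stats, on=group_cols, how="left")

# Tiny demo frame (hypothetical values)
demo = pd.DataFrame({
    "jobType": ["CEO", "CEO", "JANITOR"],
    "salary": [100.0, 120.0, 40.0],
})
# Label encoding via pandas category codes
demo["jobType_enc"] = demo["jobType"].astype("category").cat.codes
out = add_group_stats(demo, ["jobType"])
print(out[["jobType", "group_mean", "group_mad"]])
```

In practice the grouping would span several categorical columns (e.g. jobType, degree, major, industry) so each group captures a narrow salary band.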
We selected three different Regression Algorithms for Evaluation:
- Linear Regression
- Random Forest Regression
- Gradient Boosting Regressor
The Evaluation Metric selected is the MSE (Mean Squared Error).
For our Baseline Model, we used the calculated mean of the target, and included that amongst the models selected for Evaluation. The model with the lowest MSE was selected as the best model.
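The baseline idea can be sketched in a few lines of numpy: predict the training mean for every posting and score it with MSE. The salary values here are hypothetical; in the real pipeline, scikit-learn's `mean_squared_error` plays the same role for all four models.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between true and predicted targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical training targets
y_train = np.array([100.0, 120.0, 80.0, 140.0])

# Baseline model: predict the mean salary for every posting
baseline_pred = np.full_like(y_train, y_train.mean())
print(mse(y_train, baseline_pred))  # equals the variance of y_train: 500.0
```

Any candidate model must beat this number to be worth keeping; the baseline MSE equals the variance of the target, which is exactly what a constant predictor leaves unexplained.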
After passing the models through our evaluation code, we ended up with the following MSE values:

- Baseline: 644.26
- Linear Regression: 358.15
- Random Forest Regressor: 313.27
- Gradient Boosting Regressor: 313.06 (selected as best model)
The key predictors for this model are the group mean salary, followed by yearsExperience, as shown in the Feature Importances plot:
- Files saved for further testing/deployment
- We could further improve the model by creating new features from the yearsExperience and milesFromMetropolis features