# Regression Assignment
The due date is Thursday, Jan 23.

## The Problem

In this assignment, we are going to recreate the Kaggle competition to predict salaries from job descriptions: http://www.kaggle.com/c/job-salary-prediction. This problem came from a company called Adzuna, which wanted to be able to predict the salary of a job based solely on its description.
## The Data

We've pared down the problem a bit to keep the dataset small and allow you to complete the assignment in a week. There are 5 important files available.
The first four files are the ones you can train your models on; start by using train.csv. It contains 10k training examples. The larger files may start to slow down your system, requiring you to wait longer while the model trains. They contain the following columns:
The test file contains all but the final two columns (the salary information).
At the end, you will submit your best predictions for the test file in the format:

    Id, Salary
    1234, 77999
    2345, 88999

Also, please submit a commented Python file with some of the things you tried.
## Assignment

At any point in the following steps, you can use your model to predict salaries on the final test set and submit. Steps are ranked from Basic (no star) to Challenging (3 stars):

- no star - Basic
- `*` - 1 star - Try these out, but they will be harder
- `**` - 2 stars - For those looking for a challenge or wanting to really explore the topic
- `***` - 3 stars - Test out some of the ideas using different tools
1) Split the data into training and test sets (using one of files 1-4 above). You will use one split as your training set and the other for validation. At the end, you may train your final parameters on the full dataset.
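A minimal sketch of the split using scikit-learn's `train_test_split`. The DataFrame below is a synthetic stand-in for train.csv, and the column names are illustrative, so adjust them to match the real file:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train.csv; real column names may differ.
df = pd.DataFrame({
    "Category": ["IT", "IT", "Sales", "Sales"] * 25,
    "ContractTime": ["full_time", "part_time"] * 50,
    "SalaryNormalized": range(100),
})

# Hold out 20% of the rows for validation; fit models on the rest.
train_df, valid_df = train_test_split(df, test_size=0.2, random_state=0)
```

Fixing `random_state` makes the split reproducible, which matters when you compare models across runs.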
2) Build a simple linear regression using the available categorical variables. Try adding and dropping predictors to see whether they improve the model, and try adding interaction effects. (Note: beware of the computational overhead.) Compare both R-squared and MAE on your test set.
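A sketch of this step, assuming the data has categorical columns such as `Category` and `ContractTime` (the toy frame below stands in for train.csv):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the training data; real columns may differ.
df = pd.DataFrame({
    "Category": ["IT"] * 50 + ["Sales"] * 50,
    "ContractTime": ["full_time", "part_time"] * 50,
    "SalaryNormalized": [40000 + 100 * i for i in range(100)],
})

# One-hot encode the categorical predictors into dummy columns.
X = pd.get_dummies(df[["Category", "ContractTime"]])
y = df["SalaryNormalized"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("R^2:", r2_score(y_test, pred))
print("MAE:", mean_absolute_error(y_test, pred))
```

Interaction effects can be added by multiplying dummy columns together, but the number of columns grows quickly, which is where the computational overhead warning applies.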
3) Perform cross-validation to verify that the inclusion (or exclusion) of any variable provides a tangible improvement to the model.
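One way to sketch this with scikit-learn's `cross_val_score`, using synthetic data and negated MAE as the score (scikit-learn's scorers are "higher is better", so MAE is negated):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the third feature has a true coefficient of zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(scale=0.1, size=100)

# 5-fold CV scores for the full model vs. the model without feature 3.
scores_full = cross_val_score(LinearRegression(), X, y,
                              scoring="neg_mean_absolute_error", cv=5)
scores_drop = cross_val_score(LinearRegression(), X[:, :2], y,
                              scoring="neg_mean_absolute_error", cv=5)
print(scores_full.mean(), scores_drop.mean())
```

If the mean CV scores are essentially equal, the extra variable is not earning its keep.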
4) Merge Location_Tree.csv onto your dataset - do any features from it improve performance?

To convert this file to a CSV, try using shell commands:

    cat Location_Tree.csv | sed 's/~/,/g' | sed 's/"//g' > Location_Tree2.csv

This command changes each ~ to a comma and removes the quote marks. Note: DO NOT redirect the output to the same file; using > on a file that exists will wipe it.
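A hedged sketch of the merge in pandas - the real Location_Tree columns and join key may differ, so the frames below are illustrative:

```python
import pandas as pd

# Illustrative frames; the real Location_Tree rows are a location
# hierarchy, and the join key in your data may be different.
jobs = pd.DataFrame({"Id": [1, 2],
                     "LocationNormalized": ["London", "Leeds"]})
tree = pd.DataFrame({"LocationNormalized": ["London", "Leeds"],
                     "Region": ["South East", "Yorkshire"]})

# Left-join so job rows without a matching location are kept.
merged = jobs.merge(tree, on="LocationNormalized", how="left")
```

A left join is the safe default here: an inner join would silently drop any job whose location is missing from the tree.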
*5) Use the regularization options available to attempt to improve the model
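A minimal sketch of regularization with scikit-learn's `Ridge` (L2 penalty) and `Lasso` (L1 penalty) on synthetic data; the penalty strengths are illustrative, not tuned:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first feature actually drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 3 + rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty shrinks all coefficients
lasso = Lasso(alpha=0.5).fit(X, y)  # L1 penalty zeros out weak ones
```

With many sparse dummy columns, the Lasso's tendency to zero out useless coefficients doubles as variable selection.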
**6) Now let's try adding some text features
<add code>
**7) If you have only been loading train.csv so far, try loading one of the larger datasets. Does the larger dataset improve performance on the held-out set? If possible, try loading an even larger file.
***8) Let's try using the largest dataset, but with a faster package:

    git clone git://github.com/JohnLangford/vowpal_wabbit.git
    cd vowpal_wabbit
    make install
Create a training file for Vowpal Wabbit using your favorite scripting language. The format should be as follows:

    <salary> | This is a text feature | Some other feature | Job Title |

Use the | character to separate features. Vowpal Wabbit will handle turning these into dummy variables automatically. Let's assume you've named your new file train.vw.
Helpful references: the Vowpal Wabbit Examples and the Vowpal Wabbit Input Validator.
Try the following:

    vw -c -k train.vw --loss_function squared -f model
    vw -c -k train.vw --loss_function squared -f model --l1 0.0001   # with L1 regularization
    vw -c -k train.vw --loss_function squared -f model --l2 0.0001   # with L2 regularization
    vw -c -k -t test.vw -i model -p test.predictions
Load test.predictions into Python to compute the MAE.
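Computing the MAE from the predictions is short; the values below are made-up stand-ins for test.predictions and the held-out salaries, which in practice you would read from disk:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins; in practice, read test.predictions and the
# held-out salaries from their files instead.
pred = pd.Series([50000.0, 60000.0, 70000.0])
actual = pd.Series([52000.0, 58000.0, 71000.0])

# Mean absolute error: average absolute deviation from the truth.
mae = np.mean(np.abs(pred - actual))
```

MAE is the competition's own metric, so this is the number to compare across all of the models above.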