
Machine learning model trained to predict what college students will be accepted into based on factors (grades, extracurriculars, essays). Data from webscraping Reddit


shreybirmiwal/college-predictor


CollegePredictor

Data analysis and modeling were performed on a large dataset of students applying to college, yielding several interesting correlations and predictive models.

Methodology

1. Finding a Dataset

The popular subreddit r/collegeresults includes many posts from high school students listing their stats and college acceptances and rejections. Here is a screenshot of a post:

image

Posts like this contain very useful data that we can extract.

2. Gathering Dataset

I first tried scraping the data with the Selenium web driver. This took a long time and was unsuccessful. Next I tried the Reddit API, but it only had a record of the 1000 most recent posts. I then tried the Pushshift API, which was broken at the time. Finally, I used the Reddit dump files and Arctic Shift, which you can find here:

https://www.reddit.com/r/pushshift/comments/1akrhg3/separate_dump_files_for_the_top_40k_subreddits/

The code used to process this is here: 1-DataPrep\1-data-collect-artic-shift
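As a rough sketch of what processing a dump can look like, the snippet below reads posts from newline-delimited JSON, one post per line. It assumes the dump has already been decompressed to that layout and that each post carries `title` and `selftext` fields. The helper name `iter_posts` and the sample lines are hypothetical, and this is not the repository's actual script.

```python
import io
import json

def iter_posts(lines):
    """Yield one dict per non-empty JSON line, skipping lines that fail to parse."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a corrupt line instead of aborting the whole run

# Hypothetical stand-in for an opened, decompressed dump file.
sample = io.StringIO(
    '{"title": "Example results post", "selftext": "Gender: Male\\nSAT: 1520"}\n'
    'not json\n'
    '{"title": "Removed post", "selftext": "[removed]"}\n'
)

# Keep only posts that still have a body to extract features from.
posts = [
    post for post in iter_posts(sample)
    if post.get("selftext") not in (None, "", "[removed]", "[deleted]")
]
```

Reading line by line keeps memory use flat, which matters when a dump is too large to load at once.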

3. Extracting Features

The data from Reddit was not clean or organized; it came in paragraph form. I used regular expressions to extract each field. For example, hooks were extracted with r"(?:Hooks \(Recruited Athlete, URM, First-Gen, Geographic, Legacy, etc.\)|Hooks):\s*(.*)".
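Here is a minimal sketch of how the hooks pattern quoted above could be applied to one post. The helper name `extract_hooks` and the sample post body are made up for illustration; the real extraction code lives in the repository.

```python
import re

# Pattern quoted in the text: matches the long template heading or plain "Hooks".
HOOKS_RE = re.compile(
    r"(?:Hooks \(Recruited Athlete, URM, First-Gen, Geographic, Legacy, etc.\)|Hooks):\s*(.*)"
)

def extract_hooks(post_text):
    """Return the text after the Hooks label, or None if the label is absent."""
    match = HOOKS_RE.search(post_text)
    return match.group(1).strip() if match else None

# Hypothetical post body, made up for illustration.
sample_post = """Gender: Male
Hooks (Recruited Athlete, URM, First-Gen, Geographic, Legacy, etc.): First-Gen
SAT: 1520"""

hooks = extract_hooks(sample_post)
```

One caveat with this pattern: `\s*` also matches newlines, so a post that leaves the Hooks field empty would capture the following line instead. Extracted values are worth spot-checking for that reason.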

At the end of this processing, we had a CSV file that looked like this:

image

The code for this can be found here: 1-DataPrep/3-extract-values

4. Standardizing Features

After step 3 we had a dataset, but the values in each column were not standardized. For example, the 'Gender' column contained "Guy", "dude", "Dude", "male", "Male", "man", and "BOY" for males alone. The RANK, GPA, and SAT columns also had to be processed.
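The gender example can be sketched as a lookup on a lowercased, trimmed value. The alias set comes from the spellings listed above. The canonical label "Male" and the function name `normalize_gender` are assumptions for illustration, not the repository's actual encoding.

```python
# Spellings listed in the text, lowercased. "Male" as the canonical label is an assumption.
MALE_ALIASES = {"guy", "dude", "male", "man", "boy"}

def normalize_gender(raw):
    """Map a free-text gender entry to a canonical label, or None if unrecognized."""
    cleaned = raw.strip().lower()
    if cleaned in MALE_ALIASES:
        return "Male"
    return None  # other categories would need their own alias sets

raw_values = ["Guy", "dude", "Dude", "male", "Male", "man", "BOY"]
normalized = [normalize_gender(value) for value in raw_values]
```

Returning None for unknown spellings, rather than guessing, makes it easy to list the leftovers and extend the alias sets by hand.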

At the end of processing (code found here: 1-DataPrep\5-encode-data), a ready-to-use CSV file was created, as shown below: image

5. Data Analysis

  1. First, I performed basic data analysis on each university by extracting the following features:
  • Total Applications
  • Acceptances, Waitlists, Rejections
  • Average SAT, extracurricular, and essay scores for each of the accepted, waitlisted, and rejected groups

image Above are the generated columns.
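The per-university summary described above can be sketched as a group-and-aggregate pass. The rows, column names, and the `summarize` helper below are hypothetical, and only the SAT average is shown; extracurricular and essay scores would follow the same pattern.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows; column names and values are made up for illustration.
rows = [
    {"school": "MIT", "decision": "Accepted", "sat": 1580},
    {"school": "MIT", "decision": "Rejected", "sat": 1500},
    {"school": "MIT", "decision": "Rejected", "sat": 1540},
    {"school": "TAMU", "decision": "Accepted", "sat": 1400},
    {"school": "TAMU", "decision": "Waitlisted", "sat": 1350},
]

def summarize(rows):
    """Count decisions and average SAT per school and decision group."""
    sat_by_group = defaultdict(list)
    for row in rows:
        sat_by_group[(row["school"], row["decision"])].append(row["sat"])

    summary = defaultdict(lambda: {"total": 0, "counts": {}, "avg_sat": {}})
    for (school, decision), sats in sat_by_group.items():
        entry = summary[school]
        entry["total"] += len(sats)          # total applications to this school
        entry["counts"][decision] = len(sats)  # applications with this decision
        entry["avg_sat"][decision] = mean(sats)
    return dict(summary)

summary = summarize(rows)
```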

  2. Second, I extracted correlations between numerical features using a correlation matrix. You can view the matrix plots in the 2-visualize/res folder.

HARVARD Correlation Matrix Above is a correlation matrix from Harvard
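For reference, each cell of a correlation matrix like the one above is a Pearson coefficient between two numeric columns. This is a minimal pure-Python sketch with made-up data. The column names are hypothetical, and the repository may well compute this with a data-frame library instead.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical numeric columns for one school.
columns = {
    "sat": [1400, 1450, 1500, 1550, 1600],
    "essays": [2, 3, 3, 4, 5],
    "extracurriculars": [3, 2, 4, 4, 5],
}
matrix = correlation_matrix(columns)
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates none. A "30% correlation" in the findings corresponds to a coefficient of 0.30.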

  3. Third, I measured the association between each categorical feature and the outcome using a chi-squared (χ²) test. I only compared each feature against the target (decision).

image Above are the generated columns of p-values for each categorical feature.
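To make the test concrete, here is a sketch of Pearson's chi-squared test of independence for the simplest case, a 2x2 table (for example, hook or no hook against accepted or rejected). With one degree of freedom the p-value has a closed form through `math.erfc`. Features with more categories need the general chi-squared distribution, which a statistics library provides. The counts are invented, and no continuity correction is applied.

```python
from math import erfc, sqrt

def chi2_2x2(table):
    """Chi-squared independence test for a 2x2 table of counts.

    Returns (statistic, p_value). Assumes every expected count is non-zero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected

    # For 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows = (has hook, no hook), columns = (accepted, rejected).
stat, p_value = chi2_2x2([[30, 10], [20, 40]])
```

A small p-value means the observed counts would be unlikely if the feature and the decision were independent.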

  4. Fourth, I used partial dependence plots (PDPs) to see how each feature affects the decision.

image Above is an example PDP from USCS. You can see that higher essay, SAT, and extracurricular scores are associated with a positive decision outcome.

  5. Fifth, I performed PCA and trained an ML model on the data. Since many of the features are not linear (for example, State), I opted for a Random Forest Classifier. Accuracy varied by school, from 21% to 94%. Many of the schools were above 65%, which suggests that it is indeed possible to train a machine learning model to predict admissions outcomes. This leaves us with a question: do colleges make admission decisions using AI?

image

Above is an interesting image produced by running PCA on a dataset of MIT applications. There is a clear difference between accepted and denied students, furthering my claim that a computer can differentiate students solely from self-reported grades, demographics, and essays.
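The random forest step described above can be sketched with scikit-learn as follows. Everything about the data here is an assumption: the three features, the synthetic acceptance rule, and the hyperparameters are invented so the example runs on its own. The repository's real features and settings may differ.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

FEATURES = ["sat", "essays", "extracurriculars"]

# Synthetic applicants, generated deterministically for illustration only.
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    sat = rng.randint(1200, 1600)
    essays = rng.randint(1, 5)
    ecs = rng.randint(1, 5)
    X.append([sat, essays, ecs])
    # Made-up acceptance rule so the model has a pattern to learn.
    y.append(1 if sat + 40 * essays + 40 * ecs > 1700 else 0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Held-out accuracy, comparable in spirit to the per-school accuracies above.
accuracy = model.score(X_test, y_test)

# Impurity-based importances; they sum to 1 across features.
importances = dict(zip(FEATURES, model.feature_importances_))
```

Scoring on a held-out split rather than the training rows is what makes an accuracy figure meaningful. The importances are the kind of per-feature percentages that a feature importance table reports.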

Interesting Findings

I posted these on my Twitter, which you can find here: https://x.com/shreybirmiwal/status/1819460900346384763

#1 The SAT score doesn't "really" matter

According to the feature importance table (generated after fitting a random forest model), SAT accounts for only about 16% of feature importance at most schools. On the high end, it accounts for about 39% at UT-Dallas.

image

Furthermore, beyond a certain point (an SAT above ~1500), the score stops mattering much: the difference between a 1520 and a 1540 is negligible in the eyes of an admissions office. The partial dependence plots for USCS show the SAT's effect on acceptance weakening once the SAT is above ~1550.

#2 The ivies care about hooks

The graph below shows the log inverse of the p-value from a chi-squared test between hooks (legacy, minority status, etc.) and decision.

image

There is a clear, statistically significant association between hooks and decision.
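The "log inverse p-value" used in these charts can be sketched as below. The base of the logarithm is not stated in this write-up, so base 10 is an assumption here, and the function name is hypothetical.

```python
from math import log10

def log_inverse_p(p_value):
    """Return log10(1 / p). Larger values mean a smaller, more significant p-value."""
    return -log10(p_value)

# Under the base-10 assumption, the usual p = 0.05 cutoff sits at about 1.3.
threshold = log_inverse_p(0.05)
```

Plotting this transform instead of the raw p-value spreads out very small p-values, so stronger associations show up as taller bars.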

#3 Top schools have top SAT. Is it correlation?

It is no surprise that the top schools have a very high average SAT: image

Is this just correlation rather than causation? I think it is not just correlation: image

Notice that the Caltech correlation matrix in the image above shows a 30% correlation between SAT and extracurriculars. A high SAT likely means you work very hard and also have good essays, extracurriculars, recommendations, etc.

#4 Higher income means better college outcomes

image

Take a look at the correlation matrix from TAMU. Income was:

  • correlated 29% with better essays
  • correlated 15% with better extracurriculars
  • correlated 12% with higher SAT

I suspect higher income means better tutors, more internships and research opportunities, and more essay reviews.

#5 It does matter what major you apply for

image UT-Austin guarantees admission to some majors if you are in the top 6% of your class, but not to all of them (engineering, for example). This is why UT weighs major heavily, while Yale, GTech, and others hardly do. This is evident from UT's high log inverse p-value of major against decision.

#6 You can get into a state school even if you don't live in that state

image The avg. log inverse p-value for residence is:

  • 3.25 for public schools
  • 1.48 for private schools

That is only a 2.19x difference, not as crazy as people make it seem. Some schools actually show equal preference.

Acknowledgements

Thanks for reading! I learned a lot about statistics, data processing, regex, feature engineering, correlations and data analysis, and machine learning modeling with random forests!
