I performed data analysis and modeling on a large dataset of students applying to college, finding interesting correlations and building predictive models.
The popular subreddit r/collegeresults includes many posts from high school students listing their stats and college acceptances/rejections. Here is a screenshot of a post:
As you can see, these posts contain very useful data that we can extract.
I first tried web scraping the data using the Selenium WebDriver. This took a long time and was unsuccessful. I next tried the Reddit API; however, it only had a record of the 1000 most recent posts. Next, I tried the Pushshift API, but it was broken at the time. Lastly, I used the Reddit dump files and Arctic Shift, which you can find here:
https://www.reddit.com/r/pushshift/comments/1akrhg3/separate_dump_files_for_the_top_40k_subreddits/
The code used to process this is here: 1-DataPrep\1-data-collect-artic-shift
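As a rough sketch of this step (not the actual script in the repo): assuming the dump file has already been decompressed to newline-delimited JSON with one submission per line, and that each record carries Reddit's `title` and `selftext` fields, the usable posts can be filtered like this. The function name `iter_posts` and the sample records are made up for illustration.

```python
import io
import json


def iter_posts(lines):
    """Yield (title, selftext) for each usable submission in an NDJSON stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the whole run
        if not isinstance(record, dict):
            continue
        body = record.get("selftext") or ""
        # Removed or deleted posts carry no stats, so drop them (assumption).
        if not body.strip() or body in ("[removed]", "[deleted]"):
            continue
        yield record.get("title", ""), body


# Hypothetical sample standing in for a decompressed dump file.
sample = io.StringIO(
    '{"title": "Post A", "selftext": "Gender: Male"}\n'
    '{"title": "Post B", "selftext": "[removed]"}\n'
    'not json\n'
)
posts = list(iter_posts(sample))
print(posts)
```

For a real file, the same function can be fed an open file handle instead of the `io.StringIO` sample.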
The data from Reddit was not clean and organized. It came in paragraph form.
I opted to use regex to extract the data. For example, the pattern for hooks was: r"(?:Hooks \(Recruited Athlete, URM, First-Gen, Geographic, Legacy, etc.\)|Hooks):\s*(.*)"
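Here is a minimal sketch of how the hooks pattern above can be applied to one post. The helper name `extract_hooks` and the sample post text are hypothetical; only the regex itself comes from the project.

```python
import re

# The hooks pattern quoted above; (.*) captures the rest of the line.
HOOKS_RE = re.compile(
    r"(?:Hooks \(Recruited Athlete, URM, First-Gen, Geographic, Legacy, etc.\)|Hooks):\s*(.*)"
)


def extract_hooks(post_text):
    """Return the text following the Hooks label, or None if the label is absent."""
    match = HOOKS_RE.search(post_text)
    return match.group(1).strip() if match else None


# Made-up post body in the r/collegeresults template style.
sample_post = (
    "Gender: Male\n"
    "Hooks (Recruited Athlete, URM, First-Gen, Geographic, Legacy, etc.): First-Gen\n"
    "Intended Major: CS"
)
print(extract_hooks(sample_post))
```

Because `.` does not match newlines by default, the capture stops at the end of the Hooks line rather than swallowing the rest of the post.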
At the end of this processing, we had a CSV file that looked like this:
The code for this can be found here: 1-DataPrep/3-extract-values
After step 3, we had a dataset, but the values were not standardized. For example, in the 'Gender' column, males alone appeared as "Guy", "dude", "Dude", "male", "Male", "man", and "BOY". In addition, the RANK, GPA, and SAT columns had to be processed.
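The cleanup described above can be sketched roughly as follows. This is an illustration under assumptions, not the repo's encoding script: the female synonyms, the "unknown" fallback, and the 400-1600 SAT range check are my own choices, and the helper names are hypothetical.

```python
import re

# Male synonyms are the ones seen in the data; the female set is an assumption.
MALE = {"guy", "dude", "male", "man", "boy"}
FEMALE = {"girl", "female", "woman"}


def normalize_gender(raw):
    """Map free-text gender values onto a small fixed vocabulary."""
    token = raw.strip().lower()
    if token in MALE:
        return "male"
    if token in FEMALE:
        return "female"
    return "unknown"


def parse_sat(raw):
    """Return the first standalone number in the valid SAT range, else None."""
    for number in re.findall(r"\b\d{3,4}\b", raw):
        score = int(number)
        if 400 <= score <= 1600:
            return score
    return None


print(normalize_gender("BOY"), parse_sat("1520 (780M, 740RW)"))
```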
At the end of processing (code found here: 1-DataPrep\5-encode-data), a ready-to-use CSV file was created, as shown below:
- First, I performed basic data analysis on each university by extracting the following features:
- Total Applications
- Acceptances, Waitlists, Rejections
- Average [Accepted, Waitlisted, Rejected] [SAT, Extracurriculars, Essays]
Above are the columns generated
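A per-university summary like the one listed above could be computed with pandas along these lines. The column names (`school`, `decision`, `sat`) and the rows are invented for the sketch; the real CSV may be laid out differently.

```python
import pandas as pd

# Made-up rows; the real data comes from the encoded CSV.
df = pd.DataFrame({
    "school": ["MIT", "MIT", "MIT", "Yale", "Yale"],
    "decision": ["accepted", "rejected", "rejected", "accepted", "waitlisted"],
    "sat": [1580, 1500, 1540, 1560, 1520],
})

# Count each decision type per school, then add the total applications.
counts = pd.crosstab(df["school"], df["decision"])
counts["total"] = counts.sum(axis=1)

# Average SAT per school, split by decision.
avg_sat = df.pivot_table(
    index="school", columns="decision", values="sat", aggfunc="mean"
).add_prefix("avg_sat_")

summary = counts.join(avg_sat)
print(summary)
```

The same `pivot_table` call can be repeated for the essay and extracurricular ratings.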
- Second, I extracted correlations between numerical features using a correlation matrix. You can view the matrix plots in the 2-visualize/res folder.
Above is a correlation matrix from Harvard
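For reference, a correlation matrix of this kind can be produced with `DataFrame.corr`. The sketch below uses invented numbers and hypothetical column names, and assumes Pearson correlation, which is the pandas default.

```python
import pandas as pd

# Invented values for one school; decision is encoded as 1 = accepted.
df = pd.DataFrame({
    "sat": [1400, 1450, 1500, 1550, 1600],
    "essays": [2, 3, 3, 4, 5],
    "extracurriculars": [1, 3, 2, 4, 5],
    "decision": [0, 0, 1, 1, 1],
})

# Pairwise Pearson correlation between every pair of numeric columns.
corr = df.corr(method="pearson")
print(corr.round(2))
```

A "30% correlation" in the sections below corresponds to an entry of 0.30 in a matrix like this.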
- Third, I extracted the correlations between categorical features using a chi-squared test. I only compared each feature against the target (decision).
Above are the generated columns of p-values for each categorical feature.
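A minimal sketch of one such test, using `scipy.stats.chi2_contingency` on a contingency table of a categorical feature against the decision. The data is invented, and reading "log inverse p-value" as -log10(p) is an assumption on my part (the base of the log is not stated).

```python
import math

import pandas as pd
from scipy.stats import chi2_contingency

# Invented sample: 20 applicants with a hook, 20 without.
df = pd.DataFrame({
    "hooks": ["yes"] * 20 + ["no"] * 20,
    "decision": ["accepted"] * 15 + ["rejected"] * 5
    + ["accepted"] * 5 + ["rejected"] * 15,
})

# Rows are feature values, columns are decision outcomes.
table = pd.crosstab(df["hooks"], df["decision"])
chi2, p_value, dof, expected = chi2_contingency(table)

# Assumed definition of the "log inverse p-value": larger means more significant.
log_inv_p = -math.log10(p_value)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, -log10(p)={log_inv_p:.2f}")
```

Note that `chi2_contingency` applies Yates' continuity correction by default for 2x2 tables, which makes the test slightly more conservative.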
- Fourth, I used partial dependence plots (PDPs) to see how each feature affects the decision.
Above is an example PDP from USCS. You can see how higher essay ratings, SAT scores, and extracurricular ratings are associated with a positive decision outcome.
- Fifth, I performed PCA and created an ML model trained on the data. Since many of the features are not linear (for example, State), I opted for a Random Forest Classifier. Across schools, the accuracy ranged from 21% to 94%. Many of the schools were above 65%, suggesting that it is indeed possible to train a machine learning model to predict admissions outcomes. This leaves us with a question: do colleges make admission decisions using AI?
Above is an interesting image from running PCA on a dataset of MIT applications. There is a clear difference between accepted and denied students, furthering my claim that students can be differentiated by a computer solely from self-reported grades, demographics, and essays.
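The modeling step can be sketched roughly like this. Everything here is an assumption for illustration: the synthetic data, the column names, the one-hot encoding of State, the 75/25 split, and the hyperparameters are not taken from the repo.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "sat": rng.integers(1200, 1601, size=n).astype(float),
    "essays": rng.integers(1, 6, size=n).astype(float),
    "state": rng.choice(["TX", "CA", "NY"], size=n),
})
# Synthetic labels: 1 = accepted, driven by SAT and essays plus noise.
score = (df["sat"] - 1200) / 400 + df["essays"] / 5
df["decision"] = (score + rng.normal(0, 0.15, size=n) > 1.0).astype(int)

# One-hot encode the categorical State column so the model can use it.
X = pd.get_dummies(df[["sat", "essays", "state"]], columns=["state"], dtype=float)
y = df["decision"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # fraction of held-out rows predicted correctly

# Project standardized features onto two principal components for plotting.
components = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
print(f"accuracy={accuracy:.2f}, components shape={components.shape}")
```

Scattering `components` colored by `decision` gives a picture like the MIT one. Accuracy is best compared against the share of the majority class, since a school that rejects most applicants gives a high score to a model that always predicts "rejected".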
I posted these on my Twitter, which you can find here: https://x.com/shreybirmiwal/status/1819460900346384763
According to the feature importance table (generated after creating a random forest model), the SAT accounts for only about 16% of the feature importance at most schools. On the high end, it accounts for about 39% at UT-Dallas.
Furthermore, after a certain point (SAT above ~1500), it does not matter. A 1520 vs. a 1540 is negligible in the eyes of a college admissions office. The partial dependence plots for USCS show a decrease in SAT-acceptance dependence/correlation when SAT is above ~1550.
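For context on what these percentages mean, here is a sketch of reading importances off a random forest in scikit-learn. The data is synthetic and the feature names are hypothetical. `feature_importances_` is impurity-based and normalized to sum to 1, so a value of 0.16 is the feature's share of the model's total impurity reduction, not a direct statement about how a school weighs that feature.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "sat": rng.normal(1450, 100, n),
    "essays": rng.normal(3, 1, n),
    "noise": rng.normal(0, 1, n),  # carries no information about the label
})
# Synthetic labels: essays weigh twice as much as SAT, noise not at all.
y = (
    (X["sat"] - 1450) / 100 + 2 * (X["essays"] - 3) + rng.normal(0, 0.5, n) > 0
).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# One importance per column, sorted from most to least important.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.round(3))
```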
The graph below shows the log inverse of the p-value of a chi-squared test between hooks (legacy, minorities, etc.) and decision.
It is clear that there is a statistically significant correlation between hooks and decision.
No surprise: the top schools have very high average SAT scores:
Is this just correlation and not causation? I think it is more than correlation:
Notice that the Caltech correlation matrix in the image above shows a 30% correlation between SAT and extracurriculars. A high SAT likely means you try very hard and have good essays, extracurriculars, recommendations, etc.
Take a look at the correlation matrix from TAMU. Income was:
- correlated 29% with better essays
- correlated 15% with better extracurriculars
- correlated 12% with higher SAT
I suspect higher income means better tutors, more internships and research, and more essay reviews.
UT-Austin guarantees acceptance for some majors if you are in the top 6% of your class, but not for all majors (engineering, for example). This is why we see that UT heavily considers major, while Yale, GTech, etc. hardly do. This is evident from the high log inverse p-value of major against decision.
The avg. log inverse p-value for residence is:
- 3.25 for public schools
- 1.48 for private schools
That is only about a 2.2x difference, not as crazy as people make it seem. Some schools actually have equal preferences.
Thanks for reading! I learned a lot about statistics, data processing, regex, feature engineering, correlations and data analysis, and machine learning modeling with random forests!