CollegePredictor

Data analysis and modeling were performed on a large dataset of students applying to college, yielding interesting correlations and predictive models.

Methodology

1. Finding a Dataset

The popular subreddit r/collegeresults includes many posts from high school students listing their stats and college acceptances/rejections. Here is a screenshot of a post:

As you can see, each post contains very useful data that we can extract.

2. Gathering Dataset

I first tried scraping the data with the Selenium WebDriver, which took a long time and was unsuccessful. Next I tried the Reddit API, but it only exposes the 1000 most recent posts. Then I tried the Pushshift API, but it was broken at the time. Finally, I used the Reddit dump files and Arctic Shift, which you can find here:

https://www.reddit.com/r/pushshift/comments/1akrhg3/separate_dump_files_for_the_top_40k_subreddits/

The code used to process this is here: 1-DataPrep/1-data-collect-artic-shift
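As a rough illustration of this step, here is a minimal sketch of filtering a line-delimited JSON export down to usable posts. It assumes one JSON object per line with the standard Reddit submission fields `subreddit`, `title`, and `selftext`; the function name `iter_posts` and the sample lines are made up for the example, and the repo's actual code may differ.

```python
import io
import json

def iter_posts(lines, subreddit="collegeresults"):
    """Yield (title, selftext) for usable submissions from one subreddit."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            post = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the run
        if post.get("subreddit", "").lower() != subreddit:
            continue
        body = post.get("selftext", "")
        if body in ("", "[deleted]", "[removed]"):
            continue  # nothing left to extract from these posts
        yield post.get("title", ""), body

# Made-up sample standing in for a real dump file.
sample = io.StringIO(
    '{"subreddit": "collegeresults", "title": "CS major results", "selftext": "Gender: Male"}\n'
    '{"subreddit": "collegeresults", "title": "gone", "selftext": "[removed]"}\n'
    'not json\n'
)
posts = list(iter_posts(sample))
print(posts)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle.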

3. Extracting Features

The data from Reddit was not clean or organized; it came in paragraph form. I opted to use regular expressions to extract each field, for example r"(?:Hooks \(Recruited Athlete, URM, First-Gen, Geographic, Legacy, etc.\)|Hooks):\s*(.*)" for the hooks.
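To show how a pattern like this pulls a value out of a post, here is a small sketch using the hooks regex quoted above. The helper name `extract_hooks` and the sample post text are made up for illustration; real posts may have extra markdown around the labels that this does not handle.

```python
import re

# The hooks pattern quoted in the README, compiled once for reuse.
HOOKS_RE = re.compile(
    r"(?:Hooks \(Recruited Athlete, URM, First-Gen, Geographic, Legacy, etc.\)|Hooks):\s*(.*)"
)

def extract_hooks(post_text):
    """Return the text after the 'Hooks' label, or None if the label is missing."""
    match = HOOKS_RE.search(post_text)
    return match.group(1).strip() if match else None

# Made-up post body in the subreddit's template style.
sample_post = (
    "Gender: Male\n"
    "Hooks (Recruited Athlete, URM, First-Gen, Geographic, Legacy, etc.): First-Gen\n"
    "Intended Major(s): CS"
)
hooks = extract_hooks(sample_post)
print(hooks)
```

Because `.` does not match newlines by default, the capture group stops at the end of the line holding the label.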

At the end of this processing, we had a CSV file that looked like this:

The code for this can be found here: 1-DataPrep/3-extract-values

4. Standardizing Features

After step 3 we had a dataset, but the values were inconsistent. For example, the 'Gender' column contained "Guy", "dude", "Dude", "male", "Male", "man", and "BOY" for males alone. In addition, the RANK, GPA, and SAT values had to be processed.
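One simple way to collapse such variants is a lookup table applied after lowercasing. The sketch below is only an illustration: the male spellings come from the text above, while the female aliases, the "unknown" fallback, and the names `GENDER_ALIASES` and `normalize_gender` are assumptions.

```python
GENDER_ALIASES = {
    # Spellings listed in the README, lowercased.
    "male": {"guy", "dude", "male", "man", "boy"},
    # Hypothetical: the README does not list the female spellings.
    "female": {"girl", "female", "woman"},
}

def normalize_gender(raw):
    """Map a free-text gender value to a canonical label."""
    cleaned = raw.strip().lower()
    for label, aliases in GENDER_ALIASES.items():
        if cleaned in aliases:
            return label
    return "unknown"  # assumed fallback for anything unrecognized

raw_values = ["Guy", "dude", "Dude", "male", "Male", "man", "BOY"]
normalized = [normalize_gender(value) for value in raw_values]
print(normalized)
```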

At the end of processing (code found here: 1-DataPrep/5-encode-data), a ready-to-use CSV file was created, as shown below:

5. Data Analysis

First, I performed basic data analysis on each university by extracting the following features:

Total Applications
Acceptances, Waitlists, Rejections
Average [SAT, Extracurriculars, Essays] for each outcome [Accepted, Waitlisted, Rejected]

Above are the columns generated
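The per-school summary could be computed along these lines. This is a sketch on made-up rows; the keys `school`, `decision`, and `sat` are assumed names, not necessarily the columns in the real CSV, and only the SAT average is shown.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows standing in for the encoded CSV.
rows = [
    {"school": "A", "decision": "accepted", "sat": 1550},
    {"school": "A", "decision": "rejected", "sat": 1480},
    {"school": "A", "decision": "accepted", "sat": 1530},
    {"school": "B", "decision": "waitlisted", "sat": 1500},
]

def summarize(rows):
    """Count applications and average SAT per school and decision."""
    by_school = defaultdict(lambda: defaultdict(list))
    for row in rows:
        by_school[row["school"]][row["decision"]].append(row["sat"])
    summary = {}
    for school, decisions in by_school.items():
        summary[school] = {
            "total": sum(len(scores) for scores in decisions.values()),
            "counts": {d: len(scores) for d, scores in decisions.items()},
            "avg_sat": {d: mean(scores) for d, scores in decisions.items()},
        }
    return summary

summary = summarize(rows)
print(summary["A"])
```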

Second, I extracted correlations between numerical features using a correlation matrix. You can view the matrix plots in the 2-visualize/res folder.

Above is a correlation matrix from Harvard
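For reference, each cell of such a matrix is a pairwise correlation coefficient. The sketch below assumes Pearson correlation (the README does not say which coefficient was used) and runs on made-up numbers with hypothetical column names.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Build a nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up values; the real data has one row per applicant.
columns = {
    "sat": [1400, 1450, 1500, 1550, 1600],
    "essays": [2, 3, 3, 4, 5],
    "income": [5, 1, 4, 2, 3],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```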

Third, I measured the association between categorical features and the outcome using a chi-squared test. I only compared each feature against the target (decision).

Above are the generated columns of p-values for each categorical feature.
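To make the test concrete, here is a sketch of a Pearson chi-squared test on a 2x2 table (feature present or absent versus accepted or rejected). It assumes no continuity correction and relies on the one-degree-of-freedom shortcut for the p-value, so it only applies to 2x2 tables; the counts are made up.

```python
from math import erfc, sqrt

def chi_squared_2x2(table):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the feature and the decision were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

# Rows: has hook / no hook. Columns: accepted / rejected. Made-up counts.
table = [[30, 20], [20, 80]]
chi2, p_value = chi_squared_2x2(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

Features with more than two categories need the general chi-squared distribution, which a statistics library would normally provide.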

Fourth, I used partial dependence plots (PDPs) to see how each feature affects the decision.

Above is an example PDP from USCS. You can see that higher essay, SAT, and extracurricular scores are associated with a positive decision outcome.
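A partial dependence curve can be computed by hand: force one feature to each value on a grid, keep every other feature as observed, and average the model's predictions. The sketch below uses a hypothetical `toy_model` in place of a trained classifier, with made-up rows and coefficients, purely to show the mechanics.

```python
from math import exp

def toy_model(row):
    """Hypothetical stand-in for a trained classifier: returns P(accept)."""
    score = 0.01 * (row["sat"] - 1500) + 0.8 * (row["essays"] - 3)
    return 1 / (1 + exp(-score))

def partial_dependence(model, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        # Override one feature, leave the rest of each row untouched.
        preds = [model({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve

# Made-up applicants.
rows = [
    {"sat": 1400, "essays": 2},
    {"sat": 1500, "essays": 3},
    {"sat": 1580, "essays": 5},
]
sat_grid = [1300, 1400, 1500, 1600]
curve = partial_dependence(toy_model, rows, "sat", sat_grid)
for sat, avg in zip(sat_grid, curve):
    print(sat, round(avg, 3))
```

Plotting `curve` against `sat_grid` gives the kind of plot described above.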

Fifth, I performed PCA and created an ML model trained on the data. Since many of the features are not linear (for example State), I opted for a Random Forest Classifier. Accuracy varied by school, from 21% to 94%. Many of the schools were above 65%, suggesting that it is indeed possible to train a machine learning model to predict admissions outcomes. It leaves us with a question: do colleges make admission decisions using AI?
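Here is a minimal sketch of training and scoring a random forest with scikit-learn. Everything about the data is an assumption: the features, the synthetic acceptance rule, and the hyperparameters are invented for the example and say nothing about real admissions or about the repo's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic applicants (not real data).
rng = np.random.default_rng(0)
n = 400
sat = rng.integers(1200, 1601, size=n)
essays = rng.integers(1, 6, size=n)
in_state = rng.integers(0, 2, size=n)
# Invented rule: accepted only with a high SAT and decent essays.
accepted = ((sat > 1450) & (essays >= 3)).astype(int)

X = np.column_stack([sat, essays, in_state])
X_train, X_test, y_train, y_test = train_test_split(
    X, accepted, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Accuracy on held-out rows, and importances that sum to 1.
accuracy = model.score(X_test, y_test)
importances = dict(zip(["sat", "essays", "in_state"], model.feature_importances_))
print(f"accuracy: {accuracy:.2f}")
print({name: round(float(value), 2) for name, value in importances.items()})
```

Since the importances sum to 1, a value such as 0.16 for a feature can be read as a 16% share of the model's total importance.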

Above is an interesting image produced by running PCA on a dataset of MIT applications. There is a clear separation between accepted and denied students, supporting my claim that students can be differentiated by a computer solely from self-reported grades, demographics, and essays.
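For readers unfamiliar with PCA, the sketch below projects standardized features onto the first two principal components using NumPy's SVD. The data is random and the function name `pca_2d` is made up; standardizing the columns first is an assumption about the preprocessing.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return Z @ Vt[:2].T, explained[:2]

# Random stand-in data: 50 applicants, 4 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=50)  # two correlated columns
points, explained = pca_2d(X)
print(points.shape, explained.round(2))
```

Scattering `points` and coloring each one by decision gives a plot of the kind described above.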

Interesting Findings

I posted these on my Twitter, which you can find here: https://x.com/shreybirmiwal/status/1819460900346384763

#1 The SAT score doesn't "really" matter

According to the feature importance table (generated after fitting a random forest model), for most schools the SAT accounts for only about 16% of the total feature importance. On the high end, UT-Dallas weights the SAT at about 39%.

Furthermore, after a certain point (SAT above ~1500), it does not matter. A 1520 vs a 1540 is negligible in the eyes of college admissions. The partial dependence plots for USCS show a decrease in SAT-acceptance dependence/correlation when SAT is above ~1550.

#2 The ivies care about hooks

The graph below shows the log inverse of the p-value of a chi-squared test between hooks (legacy, minorities, etc.) and decision.

There is a clear, statistically significant association between hooks and decision.

#3 Top schools have top SAT. Is it correlation?

Unsurprisingly, the top schools have very high average SAT scores:

Is this just correlation rather than causation? I think it is not just correlation:

Notice that the Caltech correlation matrix in the image above shows a 30% correlation between SAT and extracurriculars. A high SAT likely means you try very hard and have good essays, extracurriculars, recommendations, etc.

#4 Higher income means better college outcomes

Take a look at the correlation matrix from TAMU. Income was:

correlated 29% with better essays
correlated 15% with better extracurriculars
correlated 12% with higher SAT

I suspect that higher income means better tutors, more internships and research, and more essay reviews.

#5 It does matter what major you apply for

UT-Austin guarantees admission to some majors if you are in the top 6% of your class, but not to all majors (engineering, for example). This is why UT weighs major heavily, while Yale, GTech, etc. hardly do. This is evident from the high log inverse p-value of major against decision.

#6 You can get into a state school even if you don't live in that state

The avg. log inverse p-value for residence is:

3.25 for public schools
1.48 for private schools

That is only about a 2.2x difference, not as extreme as people make it seem. Some schools actually show equal preference.
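The arithmetic behind this comparison can be checked in a few lines. The sketch assumes "log inverse p-value" means -log10(p); the README does not state the base, so the implied p-values are only illustrative. Averaging log scores also corresponds to a geometric mean of the underlying p-values, not an arithmetic one.

```python
public_avg = 3.25   # average score for public schools, from the README
private_avg = 1.48  # average score for private schools, from the README

# Ratio of the two average scores.
ratio = public_avg / private_avg

def p_from_neg_log10(value):
    """Invert -log10(p); the base-10 choice is an assumption."""
    return 10 ** (-value)

public_p = p_from_neg_log10(public_avg)
private_p = p_from_neg_log10(private_avg)
print(f"ratio of scores: {ratio:.2f}")
print(f"implied p-values: public {public_p:.5f}, private {private_p:.3f}")
```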

Acknowledgements

Thanks for reading! I learned a lot about statistics, data processing, regex, feature engineering, correlations and data analysis, and machine learning modeling with random forests!
