GitHub

Motivation

This research was conducted as part of my Master's Thesis for the MA Computational Social Science at the University of Chicago. I am linguist who speaks 12 languages and has been interested in studying language computationally. I am also a social scientist who loves how weird we humans are, and tries to study their psychology. So this project provided the perfect confluence of my interests- using topic models on dating profiles to understand the social norms around online dating.

The primary research question is whether men who may have faced some disadvantages in dating in the real life (due to their height, weight, race or low education) find ways to strategically market themselves on their dating profiles. In what ways are they different form normal profiles? What is a 'normal' profile anyway? Luckily, we finally have the tools of data science to begin answering some age-old questions on love and romance.

Installations

This codes runs on Python 3.6 for the majority of the analysis, and then switches to R for the final component of STMs (Structural Topic Models). So the required libraries/ packages are listed below for ease of understanding. You can install of the Python modules using

pip install -r requirements.txt

The cleaning of the core text essays is far from a trivial task. Splitting the conjoined words required a total of 79 hours on a 4 Ghz processor for the 20,000+ profiles. I would strongly recommend doing this task in parallel, since none of the individual profiles' results depend on previous profiles. For this purpose, I have leveraged the mrjob framework in conjunction with AWS, which parallelizes and thus completes the task much faster.

Python

scikit-learn
gensim
wordninja
pyspellchecker
beautifulsoup
spacy
textacy
tqdm
nltk
mrjob

R

stm
quanteda
tidyverse

File Descriptions

The repository contains a mix of .py files for utilies, Jupyter notebooks for easy visualization, and an R notebook for the component of the analysis that only exists in R libraries. These files are:

Python Modules (.py)

split_utils.py: Contains functions to help split conjoined words in the dating profile essays
filter_utils.py: Contains functions to standardize some of the categorical variables of interest
text_complexity_utils.py: Contains functions mainly developed from the SpaCy library to analyze the difficulty level of vocabulary and sentence construction
nmf_visuals_utils.py: Contains functions required in conjunction with sklearn to generate the visualizations of topics under the Non-Negative Matrix Factorization topic model. This code largely builds off the seminal work of Shishido et al (2016) and the Computational Social Science workshop at the University of Michigan (See references)

Jupyter Notebooks

data_filter.ipynb- Inputs data, drops missing values and irrelevant variables. Removes unwanted characters from essays, as well as fixing the problem of conjoined words (using split_utils.py) Filters for the population of interest (using filter_utils.py). Also calculates new variables using text_complexity.py
profile_length_analysis.ipynb- Checks for the influence of the 4 categorical variables on profile length
doc2vec_model.ipynb- Converts filtered essay data into vector space representation. Then attempts to find clusters using K-Means and DBScan on data after its dimensionality has been reduced with Principal Components Analysis and t-SNE.
nmf_visualize.ipynb- Calculates an NMF model and then visualizes differences in the proportions of profiles using the generated topics

R Notebooks

quant_eda_stm.rmd- Contains code to perform Latent Dirichlet Analysis (LDA) and build a Structural Topic Model over it

sav files

nmf.sav - The Jupyter notebooks provide all the necessary code to generate the tables and analyses in the blogpost. However, you can skip a few steps and load the model directly from this file.

Data sets: OkCupid Data

How to Interact With the Project

You can install the necessary libraries as shown above, and download the raw data using the link in the next section.
The full dataset is not very large (compared to other datasets frequently being used in text analysis). Make sure you unzip the folder with the folder name unchanged 'profiles.csv', and unzip it to one step back in your directory (not the same folder as your main analysis). This will ensure compatibility with the code posted in the notebooks. You will need to run the Data_Filter.ipynb file first, but are free to run the other analyses in any order.

Licensing, Authors, Acknowledgements, etc.

This project owes significant debts to previous work. These may be found in the work of the following authors, whose contributions have been duly referenced:

It also builds off the initial explorations conducted in Fall 2019 as a team project with my classmates, which can be accessed at its own repository here: Unsupervised Dating

Key Findings:

On the whole, single heterosexual men in the data write profiles of about the same length, with roughly the same use of language, spread over same 5 topics, with the first 3 (the area, hobbies and sociability) taking up most (82%) of it. The differences that emerge for different backgrounds (race, education, fitness and height) only start emerging in the remaining18%.
There does exist some group of men whose choice of words varies significantly from everyone else. But we can't explain this group purely in terms of the four variables
Height and fitness level have the least impact on differences of topic choice, while black ethnicity has the highest affect across all topics
One's love for novelty appears to be the topic most affected by the categories of our chosen variable.

Explore the full analysis either in this fun Medium blogpost or for a more rigorous, academic approach that draws on a variety of related research, you can dig into the thesis here

Future Directions

Based on this experience of porting across Python and R to implement structural models, I resolved to build in the capability of Structural Topic Models into gensim itself. I will soon be submitting a pull request specifically towards its topic modelling classes.

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
.ipynb_checkpoints		.ipynb_checkpoints
__pycache__		__pycache__
.gitattributes		.gitattributes
Data_Filter.ipynb		Data_Filter.ipynb
Doc2Vec_Modelling_tsne.ipynb		Doc2Vec_Modelling_tsne.ipynb
NMF_Visualization.ipynb		NMF_Visualization.ipynb
README.md		README.md
coherenceplot_STM.png		coherenceplot_STM.png
ethnicity_treemap.png		ethnicity_treemap.png
full_treemap.png		full_treemap.png
lda_11_outputs.png		lda_11_outputs.png
nmf_model.sav		nmf_model.sav
nmf_utils.py		nmf_utils.py
profiles_filtered.csv		profiles_filtered.csv
quanteda_lda_stm.Rmd		quanteda_lda_stm.Rmd
requirements.txt		requirements.txt
sex_treemap.png		sex_treemap.png
split_utils.py		split_utils.py
text_complexity_utils.py		text_complexity_utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Motivation

Installations

Python

R

File Descriptions

Python Modules (.py)

Jupyter Notebooks

R Notebooks

sav files

Data sets: OkCupid Data

How to Interact With the Project

Licensing, Authors, Acknowledgements, etc.

Key Findings:

Future Directions

About

Releases

Packages

Languages

policyglot/topic_models_dating

Folders and files

Latest commit

History

Repository files navigation

Motivation

Installations

Python

R

File Descriptions

Python Modules (.py)

Jupyter Notebooks

R Notebooks

sav files

Data sets: OkCupid Data

How to Interact With the Project

Licensing, Authors, Acknowledgements, etc.

Key Findings:

Future Directions

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages