NLP Project

Data Science Job Listing Recommender

Abstract

The aim of this project was to create a recommender for job listings that helps data science job candidates find roles that suit them. I obtained the text of job listings by scraping Linkedin.com. Using the search term "Data Scientist" for about 50 locations in the English-speaking world, I obtained nearly 10,000 unique job listings with with either "Data Scientist" or "Machine Learning" in the title. Using regex, I attempted to extract from each job listings the section relevant to my recommender: i.e. that relating to what the actual job would entail doing. I then performed text-preprocessing before transforming the corpus into a 'bag of words' and undertaking topic modelling. The topic modelling technique I had most success with was NMF. I created a webapp using streamlit, which obtains user input for preference relating to topics / aspects of job. Having filtered on experience level, the app renders the 10 closest job listings in terms of cosine similarity to those expressed user preferences. The user can read the job listing on the app or click on a link to see the listing on Linkedin. The app is accessible here.

Data

Scraped web with Linkedin search on “data scientist” for about 50 locations
Only stored job listings with either “data scientist” or “machine learning” in title
Obtained nearly 13,000 listings of which roughly one third were duplicates
Extracting only the section of each job listing describing contents of job was difficult
Tackling this task with regex, I ended up with a corpus of just under 4000 observations

Algorithms

I experimented with the following: LSA, NMF (with different vectorizers), LDA, CorEx
CorEx with anchor words produced clean and useful-looking topics, but this didn't seem to translate into relevant job listings
As a result TF-IDF and NMF won out as the topic modelling methodology for this project

Tools

Web-scraping: Selenium and Beautiful soup
Data manipulation: Pandas
Data storage: Pickle
Text preprocessing: Python re and spaCy
Vectorizer: TF-IDF
Topic Modelling: NMF
App: Streamlit

Communication

The methodology and results of this project have been communicated via this document, and via a recorded presentation and its slide-deck.

The URL for the app is: https://share.streamlit.io/billbell73/nlp_project/main/streamlit_app/my_app.py

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
jupyter_notebooks		jupyter_notebooks
streamlit_app		streamlit_app
Presentation.pdf		Presentation.pdf
README.md		README.md
project_proposal.md		project_proposal.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NLP Project

Data Science Job Listing Recommender

Abstract

Data

Algorithms

Tools

Communication

About

Releases

Packages

Languages

billbell73/nlp_project

Folders and files

Latest commit

History

Repository files navigation

NLP Project

Data Science Job Listing Recommender

Abstract

Data

Algorithms

Tools

Communication

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages