Skip to content

Scrape OKC profiles into pandas DataFrames for exploration

License

Notifications You must be signed in to change notification settings

stevendevan/OKCScrape

Repository files navigation

Notice

I've started a redo of the webscraper in Python 3 and now as a command line app. Find it here.

OKCScrape

Scrape OKC profiles into pandas DataFrames for exploration

Info for recruiters

Due to the fact that essentially no business-side data is available to scrape, I have chosen 'user essay total word count' as a candidate for predictive modeling.

Using OKCScrape

As I mention in the code comments, this project is adapted from a ~4-year old project, back when OKC profiles had tabulated data. Most of the data is now contained in sentences generated from user information, and is much more difficult to extract data from.

To use fetchusers.py, you will need to grab your OKCupid cookies from your browser and feed those to selenium. Check out the code to see how I currently do it. The way you extract, store, and retrieve your cookies is open-ended, so you can accomplish that however is easiest for you.

Dependencies

I originally used Python 2.7 for this project, but it could easily be adapted to 3.x I'm not using a virtual environment, so I'll list the python dependencies here.

Notes:

  • I'm using regex instead of the built in re module because there was some missing functionality in re
  • I don't know that much about absl-py. The original project used gflags, which has now apparently been deprecated in favor of absl-py. You might know of a better way to implement this functionality.

Other stuff

You'll need the chromedriver executable. This is what selenium will be using to fetch webpages, as opposed to your normal browser. You could probably use another headless browser instead, but this is what I'm using.

Downloads page for the chromedriver executable

About

Scrape OKC profiles into pandas DataFrames for exploration

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published