I've started a rewrite of this web scraper in Python 3, now as a command-line app. Find it here.
Scrape OKC profiles into pandas DataFrames for exploration
- Look at profileparser.py for the regex work and data extraction
- Look at my IPython notebook to see how I explore the data
Because essentially no business-side data is available to scrape, I have chosen 'user essay total word count' as a candidate target for predictive modeling.
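The target above is cheap to compute once the essays are extracted. A minimal sketch (the essay field names here are placeholders, not the real profile fields):

```python
# Hedged sketch: computing 'user essay total word count' from a dict of
# essay texts. Field names are made up for illustration.
essays = {
    "self_summary": "I like long walks and short scripts.",
    "favorites": "Coffee, regex, and pandas.",
}

# Total words across all essays, using whitespace splitting.
total_words = sum(len(text.split()) for text in essays.values())
print(total_words)  # 11
```

In practice you'd compute this per profile and store it as a column in the pandas DataFrame.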
As I mention in the code comments, this project is adapted from a roughly four-year-old project, from when OKC profiles still had tabulated data. Most of the data is now embedded in sentences generated from user information, which makes it much harder to extract.
To use fetchusers.py, you will need to grab your OKCupid cookies from your browser and feed them to selenium. Check out the code to see how I currently do it. How you extract, store, and retrieve your cookies is open-ended, so do whatever is easiest for you.
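One easy route is to copy the raw `Cookie:` header from your browser's dev tools and convert it into the dicts that selenium's `driver.add_cookie` expects. A sketch of that conversion (the cookie names below are placeholders, not real OKCupid cookies):

```python
# Hedged sketch: turn a raw "Cookie:" header string into Selenium-style
# cookie dicts. Cookie names/values here are placeholders.

def parse_cookie_header(header, domain=".okcupid.com"):
    """Split 'name1=val1; name2=val2' into dicts for driver.add_cookie."""
    cookies = []
    for pair in header.split(";"):
        name, _, value = pair.strip().partition("=")
        if name:
            cookies.append({"name": name, "value": value, "domain": domain})
    return cookies

raw = "session=abc123; csrf=xyz789"
cookies = parse_cookie_header(raw)
print(cookies[0]["name"])  # session

# Later, after driver.get("https://www.okcupid.com"), you would do:
#     for c in cookies:
#         driver.add_cookie(c)
```

Note that selenium requires you to navigate to the cookie's domain before calling `add_cookie`.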
I originally used Python 2.7 for this project, but it could easily be adapted to 3.x. I'm not using a virtual environment, so I'll list the Python dependencies here:
- absl-py (the absl-py github has more information)
- BeautifulSoup (this is the old, Python 2.7 version)
- pandas
- regex
- selenium
Notes:
- I'm using regex instead of the built-in re module because re is missing some functionality
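One concrete gap, as an illustration (this is a general limitation of `re`, not necessarily the specific one that motivated the switch here): the built-in module rejects variable-width lookbehind, which the third-party `regex` module supports.

```python
import re

# The stdlib re module only allows fixed-width lookbehind; \s+ makes
# this one variable-width, so compilation fails. The third-party
# regex module accepts the same pattern.
try:
    re.compile(r"(?<=essay\s+)\w+")
    supported = True
except re.error:
    supported = False

print(supported)  # False with the stdlib re module
```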
- I don't know much about absl-py. The original project used gflags, which has apparently been deprecated in favor of absl-py. You might know of a better way to implement this functionality.
You'll need the chromedriver executable; this is what selenium uses to fetch webpages, rather than your normal browser. You could probably use another headless browser instead, but this is what I'm using.