Scraper used for recording changes to Portland jail database

NguyenDa18/Portland-Jail-Data-Crawler

Multnomah County Jail Crawler

scraper-pdx-jail

Portland Justice

Purpose

Crawl the bookings in the PDX jail database for data analysis and data transparency purposes. Scheduled GitHub Actions jobs keep the data files up to date.

  • Visit the Multnomah County Online Inmate Data website, using the URL that lists all inmates in custody
  • Scrape inmate names and booking dates and update csvs/inmate_bookings.csv
  • Visit each inmate's link and update csvs/inmate_details.csv with the inmate's details and the total amounts for each type of charge against them
  • Update csvs/inmate_charges.csv with the list of charges for all inmates
  • Update the JSON files in the counts folder daily with the counts for each category
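The daily counts step above can be sketched as follows. This is only an illustration under stated assumptions, not the repository's code: the helper names `count_by_category` and `write_daily_counts`, the `facility` field, the one-file-per-category layout, and the date-keyed JSON structure are all hypothetical. It uses only the Python standard library.

```python
import json
import tempfile
from collections import Counter
from datetime import date
from pathlib import Path


def count_by_category(rows, field):
    """Count how many rows share each value of `field`."""
    return dict(Counter(row[field] for row in rows))


def write_daily_counts(rows, field, out_dir, day):
    """Store the counts for `day` in <out_dir>/<field>.json, keyed by ISO date.

    Earlier days already in the file are kept; the entry for `day` is replaced.
    """
    path = Path(out_dir) / f"{field}.json"
    history = json.loads(path.read_text()) if path.exists() else {}
    history[day.isoformat()] = count_by_category(rows, field)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(history, indent=2, sort_keys=True))
    return history


# Made-up sample rows; the real column names may differ.
sample = [
    {"name": "A", "facility": "Facility 1"},
    {"name": "B", "facility": "Facility 2"},
    {"name": "C", "facility": "Facility 1"},
]

# Write into a temporary folder so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    history = write_daily_counts(sample, "facility", tmp, date(2024, 1, 5))
    saved = json.loads((Path(tmp) / "facility.json").read_text())
```

Keying each file by date means a scheduled job can run once a day and append to the history, and a rerun on the same day overwrites that day's entry instead of duplicating it.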

Scraper Details

  • Located at inmates_spider/inmates_spider/spiders/inmates.py
  • Generate a DataFrame of inmates and booking dates, sorted by booking date in descending order, and update csvs/inmate_bookings.csv
  • Follow each inmate's URL and generate metadata for each inmate, then update the inmates_charges MongoDB database with the charge totals data
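A minimal sketch of the bookings step, assuming the listing page is an HTML table with a linked name and a booking date in each row. The markup in `SAMPLE_HTML`, the `BASE_URL` placeholder, the date format, and the function names `parse_bookings` and `bookings_frame` are assumptions made for illustration; the real page and the spider in inmates.py may differ.

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "https://example.org/inmates/"  # placeholder, not the real site

# Hypothetical markup standing in for the listing page.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Booking Date</th></tr>
  <tr><td><a href="detail/1">Doe, Jane</a></td><td>01/05/2024</td></tr>
  <tr><td><a href="detail/2">Roe, Richard</a></td><td>03/17/2024</td></tr>
  <tr><td><a href="detail/3">Poe, Alex</a></td><td>02/09/2024</td></tr>
</table>
"""


def parse_bookings(html, base_url=BASE_URL):
    """Extract name, booking date, and detail URL from each table row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 2:
            continue  # skip the header row, which has <th> cells only
        link = cells[0].find("a")
        rows.append({
            "name": cells[0].get_text(strip=True),
            "booking_date": cells[1].get_text(strip=True),
            "url": urljoin(base_url, link["href"]) if link else None,
        })
    return rows


def bookings_frame(rows):
    """Build a DataFrame sorted by booking date, newest first."""
    df = pd.DataFrame(rows, columns=["name", "booking_date", "url"])
    # Assumed date format: month/day/year.
    df["booking_date"] = pd.to_datetime(df["booking_date"], format="%m/%d/%Y")
    return df.sort_values("booking_date", ascending=False).reset_index(drop=True)


df = bookings_frame(parse_bookings(SAMPLE_HTML))
```

From here, `df.to_csv("csvs/inmate_bookings.csv", index=False)` would write the sorted table, and the `url` column gives the detail pages to follow for each inmate.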

Using

  • BeautifulSoup
  • Pandas
  • GitHub Actions (for cron job running scraper)
  • MongoDB (using pymongo Python package)

Enhancements

  • Storing data to a Database
  • Optimizing crawling
  • Using Scrapy Spider instead of BeautifulSoup
  • Creating UI for viewing data
  • Sending a notification when a "red flag" is released

Running It Yourself

Prerequisite: Python 3 needs to be installed

  1. Clone the repo
  2. Activate the virtual environment:
     source venv/bin/activate
  3. Install the dependencies in the virtual environment:
     pip install -r requirements.txt
  4. The best way to experiment is with Jupyter Notebook:
     jupyter notebook

Then run experimental code in Sandbox Notebook.ipynb
