
A FastAPI web application that automatically collects Twitter posts about police presence and classifies them for the type of force described via a BERT machine learning model.


hillarykhan/human-rights-first-police-ds-a

 
 


Overview

Human Rights First (HRF) is a US-based nonprofit, nonpartisan organization concerned with international human rights. At its forefront are American ideals and universal values. For nearly 40 years HRF has challenged the status quo by highlighting the global struggle for human rights and stepping in to demand reform, accountability and justice. The goal of this project is to create a fully functioning web application that visualizes valid, current incidents of police use of force within the United States. The information will help users, such as journalists and passersby, form their own perspectives on current matters. The user interface draws attention to clusters of incidents located through geotagging.

Many Lambda Labs teams have worked on this project over the past 10 months. In the final month of development, Labs Cohort 36 was tasked with finalizing our codebase and architecture to deploy a production-ready app. This included: automating our collection of Twitter data, deploying to AWS Elastic Beanstalk, adapting our database architecture to the backend team's schema, labeling 5,000 tweets to retrain our BERT model, creating performance metrics for our model, cleaning our codebase, and updating the documentation.


Features

Deployed Product

Front End Dashboard | Data Science API


Twitter Scraper

  • Automated through the FastAPI framework in main.py to run every four hours
  • Each time it runs, it randomly selects a search query from a set of phrases (police, police brutality, police abuse, police violence) to use in the Twitter API search
  • Relevant functions for the scraper feature can be found in scraper.py
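
The query selection described above can be sketched in a few lines. This is a minimal illustration, not the code in scraper.py: the names SEARCH_QUERIES, SCRAPE_INTERVAL_SECONDS, and pick_search_query are hypothetical, and the scheduling in main.py is left out.

```python
import random

# Search phrases listed in this README.
SEARCH_QUERIES = ["police", "police brutality", "police abuse", "police violence"]

# Interval between scraper runs: four hours, expressed in seconds.
# (Hypothetical constant; the real value is configured in main.py.)
SCRAPE_INTERVAL_SECONDS = 4 * 60 * 60


def pick_search_query(rng=random):
    """Return one search phrase chosen uniformly at random.

    `rng` can be the `random` module or a seeded `random.Random`
    instance, which makes the choice reproducible in tests.
    """
    return rng.choice(SEARCH_QUERIES)


if __name__ == "__main__":
    print(pick_search_query(), SCRAPE_INTERVAL_SECONDS)
```

Passing a seeded random.Random instance keeps the selection deterministic when testing the scraper logic.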

BERT Model

BERT is an open-source, pre-trained natural language processing (NLP) model from Google. In our project, BERT takes the tweets collected by our Twitter scraper and predicts whether each tweet discusses police use of force and, if so, what type of force was used. The model assigns one of six ranks:

  • Rank 0: No police presence.
  • Rank 1: Police are present, but no force detected.
  • Rank 2: Open-hand: Officers use bodily force to gain control of a situation. Officers may use grabs, holds, and joint locks to restrain an individual.
  • Rank 3: Blunt Force: Officers use less-lethal technologies to gain control of a situation. For example, a baton or projectile may be used to immobilize a combative person.
  • Rank 4: Chemical & Electric: Officers use less-lethal technologies to gain control of a situation, such as chemical sprays, projectiles embedded with chemicals, or tasers to restrain an individual.
  • Rank 5: Lethal Force: Officers use lethal weapons (guns, explosives) to gain control of a situation.
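
For code that consumes the model's output, the six ranks can be represented as an enumeration. The sketch below is illustrative only: the ForceRank class and involves_force helper are hypothetical names, and treating rank 2 and above as "force used" is an assumption drawn from the rank descriptions above.

```python
from enum import IntEnum


class ForceRank(IntEnum):
    """The six ranks listed above (names are hypothetical)."""
    NO_POLICE_PRESENCE = 0
    POLICE_PRESENT_NO_FORCE = 1
    OPEN_HAND = 2
    BLUNT_FORCE = 3
    CHEMICAL_AND_ELECTRIC = 4
    LETHAL_FORCE = 5


def involves_force(rank: int) -> bool:
    """Return True when the rank describes any use of force.

    Assumption: ranks 2 through 5 count as force; ranks 0 and 1 do not.
    Raises ValueError for values outside 0..5.
    """
    return ForceRank(rank) >= ForceRank.OPEN_HAND


if __name__ == "__main__":
    for rank in ForceRank:
        print(int(rank), rank.name, involves_force(rank))
```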

The BERT model is not currently stored in the GitHub repository because of its large file size. To run the app locally, manually place the saved_model file in the app directory.


Notebooks

There are two notebooks pertaining to the model:

  • BertModel.ipynb: trains a BERT instance on the data in the training table of our AWS Postgres database
  • BertPerformance.ipynb: used for statistical analysis and to calculate model performance metrics (e.g. binary and multi-class confusion matrices, accuracy, etc.)
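
The metrics named above can be illustrated without any dependencies. This is a sketch, not the contents of BertPerformance.ipynb: the function names are hypothetical, and collapsing ranks into a binary label at rank 2 ("force" versus "no force") is an assumption.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Build an n_classes x n_classes matrix; rows are true labels,
    columns are predicted labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def to_binary(ranks, threshold=2):
    """Collapse ranks to 1 (force) or 0 (no force).

    Assumption: ranks at or above `threshold` count as force.
    """
    return [1 if rank >= threshold else 0 for rank in ranks]


if __name__ == "__main__":
    # Made-up labels purely for illustration.
    y_true = [0, 1, 2, 3, 5, 5]
    y_pred = [0, 1, 3, 3, 5, 4]
    multi = confusion_matrix(y_true, y_pred, 6)
    binary = confusion_matrix(to_binary(y_true), to_binary(y_pred), 2)
    print(accuracy(multi), accuracy(binary))
```

A model can score well on the binary matrix while still confusing neighboring force ranks, which is why both views are useful.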

These notebooks can be accessed from your virtual environment once all dependencies are installed within it. Two additional libraries, Transformers and psycopg2-binary, are both installed after running the first cell in the notebooks.


DS Architecture

Architecture


Old Codebase

Old, currently undeployed code is stored in the archive folder of the repo. Some files are kept to show the evolution of the code from previous Lambda cohorts to the current deployed code. Others are starter code that could provide inspiration for features that were deprioritized for the initial release (e.g. a conversational Twitter bot). A more in-depth description of each file is stored in a markdown file in the archive directory.


Next Steps

For those interested in improving upon the data science codebase, here are some recommendations:

  • Explore the efficacy of separating the AWS 'postgres' database into two different databases. The first would be the primary database for the Twitter scraper outputs, with a schema redesigned by DS to fit their needs. The second would be the primary database for the backend, which could extract data from the DS database and fit the schema to its own needs. Currently, the primary AWS data table 'force_ranks' is accessible in both the data science and backend codebases.
  • Develop an evidence-based strategy to maximize the effectiveness of our Twitter queries in the scraper feature. Currently, the Twitter API has a limit of 500 tweets per scrape. The strategy would include developing metrics to compare querying methods, which would allow us to determine which methods return a greater percentage of tweets describing police use of force in the United States.
  • Continue to improve BERT model performance. There is a deactivated labeler web application, created by Robert Sharp, that is connected to a repository of nearly 300,000 unlabeled tweets. The model was retrained at the end of July with roughly 6,000 manually labeled tweets. Labeling about 4,000 more to retrain the model and assess performance improvements may be worthwhile. Alternatively, because the model has greater difficulty identifying use-of-force ranks 2, 3, and 4, a strategy that increases the number of tweets the model sees for these classes could improve it in a more targeted way.
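
As a starting point for the query-comparison metric suggested in the second recommendation, one option is the share of tweets per query that the model ranks as use of force. The sketch below is hypothetical: force_rate_by_query is not part of the codebase, the input is assumed to be (query, predicted rank) pairs, rank 2 is assumed to be the lowest force rank, and it does not check whether a tweet concerns the United States.

```python
from collections import defaultdict


def force_rate_by_query(records, min_force_rank=2):
    """Return {query: fraction of its tweets ranked as use of force}.

    `records` is an iterable of (query, predicted_rank) pairs.
    Assumption: ranks at or above `min_force_rank` count as force.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for query, rank in records:
        totals[query] += 1
        if rank >= min_force_rank:
            hits[query] += 1
    return {query: hits[query] / totals[query] for query in totals}


if __name__ == "__main__":
    # Made-up records purely for illustration.
    sample = [
        ("police", 0),
        ("police", 2),
        ("police brutality", 3),
        ("police brutality", 5),
    ]
    print(force_rate_by_query(sample))
```

Because the rate depends on the model's own predictions, it inherits the model's errors; a manually labeled sample per query would give a more trustworthy comparison.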



Labs 36 Contributors

  • Hillary Khan, Data Scientist
  • Marcos Morales, Data Scientist
  • Eric Park, Data Scientist



Getting Started

Dependencies

pandas numpy scikit-learn torch transformers spacy plotly tweepy beautifulsoup4 SQLAlchemy dataset python-dotenv uvicorn fastapi fastapi-utils


Environment Variables

In order for the app to function correctly, the user must set up their own environment variables. There should be a .env file containing the following:

1. Twitter API connection (through tweepy); use the HRF Twitter developer account.
	a. CONSUMER_KEY
	b. CONSUMER_SECRET
	c. ACCESS_KEY
	d. ACCESS_SECRET
2. Postgres database connection 
	a. DB_URL
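
A quick way to catch a missing variable before the app starts is to validate the environment up front. The sketch below is illustrative: load_settings and REQUIRED_VARS are hypothetical names, not part of the codebase. python-dotenv, listed in the dependencies, provides load_dotenv() for reading a .env file into the process environment; that step is omitted here so the example uses only the standard library.

```python
import os

# Variable names listed in this README.
REQUIRED_VARS = (
    "CONSUMER_KEY",
    "CONSUMER_SECRET",
    "ACCESS_KEY",
    "ACCESS_SECRET",
    "DB_URL",
)


def load_settings(env=None):
    """Return the required settings as a dict.

    Reads from `env` (any mapping) or, by default, os.environ.
    Raises RuntimeError naming every variable that is missing or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


if __name__ == "__main__":
    try:
        settings = load_settings()
        print("All required variables are set:", sorted(settings))
    except RuntimeError as error:
        print(error)
```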

Installation Instructions and running API locally

For AWS deployment we used requirements.txt to store our dependencies. Here are steps to create a virtual environment and install dependencies from our requirements.txt to run the app locally. Alternative instructions for creating a pipfile with pipenv follow. All code is for Unix/macOS. Here are the Windows equivalents for creating a virtual environment with pip.

  1. clone the repo
  2. cd into repo
  3. create virtual environment:
$ python3 -m venv name_for_env
  4. activate virtual environment:
$ source name_for_env/bin/activate
  5. check activation:
$ which python
# should return:
#   name_for_env/bin/python
  6. install all dependencies with requirements.txt:
$ python3 -m pip install -r requirements.txt
  7. run the API locally on your machine:
$ gunicorn app.main:app -w 1 -k uvicorn.workers.UvicornWorker

Or

$ uvicorn app.main:app --reload
  8. close the app with control+c in terminal
  9. deactivate environment:
$ deactivate

If you prefer to use pipenv and create a pipfile from our requirements.txt:

  1. clone the repo
  2. cd into repo
  3. install pip environment (this will create a pipfile for you):
$ pipenv install
  4. activate the environment:
$ pipenv shell
  5. run the API locally on your machine:
$ gunicorn app.main:app -w 1 -k uvicorn.workers.UvicornWorker

Or

$ uvicorn app.main:app --reload
  6. close the app with control+c in terminal
  7. deactivate environment:
$ exit

How to access DB from browser

CredentialsMap
