CREATIVE --- Face_Url_Scraper_2022

Welcome! This repo contains scripts for collecting and organizing face images of political figures for the 2022 US election cycle. The scripts can be used to create a comprehensive dataset of face images.

This repo is a part of the Cross-platform Election Advertising Transparency Initiative (CREATIVE). CREATIVE has the goal of providing the public with analysis tools for more transparency of political ads across online platforms. In particular, CREATIVE provides cross-platform integration and standardization of political ads collected from Google and Facebook. CREATIVE is a joint project of the Wesleyan Media Project (WMP) and the privacy-tech-lab at Wesleyan University.

To analyze the different dimensions of political ad transparency we have developed an analysis pipeline. The scripts in this repo are part of the Data Collection step in our pipeline.

1. Video Tutorial

Creative_Tutorials-face_url_scrapper_2022.1.mp4

If you are unable to see the video above (e.g., you are getting the error "No video with supported format and MIME type found"), try a different browser. We tested the video to work on Google Chrome.

Alternatively, you can watch this tutorial on YouTube.

2. Overview

This repo contains scripts for collecting and organizing face images of political figures for the 2022 US election cycle. It also contains a dataset ready for your use.

The repo contains scripts that scrape and organize face images from various online sources in a hierarchical manner. The primary sources are:
- Ballotpedia
- Wikipedia
- Persons' Facebook pages
- Candidate campaign websites
If a suitable face image cannot be found in these primary sources, we attempt to scrape one from VoteSmart. VoteSmart is used only as a fallback because its face images are often too small (low resolution) for face recognition purposes. The VoteSmart scraper is located in a separate folder, /votesmart.
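The fallback logic described above can be illustrated with a minimal sketch. This is not the code from the notebooks: the source keys, the function name pick_face_url, and the priority order among the primary sources are assumptions made for illustration only.

```python
from typing import Dict, Optional

# Hypothetical source keys. The order among primary sources is an assumption;
# the README only states that VoteSmart is tried last.
PRIMARY_SOURCES = ["ballotpedia", "wikipedia", "facebook", "campaign_website"]
FALLBACK_SOURCE = "votesmart"


def pick_face_url(urls_by_source: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the first available URL, trying primary sources before the fallback."""
    for source in PRIMARY_SOURCES + [FALLBACK_SOURCE]:
        url = urls_by_source.get(source)
        if url:  # skips missing keys, None, and empty strings
            return url
    return None


# Example with made-up URLs: the Wikipedia image wins over the VoteSmart one.
print(pick_face_url({
    "wikipedia": "https://example.org/a.jpg",
    "votesmart": "https://example.org/b.jpg",
}))
```

Keeping the priority in a single list makes it easy to reorder sources without touching the selection logic.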

The scrapers collect face images in categories based on the political figures' roles, such as legislative (House and Senate candidates, sitting Senators), executive (presidents, cabinet members, governors), judicial (Supreme Court Justices), and other prominent persons. You can find a more detailed description of the face categories in the Face Categories section.

The scrapers are named after their sources and categories. For example, the scraper that collects senators' face images from Ballotpedia is called 01_ballotpedia_scaper_senate.ipynb. The script that organizes (i.e., cleans up) the senators' scraped results from Ballotpedia is called 01_ballotpedia_scaper_senate_cleanup.ipynb. The data produced by each script is stored in the data folder.
After face images have been collected from the sources, a final quality control and assembly stage is performed via 07_face_url_final_assemble.ipynb and 08_face_url_final_selection.ipynb. During this stage, a few additional face images from other websites may be added manually to ensure the best possible quality and consistency across the dataset. After quality control, all the collected and organized face image URLs are combined into a single CSV file, result_face_url_2022.csv, by the 08_face_url_final_selection.ipynb script.
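As a rough illustration of the assembly stage, the sketch below merges several lists of rows into one record per wmpid. It is an assumption about how such a merge could work, not a description of what the 07 and 08 notebooks actually do; the function name assemble and the sample rows are hypothetical.

```python
def assemble(row_lists):
    """Merge row lists in priority order, keeping one row per wmpid.

    A later row replaces an earlier one only if the earlier row has no
    face URL and the later one does.
    """
    merged = {}
    for rows in row_lists:
        for row in rows:
            current = merged.get(row["wmpid"])
            if current is None or (not current["face_url_2022"] and row["face_url_2022"]):
                merged[row["wmpid"]] = row
    return list(merged.values())


# Made-up rows: the first list lacks a URL for W1, the second supplies one.
first = [{"wmpid": "W1", "full_name": "A", "face_url_2022": "", "face_category_2022": "House"}]
second = [
    {"wmpid": "W1", "full_name": "A", "face_url_2022": "https://example.org/a.jpg", "face_category_2022": "House"},
    {"wmpid": "W2", "full_name": "B", "face_url_2022": "", "face_category_2022": "Senate"},
]
for row in assemble([first, second]):
    print(row["wmpid"], row["face_url_2022"] or "(no image)")
```

Rows without a URL are kept rather than dropped, so that persons without a face image still appear in the combined output.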

Results

Here is a summary of the final face image collection dataset:

1,650 unique wmpids (face image ids) for the 2022 face collection
Among them, 94 wmpids do not have face images

The final output contains the following fields:

wmpid: unique id for each political figure
full_name: full name of the political figure
face_url_2022: the url of the face image
face_category_2022: the category of the political figure

An example of a line in the final output is shown below:

wmpid	full_name	face_url_2022	face_category_2022
WMPID1000	Josh Gottheimer	https://s3.amazonaws.com/ballotpedia-api4/files/thumbs/200/300/Josh_Gottheimer.jpg	House
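To work with the output fields listed above, a short sketch like the following could be used to count unique wmpids and those without a face image. It reads an inline sample rather than the real file. The comma delimiter, the representation of a missing image as an empty face_url_2022 value, and the second sample row are assumptions.

```python
import csv
import io

# Inline sample standing in for result_face_url_2022.csv.
# The first row is the README example; the second row is made up.
SAMPLE = (
    "wmpid,full_name,face_url_2022,face_category_2022\n"
    "WMPID1000,Josh Gottheimer,https://s3.amazonaws.com/ballotpedia-api4/"
    "files/thumbs/200/300/Josh_Gottheimer.jpg,House\n"
    "WMPID0000,Example Person,,House\n"
)


def summarize(rows):
    """Return (number of unique wmpids, number of wmpids without a face URL)."""
    ids = set()
    missing = set()
    for row in rows:
        ids.add(row["wmpid"])
        if not (row["face_url_2022"] or "").strip():
            missing.add(row["wmpid"])
    return len(ids), len(missing)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
total, without_image = summarize(rows)
print(f"{total} unique wmpids, {without_image} without a face image")
```

To run the same summary on the real dataset, replace the io.StringIO(SAMPLE) argument with a file handle opened on result_face_url_2022.csv.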

Face Categories

The face images are collected from various sources and are categorized into the following categories:

Legislative

Candidate source: wmpcand_012523_wmpid.csv

This file contains the wmpid of the candidates in the 2022 election cycle. This is the source file that we use to collect the face images of the candidates in script 01_ballotpedia_scaper_senate_cleanup.ipynb and 02_ballotpedia_scaper_house_cleanup.ipynb.
Face url source: Ballotpedia / Wikipedia

House candidate (1193)
Senate candidate (173)
Sitting US senators, 117th Congress (100)

Executive

Current president (1)
All former presidents (44)
- face url source: https://www.whitehouse.gov/about-the-white-house/presidents/
Cabinet (25)
- face url source: https://www.whitehouse.gov/administration/cabinet/
Gubernatorial candidates (82)
- Candidate source: Priors 2022.xlsx provided by ABC News
Sitting governor (14)
- face url source: https://www.nga.org/governors/
Public health related leaders (6)
- The Secretary of the Department of Health and Human Services
- The Surgeon General
- Director of the National Institute of Allergy and Infectious Diseases
- Directors of CDC, FDA, NIH

Judicial

Judicial (16)

Supreme Court Justice (current)
Supreme Court Justice (former) - all who took judicial oath after 1980
- face url source: https://www.supremecourt.gov/about/biographies.aspx

Other Prominent Persons

Other prominent politicians (3): Mike Pence, Hillary Clinton, Robert Mueller
International political leaders (23)
- G20: https://en.wikipedia.org/wiki/G20
- The Secretary-General of the UN
- Director-General of the WHO: Tedros Adhanom
Political historical figures (1)
- Martin Luther King Jr. (MLK)

Copyrights of Face Images

The face images are collected from various sources, and the copyrights of the face images are as follows:

Ballotpedia: https://ballotpedia.org/Ballotpedia:Image_use_policy
- GFDL licenses
- "These images are available for reuse in non-commercial settings with attribution. Please use the following language when using any images that belong to Ballotpedia: This image comes from the website Ballotpedia.org. It is suitable for reuse under GFDL licensing."
Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Image_use_policy
- GFDL licenses
White House: https://www.whitehouse.gov/copyright/
- Creative Commons Attribution 3.0 License: https://creativecommons.org/licenses/by/3.0/us/
- "Share — copy and redistribute the material in any medium or format"
National Governors Association: N.A.
Supreme Court: N.A.

3. Setup

To run the code that replicates the face image collection, first clone this repository to your local machine:
```
git clone https://github.com/Wesleyan-Media-Project/face_url_scraper_2022.git
```
In addition, you also need to clone the datasets repository to your local machine, since some scripts in this repo require candidate information files such as wmpcand_012523_wmpid.csv from the datasets repo:
```
git clone https://github.com/Wesleyan-Media-Project/datasets.git
```
The scripts in this repo are written in Python and R. Make sure you have both installed and set up before continuing. To install and set up Python, you can follow the Beginner's Guide to Python. To install and set up R, you can follow the CRAN website. We also recommend RStudio as an interface for R; see the RStudio website. The Python scripts in this repo use Jupyter Notebook, an interactive environment for Python development, as their interface. You can install Jupyter Notebook by following the Jupyter Notebook website.
After you have installed the above software, install the required libraries for both Python and R. First, install the following R libraries, which are needed by votesmart_scraper_2022.R:
- tidyverse
- rvest
- httr
- xml2
To install them, open your terminal and type R to start the R console. Then run the following commands:
```
install.packages("tidyverse")
install.packages("rvest")
install.packages("httr")
install.packages("xml2")
```
Next, install the required dependencies for the Python scripts. You can install them by running the following command in your terminal:
```
pip install pandas numpy beautifulsoup4 fuzzywuzzy
```
After you have installed the required libraries, you can run the code in the order of the numbers in the file names (e.g., you can start with 01_ballotpedia_scaper_senate_cleanup.ipynb, then 01_ballotpedia_scaper_senate.ipynb, then 02_ballotpedia_scaper_house_cleanup.ipynb ... 08_face_url_final_selection.ipynb).

To run the Jupyter notebooks (files ending in .ipynb), open the Jupyter Notebook interface by typing the following in your terminal:
```
jupyter notebook
```
After you open the Jupyter Notebook interface, you can navigate to the folder where you have cloned the repo and open the script you want to run.

Then, click on the first code cell to select it. Run each cell sequentially by clicking the Run button or pressing Shift + Enter.
To run the R script votesmart_scraper_2022.R, you can type the following command in your terminal:
```
Rscript votesmart/votesmart_scraper_2022.R
```

Note: When you re-run the scripts, the scraped face images will reflect the current office holders listed on the relevant official websites, so the results may differ from the 2022 dataset.

4. Thank You

We would like to thank our supporters!

This material is based upon work supported by the National Science Foundation under Grant Numbers 2235006, 2235007, and 2235008.

The Cross-Platform Election Advertising Transparency Initiative (CREATIVE) is a joint infrastructure project of the Wesleyan Media Project and privacy-tech-lab at Wesleyan University in Connecticut.
