Merge pull request #11 from BU-Spark/documentation
Documentation
NathanielQuisel authored Dec 10, 2024
2 parents 776b9d8 + e499527 commit 806a28b
Showing 26 changed files with 590 additions and 298 deletions.
Binary file added F24-media/drains-ui.png
File renamed without changes.
Binary file added F24-media/pipeline.png
File renamed without changes.
File renamed without changes.
299 changes: 209 additions & 90 deletions README.md

Large diffs are not rendered by default.

109 changes: 109 additions & 0 deletions S24-SUM24-Docs.md
@@ -0,0 +1,109 @@
# DRAINS: Deed Restriction Artificial Intelligence Notification System
SPARK! x MassMutual Data Days for Good

Created by Alessandra Lanz, Sahir Doshi, Cindy Zhang, Vijay Fisch, Sindhuja Kumar, Naman Nagaria, Valentina Haddad

## Project Overview
This project, developed for the [Longmeadow Historical Society](https://www.longmeadowhistoricalsociety.org), introduces an automated tool designed to identify racist restrictions within historical property deeds. Utilizing advanced text analysis techniques, the program processes TIFF images of property deeds, evaluates the text for racist content, and extracts critical information—specifically the deed date and page number—into a CSV format for efficient access and analysis.

### Key Features

- Image Processing: Accepts property deed images in TIFF format.
- Content Analysis: Employs text recognition and analysis algorithms to detect racist language.
- Data Extraction: Automates the extraction of deed date and page number for each document analyzed.

Our aim is to assist the Longmeadow Historical Society in their efforts to document and understand historical injustices, contributing to a broader societal recognition and rectification of past discrimination.

### Dataset Used
Historical property deeds from Massachusetts, mainly from the 1900s.

## Quick Start
### Requirements
Install essential libraries:
```bash
pip install -r requirements.txt
```

### Set up `OPENAI_API_KEY`
In the `modules` folder:

1. Duplicate the file `env.template`.

2. Add your API key and organization ID as `OPENAI_API_KEY` and `OPENAI_ORG_ID`. You can find them at https://platform.openai.com/api-keys and https://platform.openai.com/account/organization.

3. Rename the file to `.env`.

> To use a different ChatGPT version, change the `model` parameter in `racist_chatgpt_analysis.py` (line 13):
> `model="gpt-3.5-turbo"`
> To use ChatGPT-4, update this line to:
> `model="gpt-4-0125-preview"`

### Run the code
In `main.py`, change the folder path to your own path (line 36):
```python
racism_threshold('/Your/Path/To/Files')
```
On the **Windows operating system**, you need to edit the path manually so that all slashes are **backslashes**.
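For example, a Windows call might look like this (the path is illustrative only):

```python
# Raw string keeps the backslashes from being treated as escape characters.
racism_threshold(r'C:\Your\Path\To\Files')
```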

Then in command line, run:
```bash
python main.py
```

## Modules Overview

`OCR.py`: Employs Google's Tesseract OCR (Optical Character Recognition) engine, via the PyTesseract library, to convert deed images in TIFF format into searchable, analyzable text.

`bigotry_dict.py`: Contains a hardcoded dictionary of terms associated with racist language that is used to scrutinize the deed text for potential matches.

`locate.py`: Utilizes PyTesseract OCR to identify and extract specific information from the deed text, such as the deed date, book of origin, and page number.

`racist_chatgpt_analysis.py`: Integrates with OpenAI's ChatGPT API to process the text-based deeds for advanced racism detection, offering a nuanced analysis that goes beyond keyword matching.

`racist_text_query.py`: A failsafe text query module that acts as a backup for the ChatGPT analysis, manually checking deeds against the bigotry dictionary to ensure no instances of racist language are overlooked.

`pagenum.py`: A failsafe page number extraction module that acts as a backup for the data extraction done by `locate.py` by cropping the corners of the image for enlargement and easy OCR translation.
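As a rough illustration of how these pieces combine, the sketch below OCRs a TIFF with PyTesseract and checks the text against a small keyword set. It is a self-contained stand-in rather than the project's actual code; the real logic lives in `main.py` and the modules above, and the keyword subset and filename here are illustrative.

```python
# Hedged, self-contained illustration of the pipeline shape described above.
from PIL import Image
import pytesseract

BIGOTRY_TERMS = {"mulatto", "quadroon"}  # tiny illustrative subset, not the real bigotry_dict.py

def analyze_deed(tiff_path: str) -> dict:
    # OCR the deed image, then flag it if any keyword appears in the text.
    text = pytesseract.image_to_string(Image.open(tiff_path)).lower()
    matches = sorted(term for term in BIGOTRY_TERMS if term in text)
    return {"path": tiff_path, "keyword_matches": matches, "flagged": bool(matches)}

print(analyze_deed("deed_0001.tif"))  # illustrative filename
```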


# PIT-NE x SPARK! x MassMutual ~ SUMMER 2024

Created by Arnav Sodhani, Grace Chong, Hannah Choe

## Project Overview

This project is developed for the Longmeadow Historical Society and is a direct continuation of the work done by SPARK! x MassMutual Data Days for Good.

This interactive map shows the temporal progression of racist deeds in a Longmeadow, MA neighborhood in the early 1900s. To build it, we started from the Massachusetts registry of deeds and filtered to Hampden County. On the registry website, we selected property deeds whose seller was E.H. Robbins (a builder infamous for placing racial restrictions in deeds) and whose date fell in the early 1900s (Massachusetts passed its Fair Housing Act in 1946, so we looked for deeds prior to that year). We then organized and consolidated these deeds in a spreadsheet, flagged whether each deed contained racial restrictions, and normalized the spreadsheet so that each lot number had its own row. To finish the database, we added two new columns: Address Today (we matched each lot number to its current-day address using GIS technology and the Longmeadow lot plan) and House Image (we matched each address to its house image using Google Maps). Finally, we used ArcGIS, a web-mapping platform, to build our end deliverable: an interactive map that visualizes the data over time with a time slider.

### Key Features

- Cover and information page

- Time slider of the existence of deeds

- House icon: information on property deed

- Address search tool

- Filter tool for racial groups

### Process

Data Collection -

There were issues running the code from SPARK! x MassMutual Data Days for Good, so we decided to manually collect data from the Hampden County Registry of Deeds for Longmeadow, MA properties with E.H. Robbins as the grantor.

Data Cleaning and Transformation -
1. After manually collecting data on racist deeds in Longmeadow with E.H. Robbins as the grantor, we normalized the Lot # column so that each lot number has its own row, since a single deed may list multiple lot numbers (see the sketch below).
2. Then we matched each lot number to its modern-day address using an existing Longmeadow GIS.
3. Finally, we created a reference table with ID keys to map onto our GIS.
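A hedged pandas sketch of the normalization in step 1 is shown below; the column names and values are assumptions, not the spreadsheet's actual headers.

```python
# Illustrative normalization: give each lot number its own row.
import pandas as pd

deeds = pd.DataFrame({
    "book_page": ["123/45", "678/90"],       # assumed column names and values
    "lot_number": ["12, 13", "27"],
    "racial_restriction": [True, True],
})

normalized = (
    deeds.assign(lot_number=deeds["lot_number"].str.split(","))
         .explode("lot_number")                               # one row per lot
         .assign(lot_number=lambda d: d["lot_number"].str.strip())
         .reset_index(drop=True)
)
print(normalized)
```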


### Visualization

We chose ArcGIS as our visualization software for the interactive map.

Link to the ArcGIS instant app:
https://arcg.is/1aKD9b1
4 changes: 2 additions & 2 deletions app.py
@@ -15,10 +15,10 @@
# CORS(app, resources={r"/*": {"origins": "*"}})
CORS(app, supports_credentials=True, origins="*")

with open('modules/model_experimentation/vectorizer.pkl', 'rb') as vec_file:
with open('modules/models/vectorizer.pkl', 'rb') as vec_file:
vectorizer = pickle.load(vec_file)

with open('modules/model_experimentation/logistic_model.pkl', 'rb') as model_file:
with open('modules/models/logistic_model.pkl', 'rb') as model_file:
logistic_model = pickle.load(model_file)

# Helper to look for the book and page numbers
39 changes: 39 additions & 0 deletions dataset-documentation/DATASETDOC-fa24.md
@@ -0,0 +1,39 @@
### What is the project name?

MassMutual: Racist Deeds

### What is this project about? What is the goal of this project?

Making a data pipeline to classify scanned deeds as racist or not.

### What data sets did you use in your project? Please provide the link.

https://drive.google.com/drive/folders/1V9x-24SeIQlAyOeVQRXbRElQaw_ig6il

### Please provide a description of the data set, how it was collected, and how it was cleaned and processed.

It is a collection of housing deeds scanned into TIFF files.

### Did you use or create any data dictionaries for the data set in this project?

No

### Did the client put restrictions on this data?

No

### What is the data being used for? Please briefly explain the goal of the project.

To train a classification model to detect racist clauses in the data.

### Is there missing data from the client or additional data that needs to be collected by another team?

Yes, there is very little ground truth data. We need more labelled data, or to generate data synthetically.

### Who is the client for the project?

MassMutual and Longmeadow Historical Society

### Are there any limitations to the data you used? Are there specific use cases where the data should not be used?

There is very little ground truth data.
6 changes: 4 additions & 2 deletions drain/src/App.jsx
@@ -3,6 +3,8 @@ import DragDropArea from "./components/DragDropArea";
import "./App.css";
import Banner from "./components/Banner";

const BACKEND_URL = "https://spark-ds549-f24-racist-deeds.hf.space"

const bigotryTerms = new Set([
"irishman", "greek", "portugese", "mulatto", "quadroon", "chinaman", "jap", "hebrew",
"pole", "french canadian", "canadien", "quebecois", "arab", "turk", "frenchman", "german",
@@ -111,7 +113,7 @@ const App = () => {

// Reads from HuggingFace backend
const response = await fetch(
"https://spark-ds549-f24-racist-deeds.hf.space/api/upload",
`${BACKEND_URL}/api/upload`,
{
method: "POST",
body: formData,
@@ -142,7 +144,7 @@

const handleDownloadExcel = async () => {
const response = await fetch(
"https://spark-ds549-f24-racist-deeds.hf.space/api/download_excel",
`${BACKEND_URL}/api/download_excel`,
{
method: "POST",
headers: {
File renamed without changes.
76 changes: 0 additions & 76 deletions modules/deed_preprocessing/keyword_dect2.py

This file was deleted.

69 changes: 65 additions & 4 deletions modules/deed_preprocessing/readme.md
@@ -1,5 +1,66 @@
1. Download a zip of tiffs named tiffs.zip, and put it in the deed_preprocessing directory
2. Run read_tiffs.py to get a directory full of text outputs from Google Cloud OCR
3. The file preprocessor.py uses SpaCy to parse the sentences into objects
# Deed Preprocessing Module

## preprocessor.py

The preprocessor accepts a string, which should be the text output of an OCR model. It runs spaCy NLP on the text and parses metadata from the returned object. The following loop handles the parsing:

```python
for sent in doc.sents:
result["sentences"].append(sent.text)
result["sentence_lengths"].append(len(sent))

for token in sent:
pos = token.pos_
all_tokens.append(token.text)

if pos in pos_groups:
pos_groups[pos].append(token.text)

result["dependencies"].append({
"token": token.text,
"dep": token.dep_,
"head": token.head.text
})
result["token_offsets"].append({
"token": token.text,
"start": token.idx,
"end": token.idx + len(token.text)
})
```
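For context, here is a hedged sketch of the setup this loop assumes; the variable initializations are illustrative and are not copied from `preprocessor.py`.

```python
# Assumed setup for the excerpt above (illustrative values).
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed spaCy model
ocr_text = "This deed is subject to the following restrictions..."  # stand-in OCR output
doc = nlp(ocr_text)

all_tokens = []
pos_groups = {"NOUN": [], "VERB": [], "ADJ": [], "PROPN": []}  # assumed POS buckets
result = {"sentences": [], "sentence_lengths": [], "dependencies": [], "token_offsets": []}
```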

See eda.ipynb for more analysis on these objects.

## read_tiffs.py

This module can be used to read TIFFs from a ZIP file and store the results in a directory of text outputs called `/outputs`. Follow these steps (a hedged sketch of the core OCR step appears after the list):

- Make sure you set up Google Cloud OCR credentials first; see the README in `../google_cloud_ocr` for instructions.
- Download a ZIP of TIFFs from the SCC or Google Cloud, rename it `tiffs.zip`, and put it in the `deed_preprocessing` directory.
- Run the script and check the output text files in `/outputs`; note that this will consume Google Cloud credits.
- Use `preprocessor.py` to structure the text files into spaCy objects.

These steps are all done in eda.ipynb for further clarity.
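Below is a hedged sketch of the core OCR step, not the script's actual code: it reads TIFFs out of `tiffs.zip`, converts each to PNG bytes for the Vision API, and writes the detected text to `/outputs`. It assumes `google-cloud-vision` and Pillow are installed and that credentials are configured as described above.

```python
# Hedged sketch of the read_tiffs.py workflow (not the project's exact code).
import io
import zipfile
from pathlib import Path

from PIL import Image
from google.cloud import vision

client = vision.ImageAnnotatorClient()   # picks up GOOGLE_APPLICATION_CREDENTIALS
Path("outputs").mkdir(exist_ok=True)

with zipfile.ZipFile("tiffs.zip") as archive:
    for name in archive.namelist():
        if not name.lower().endswith((".tif", ".tiff")):
            continue
        # Convert the TIFF to PNG bytes so the Vision API accepts it directly.
        buffer = io.BytesIO()
        Image.open(io.BytesIO(archive.read(name))).convert("RGB").save(buffer, format="PNG")
        response = client.document_text_detection(image=vision.Image(content=buffer.getvalue()))
        (Path("outputs") / f"{Path(name).stem}.txt").write_text(response.full_text_annotation.text)
```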

## read_all_tiffs.py

This script is designed to be run on the SCC and submitted as a job with the accompanying file `ocr_deeds.sh`. It loops through the specified folders within the SCC that contain deed TIF files and retrieves the OCR text of these deeds. It then uses the "bigotry-dict", which contains a set of national identifiers, and checks whether any word within the OCR text matches one of these identifiers. If any word matches, the file is put in `./racist`; otherwise it is put in `./outputs` (a hedged sketch of this routing step appears below).
To run this file:
- Make sure your Google Cloud OCR credentials are set up; see the README in `../google_cloud_ocr` for instructions.
- Change the folders to be run on line 33 to the next set of desired folders.
- You can check which folders have been OCRed by going to `/outputs` and checking the first 6 numbers before the dash, which indicate the most recent folder OCRed.
- Submit an SCC job with `qsub ocr_deeds.sh`.
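The routing sketch below is illustrative only; the identifier set is a stand-in for the real bigotry dictionary and the source directory name is an assumption.

```python
# Hedged sketch: move a text file to ./racist if it matches the dictionary,
# otherwise to ./outputs (not the project's exact code).
import shutil
from pathlib import Path

BIGOTRY_TERMS = {"hebrew", "mulatto", "quadroon"}  # illustrative subset

def route_deed(txt_path: Path) -> Path:
    words = set(txt_path.read_text(errors="ignore").lower().split())
    dest = Path("racist") if words & BIGOTRY_TERMS else Path("outputs")
    dest.mkdir(exist_ok=True)
    shutil.move(str(txt_path), str(dest / txt_path.name))
    return dest

for txt in Path("ocr_text").glob("*.txt"):  # assumed source directory
    route_deed(txt)
```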

## reset_racist_dir.sh

This script is designed to be run on the SCC and submitted as a job with the accompanying file `reset_racist_dir.sh`. It is intended to update the `/racist` folder after changes are made to the "bigotry-dict". It moves all the files from `/racist` to `/outputs`, then loops through every .txt file in `/outputs` and moves a file back to `/racist` if any of its words match a word within the "bigotry-dict".
To run this file, simply run `qsub reset_racist_dir.sh` on the SCC.

## batch_process_racist_dir.py

This script is designed to make an OpenAI API call for each deed stored within the `/racist` folder, asking ChatGPT whether there is racist language in the document. This is useful for finding documents that actually contain racist restrictions, which can be used as ground truth for the logistic regression model or future models. Make sure OpenAI credentials are set up in advance.

## spellcheck.py

Script that uses the autocorrect library to improve extracted OCR text.
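A minimal example of the autocorrect library's documented usage (the script's actual parameters may differ):

```python
from autocorrect import Speller

spell = Speller(lang="en")
print(spell("the partie of the frst part"))  # returns a corrected string
```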


The above steps are all completed in eda.ipynb, which performs some preliminary analysis on the data objects.
@@ -1,25 +1,19 @@
# Google Cloud OCR Setup with Python

---

## Step 1: Create a Google Cloud Project

1. Go to the [Google Cloud Console](https://console.cloud.google.com/).
2. **Create a New Project**:
- In the top-left corner, click the project dropdown menu and then "New Project."
- Enter a project name (e.g., "OCR Project") and click "Create."

---

## Step 2: Enable the Cloud Vision API

1. In the **Google Cloud Console**, go to the **Navigation Menu** (three horizontal lines at the top left).
2. Click on **APIs & Services** > **Library**.
3. In the search bar, type **Vision API**.
4. Select **Cloud Vision API** and click **Enable**.

---

## Step 3: Create Service Account Credentials

1. Navigate to **APIs & Services** > **Credentials**.
@@ -35,8 +29,6 @@
- Select "Manage Keys" > "Add Key" > "Create New Key."
- Choose **JSON** format and download the JSON file. This file contains your credentials.

---

## Step 4: Set Up the `.env` File

1. Create a new file named `.env` in the root directory of your Python project.
@@ -45,8 +37,7 @@
```bash
GOOGLE_APPLICATION_CREDENTIALS=/path-to-your-credentials.json
```

---
We recommend setting the path to `./credentials/google-cloud.json`.
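As a hedged illustration, the snippet below shows how a script might load the `.env` entry with `python-dotenv` and create a Vision client; the client library then reads `GOOGLE_APPLICATION_CREDENTIALS` from the environment.

```python
# Hedged sketch: load .env, then let the Vision client pick up the credentials.
from dotenv import load_dotenv
from google.cloud import vision

load_dotenv()  # exposes GOOGLE_APPLICATION_CREDENTIALS to the process
client = vision.ImageAnnotatorClient()
```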

## Step 5: Running the script
