
A tool for Vanderbilt Hustler staffers to convert PDFs to spreadsheets.


VanderbiltHustler/PDFParser


PDFParser

Status: [In-Progress]
Published: [date here]
Updated: [9/22]

Vanderbilt Hustler internal tool to convert PDFs to spreadsheets!

How this tool works

This Python script uses the tabula library to read a PDF, load its tables into a dataframe, and export the result as a CSV. For PDFs with multiple tables, the script outputs each table as a separate sheet.
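The flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the function names `sheet_name` and `pdf_to_excel` and the "Table N" sheet naming are hypothetical, and it assumes Excel output through openpyxl, since separate sheets need a workbook rather than a single CSV.

```python
def sheet_name(index: int) -> str:
    """Sheet name for the table at position `index` (hypothetical naming scheme)."""
    return f"Table {index + 1}"


def pdf_to_excel(pdf_path: str, xlsx_path: str) -> int:
    """Write each table found in the PDF to its own sheet; return the table count."""
    # Imported here so sheet_name() stays usable without Java or tabula-py installed.
    import pandas as pd
    import tabula

    # With multiple_tables=True, read_pdf returns a list of DataFrames.
    tables = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)
    if not tables:
        # Nothing was detected, so do not create an empty workbook.
        return 0

    with pd.ExcelWriter(xlsx_path, engine="openpyxl") as writer:
        for i, table in enumerate(tables):
            table.to_excel(writer, sheet_name=sheet_name(i), index=False)
    return len(tables)
```

Calling `pdf_to_excel("report.pdf", "report.xlsx")` (both file names are placeholders) would need Java, tabula-py, pandas, and openpyxl installed.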

[Fixed] How to use this tool

  1. After cloning, cd into backend and run app.py:
cd backend
python app.py
  2. In a separate terminal window, cd into frontend and run npm start:
cd frontend
npm start
  3. The webpage will load (likely at localhost:3000). Upload PDFs as required!
  4. Once finished, stop both servers by pressing Ctrl-C in each terminal.

How to use this tool [old]

  1. Add your PDF file to the repository. You can do this by dragging and dropping the file into the folder.

  2. Add an empty Excel file to the repository. You can do this by right-clicking on the file explorer and selecting New File. Name the file with the .xlsx extension.

  3. Run the Python script using the command

python pdf_to_excel.py

Things being worked through/considered

/pdf-parser-tool
├── /backend
│   ├── app.py                   # python script
│   ├── requirements.txt         # dependencies for python (e.g. tabula-py, pandas, etc.)
│   └── ...                      # other backend files
├── /frontend
│   ├── /public                  # public assets (index.html, favicon, etc.)
│   ├── /src
│   │   ├── /components          # react components (e.g., UploadForm, TableView)
│   │   ├── /hooks               # custom hooks (for API calls, etc.)
│   │   ├── /styles              # CSS
│   │   ├── App.tsx              # main react component
│   │   ├── index.tsx            # entry point for react
│   │   └── api.ts               # API functions to interact w/ backend
│   ├── package.json             # dependencies for frontend (react, typescript, etc.)
│   └── tsconfig.json            # typescript config
└── README.md                    # project docs

Possible Requirements

(Will be workshopped -- consider a requirements.txt)

  • pip install tabula-py
  • pip install JPype1
  • Install 64-bit Java from https://www.java.com/en/download/manual.jsp
  • Set the JAVA_HOME environment variable to the Java install folder, e.g. "C:\Program Files (x86)\Java\jre1.8.0_421"
  • Add Java's bin folder to your PATH (%JAVA_HOME%\bin)
  • pip install openpyxl
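Since tabula-py depends on a working Java install, a quick check like the one below can confirm the setup before running the tool. This is a small sketch using only the standard library. The helper name `java_status` is hypothetical and not part of this repository.

```python
import os
import shutil


def java_status(environ=None, which=shutil.which):
    """Report whether a java executable is on PATH and what JAVA_HOME is set to."""
    environ = os.environ if environ is None else environ
    return {
        "java_on_path": which("java") is not None,
        "java_home": environ.get("JAVA_HOME"),  # None if the variable is unset
    }


if __name__ == "__main__":
    print(java_status())
```

If `java_on_path` is False or `java_home` is None, revisit the Java steps above before installing the Python packages.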

Directory

Install tree (macOS example shown):

brew install tree

Run tree in the terminal to generate the directory listing and copy it to the clipboard:

tree -I 'node_modules|.git' --dirsfirst | pbcopy

Deployment History

  • 9/12: Deploy PDF Script

Credits

  • Front-end Design | [Name], [Name]
  • Back-end Design | [Name], [Name]

Thank you to [credit any inspiration, open source code, or advisors] for [X].

Powered by The Vanderbilt Hustler Data Team

For questions, comments or curiosities:

  • Hustler staff: Slack the #data team.
  • The rest of the 🌎: email Data Editor Katherine Oung
