Status: [In-Progress]
Published: [date here]
Updated: [9/22]
This Python script uses the tabula-py library to read a PDF, load its tables into pandas DataFrames, and export them as a CSV. For PDFs with multiple tables, the script outputs each table as a separate sheet.
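The flow described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual script: the function names `sheet_names` and `pdf_to_excel` and the file names in the usage comment are hypothetical, and it assumes the output is an .xlsx workbook, since separate sheets need a workbook rather than a single CSV.

```python
def sheet_names(table_count, prefix="Table"):
    """Return one worksheet name per extracted table: Table 1, Table 2, ..."""
    return [f"{prefix} {i}" for i in range(1, table_count + 1)]


def pdf_to_excel(pdf_path, xlsx_path):
    """Read every table in a PDF and write each one to its own worksheet."""
    # Imported lazily so the helper above works without Java or these packages.
    import pandas as pd
    import tabula

    # pages="all" scans the whole document; multiple_tables=True
    # returns a list with one DataFrame per detected table.
    tables = tabula.read_pdf(str(pdf_path), pages="all", multiple_tables=True)
    if not tables:
        raise ValueError(f"No tables found in {pdf_path}")

    # openpyxl is the engine pandas uses to write .xlsx files.
    with pd.ExcelWriter(str(xlsx_path), engine="openpyxl") as writer:
        for name, table in zip(sheet_names(len(tables)), tables):
            table.to_excel(writer, sheet_name=name, index=False)
    return len(tables)


# Hypothetical usage (requires Java, tabula-py, pandas and openpyxl):
# pdf_to_excel("example.pdf", "output.xlsx")
```

Table detection quality depends on the PDF, so the number of sheets may not match the number of tables a reader sees on the page.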
- After cloning, cd into backend and run app.py
cd backend
python app.py
- Now, in a separate terminal window, cd into frontend and enter npm start
cd frontend
npm start
- The webpage will load (likely at localhost:3000). Upload PDFs as needed!
- Once finished, you can stop both servers by pressing Ctrl+C in each terminal.
- Add your PDF file to the repository. You can do this by dragging and dropping the file into the folder.
- Add an empty Excel file to the repository. You can do this by right-clicking on the file explorer and selecting New File. Name the file with the .xlsx extension.
- Run the Python script using the command
python pdf_to_excel.py
- Example PDFs
- React with TypeScript, perhaps with Flask for the Python backend
- Will assign components; however, the frontend is the focus for now
/pdf-parser-tool
├── /backend
│ ├── app.py # python script
│ ├── requirements.txt # dependencies for python (e.g. tabula-py, pandas, etc.)
│ └── ... # other backend files
├── /frontend
│ ├── /public # public assets (index.html, favicon, etc.)
│ ├── /src
│ │ ├── /components # react components (e.g., UploadForm, TableView)
│ │ ├── /hooks # custom hooks (for API calls, etc.)
│ │ ├── /styles # CSS
│ │ ├── App.tsx # main react component
│ │ ├── index.tsx # entry point for react
│ │ └── api.ts # API functions to interact w/ backend
│ ├── package.json # dependencies for frontend (react, typescript, etc.)
│ └── tsconfig.json # typescript config
└── README.md # project docs
- Doesn't need to store PDFs, but may need to turn Excel sheets into Google Sheets
- Google Sheets API https://developers.google.com/sheets/api/guides/concepts
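If the Excel output does need to land in Google Sheets, one option is the `spreadsheets.values.update` method of the Sheets API v4. The sketch below only builds the request URL and JSON body for that call; the function name `build_values_update` and the sample IDs are hypothetical, and sending the request would additionally require OAuth 2.0 credentials, which are not shown.

```python
import json
from urllib.parse import quote

SHEETS_BASE = "https://sheets.googleapis.com/v4/spreadsheets"


def build_values_update(spreadsheet_id, sheet_range, rows):
    """Build the URL and JSON body for a spreadsheets.values.update (PUT) call."""
    url = (
        f"{SHEETS_BASE}/{spreadsheet_id}/values/"
        f"{quote(sheet_range, safe='')}?valueInputOption=RAW"
    )
    body = {
        "range": sheet_range,
        "majorDimension": "ROWS",
        # None is replaced with an empty string in this sketch so that
        # missing cells are written as blanks.
        "values": [["" if cell is None else cell for cell in row] for row in rows],
    }
    return url, json.dumps(body)


# Hypothetical usage: rows could come from DataFrame.values.tolist()
# url, body = build_values_update("your-spreadsheet-id", "Table 1!A1", rows)
```

Keeping the payload construction separate from the HTTP call makes it easy to check the data shape before any credentials are involved.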
(Will be workshopped -- consider a requirements.txt)
- pip install tabula-py
- pip install JPype1
- Install 64-bit Java from https://www.java.com/en/download/manual.jsp
- Add it to your environment variables
- Add it to your path (%JAVA_HOME%\bin)
- e.g. "(C:\Program Files (x86)\Java\jre1.8.0_421)"
- pip install openpyxl
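Since tabula-py depends on a working Java install, a quick check like the one below can help confirm the environment setup before running the script. It is a rough sketch using only the standard library; the function name `check_java` is hypothetical, and it only tests whether `JAVA_HOME` points at a directory and whether a `java` executable is on the PATH, not which Java version is installed.

```python
import os
import shutil


def check_java(env=None, which=shutil.which):
    """Report whether Java looks reachable from this environment."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    return {
        "java_home": java_home,
        # True only if JAVA_HOME is set and points at an existing directory.
        "java_home_exists": bool(java_home) and os.path.isdir(java_home),
        # True if a `java` executable can be found on the PATH.
        "java_on_path": which("java") is not None,
    }


# Usage: print(check_java())
```

If `java_on_path` is false after installing Java, the PATH entry is the most likely thing to revisit.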
Install tree (macOS example shown):
brew install tree
Use the tree command in a terminal to generate the directory structure and copy it to the clipboard:
tree -I 'node_modules|.git' --dirsfirst | pbcopy
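On systems where `tree` or `pbcopy` is not available, a similar listing can be approximated in Python. This is only a sketch that mirrors the command above (ignoring node_modules and .git, directories first); the name `render_tree` is hypothetical and the output is not guaranteed to match `tree` character for character.

```python
from pathlib import Path

IGNORED = {"node_modules", ".git"}


def render_tree(root, prefix=""):
    """Return tree-style lines for the contents of root, directories first."""
    entries = sorted(
        (p for p in Path(root).iterdir() if p.name not in IGNORED),
        # Directories sort before files, then case-insensitive by name.
        key=lambda p: (p.is_file(), p.name.lower()),
    )
    lines = []
    for i, entry in enumerate(entries):
        last = i == len(entries) - 1
        lines.append(f"{prefix}{'└── ' if last else '├── '}{entry.name}")
        if entry.is_dir():
            # Indent children; keep the vertical bar unless this was the last entry.
            lines.extend(render_tree(entry, prefix + ("    " if last else "│   ")))
    return lines


# Usage: print("\n".join(render_tree(".")))
```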
- 9/12: Deploy PDF Script
- Front-end Design | [Name], [Name]
- Back-end Design | [Name], [Name]
Thank you to [credit any inspiration, open source code, or advisors] for [X].
For questions, comments or curiosities:
- Hustler staff: Slack the #data team.
- The rest of the 🌎: email Data Editor Katherine Oung