This Python project scrapes raw PDF data containing MHT CET college and branch cutoffs, extracts the relevant information, and creates a JSON file. Additionally, it generates a "skipped" folder with pageNo.txt files for lines that couldn't be understood and are excluded from the JSON data. The final output is an Excel file (output.xlsx) containing organized cutoff data.
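The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the line pattern `CUTOFF_LINE`, the record fields, and the helper names `iter_page_lines`, `parse_page`, and `write_skipped` are all assumptions, and the real PDF layout will need its own pattern. Only `PdfReader`, `reader.pages`, and `page.extract_text()` are real pypdf calls.

```python
import re
from pathlib import Path

# Hypothetical line format: "<category> <rank> (<percentile>)", e.g. "GOPENS 1234 (98.76)".
# The real cutoff PDF may be laid out differently; adjust the pattern to match it.
CUTOFF_LINE = re.compile(
    r"^(?P<category>[A-Z]+)\s+(?P<rank>\d+)\s+\((?P<percentile>\d+(?:\.\d+)?)\)$"
)


def iter_page_lines(pdf_path):
    """Yield (page_no, lines) for each page, using pypdf's text extraction."""
    from pypdf import PdfReader  # imported here so the helpers below work without pypdf

    reader = PdfReader(pdf_path)
    for page_no, page in enumerate(reader.pages, start=1):
        yield page_no, (page.extract_text() or "").splitlines()


def parse_page(lines):
    """Split one page's lines into parsed records and unparsed (skipped) lines."""
    records, skipped = [], []
    for line in lines:
        text = line.strip()
        match = CUTOFF_LINE.match(text)
        if match:
            records.append({
                "category": match["category"],
                "rank": int(match["rank"]),
                "percentile": float(match["percentile"]),
            })
        elif text:  # blank lines are ignored rather than reported as skipped
            skipped.append(text)
    return records, skipped


def write_skipped(page_no, skipped, folder="skipped"):
    """Write a page's unparsed lines to <folder>/<page_no>.txt; return None if there are none."""
    if not skipped:
        return None
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{page_no}.txt"
    path.write_text("\n".join(skipped) + "\n", encoding="utf-8")
    return path
```

Keeping the skipped lines in one file per page makes it easier to open the PDF at the same page and see which layout the pattern failed to match.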
- Run main.py.
- Provide the path to the MHT CET cutoff PDF file.
- This will create a data.json file.
- Next, run DataMigrater.py to create the final Excel file (output.xlsx).
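As a rough idea of what the JSON-to-Excel step involves, the sketch below flattens a nested structure into rows and writes them with openpyxl. The shape of data.json assumed here (college, then branch, then category mapped to a cutoff value) is a guess for illustration, and `flatten_cutoffs`, `write_workbook`, and `migrate` are hypothetical names, not functions from DataMigrater.py.

```python
import json


def flatten_cutoffs(data):
    """Turn nested college -> branch -> category cutoffs into flat spreadsheet rows."""
    rows = [("College", "Branch", "Category", "Cutoff")]
    for college, branches in data.items():
        for branch, cutoffs in branches.items():
            for category, cutoff in cutoffs.items():
                rows.append((college, branch, category, cutoff))
    return rows


def write_workbook(rows, xlsx_path="output.xlsx"):
    """Write the rows to a single worksheet with openpyxl."""
    from openpyxl import Workbook  # imported here so flatten_cutoffs works without openpyxl

    workbook = Workbook()
    sheet = workbook.active
    sheet.title = "Cutoffs"
    for row in rows:
        sheet.append(row)
    workbook.save(xlsx_path)


def migrate(json_path="data.json", xlsx_path="output.xlsx"):
    """Read the JSON file (assumed nested layout) and write the Excel file."""
    with open(json_path, encoding="utf-8") as handle:
        data = json.load(handle)
    write_workbook(flatten_cutoffs(data), xlsx_path)
```

One row per college, branch, and category keeps the sheet easy to filter and sort in Excel; if the actual data.json uses a different layout, only `flatten_cutoffs` would need to change.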
Linux (Debian/Ubuntu):
sudo apt-get update
sudo apt-get install python3-pip
pip3 install pypdf openpyxl
Windows:
- Install Python 3.x from the official website: Python Downloads.
- Open a command prompt (cmd) or PowerShell.
- Run the following command:
pip install pypdf openpyxl
macOS:
- Install Python 3.x (if not already installed) using Homebrew or the official website.
- Open Terminal.
- Run the following command:
pip3 install pypdf openpyxl
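On any platform, a quick way to confirm that both dependencies are visible to the Python interpreter you plan to use is a small check like the one below. It is only a convenience sketch; `missing_packages` is a hypothetical helper and is not part of this repository.

```python
import importlib.util


def missing_packages(names=("pypdf", "openpyxl")):
    """Return the names from `names` that cannot be imported in this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("pypdf and openpyxl are installed.")
```

If a package is reported missing even after installation, pip may have installed it for a different Python version than the one running the script.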
Feel free to contribute or report issues on GitHub!
The out folder in this repository contains the following files:
- Sample PDF (2023 CET CAP Round 1 Cut-off): The raw PDF file containing MHT CET college and branch cutoffs for the 2023 CAP Round 1. This is the input file that the Python program processes.
- Final Output (output.xlsx): After main.py has extracted the data and DataMigrater.py has been run, the program generates an Excel file named output.xlsx. This file contains organized and structured cutoff data for colleges and branches.