Tablify is a Python-based tool that converts tabular data from images into CSV files using Optical Character Recognition (OCR). It processes images, extracts the text using pytesseract, and organizes it into rows and columns for easy data extraction and analysis.
- Converts images of tables into structured CSV files.
- Uses pytesseract to perform OCR on images.
- Processes images to detect individual text blocks, sort them by coordinates, and group them into rows.
- Clone the repository:
    git clone https://github.com/Preetraj2002/Tablify.git
    cd Tablify
- Install required dependencies:
  Make sure you have Python 3.x installed. Then, install the required libraries:
    pip install -r requirements.txt
- Install Tesseract OCR:
  - Windows: Download the Tesseract installer, run it, and add the Tesseract installation directory to your system PATH environment variable.
  - Linux (Debian/Ubuntu): Install Tesseract using:
      sudo apt install tesseract-ocr
  - macOS: Use Homebrew to install Tesseract:
      brew install tesseract
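Because pytesseract only wraps the Tesseract executable, a quick check that the binary is reachable can save some debugging later. The snippet below is a minimal sketch using only the standard library; the helper name `find_tesseract` is hypothetical and not part of Tablify.

```python
import shutil


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH; check the installation step for your OS.")
else:
    print(f"tesseract found at: {path}")
```

If the binary is installed but not on PATH, pytesseract also allows pointing at it explicitly by assigning the executable's location to `pytesseract.pytesseract.tesseract_cmd` before running OCR.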
- Prepare an Image: Ensure the image contains the tabular data that you want to extract. The tool works best with clear, well-contrasted images.
- Run the Script: After setting up, run the script on your image:
    python tablify.py path/to/your/image.jpg
  This generates an output.csv file in the same directory.
- Check the Output: Open output.csv to see the extracted table data.
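Besides opening the file by hand, a short script can flag rows where OCR merged or dropped a cell. This is a hedged sketch: the helper name `summarize_table` is hypothetical, and it assumes output.csv is UTF-8 encoded and comma-separated, which the project does not state explicitly.

```python
import csv
from collections import Counter


def summarize_table(csv_path="output.csv"):
    """Return (row_count, Counter mapping cells-per-row -> number of rows)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    # Count how many rows have each width; a clean table has a single width.
    widths = Counter(len(row) for row in rows)
    return len(rows), widths
```

Calling `summarize_table()` from the directory that contains output.csv returns the row count and the width counter. More than one key in the counter means some rows have a different number of cells than others, which is usually a sign that a text block was missed or two blocks were merged.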
- Image Preprocessing: The image is converted to grayscale, and binary thresholding is applied to make the text clearer for OCR.
- Contour Detection: Using OpenCV, contours of the text blocks are identified so that the text can be grouped into rows and columns.
- Text Extraction: Each text block is processed with pytesseract to extract its text, which is then organized into a structured CSV format.
- CSV Generation: The extracted text is organized into rows based on vertical alignment and saved as a CSV file.
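The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual contents of tablify.py: the function names, the Otsu threshold, the dilation kernel size, the `--psm 7` setting, and the 10-pixel row tolerance are all choices made for this example, and the sketch assumes a table without heavy ruling lines.

```python
import csv


def extract_blocks(image_path):
    """Return (x, y, w, h, text) tuples for the text blocks found in an image."""
    # Imported here so the pure-Python helpers below work without OpenCV installed.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Inverted Otsu threshold: dark text becomes white on a black background.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Assumption: dilate horizontally so the characters of one cell merge into one blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    dilated = cv2.dilate(binary, kernel, iterations=1)

    # [-2] selects the contour list on both OpenCV 3.x and 4.x return signatures.
    contours = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]

    blocks = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        roi = gray[y:y + h, x:x + w]
        # --psm 7 asks Tesseract to treat the crop as a single line of text.
        text = pytesseract.image_to_string(roi, config="--psm 7").strip()
        if text:
            blocks.append((x, y, w, h, text))
    return blocks


def group_into_rows(blocks, tolerance=10):
    """Group blocks into rows by vertical center, then order each row left to right."""
    rows = []
    for block in sorted(blocks, key=lambda b: b[1] + b[3] / 2):
        center = block[1] + block[3] / 2
        # Compare against the center of the first block that opened the current row.
        if rows and abs(center - rows[-1]["center"]) <= tolerance:
            rows[-1]["blocks"].append(block)
        else:
            rows.append({"center": center, "blocks": [block]})
    return [[b[4] for b in sorted(row["blocks"], key=lambda b: b[0])] for row in rows]


def write_csv(rows, csv_path="output.csv"):
    """Write a list of rows (lists of strings) to a CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def tablify_sketch(image_path, csv_path="output.csv"):
    """Run the whole sketch: detect blocks, OCR them, group into rows, save as CSV."""
    write_csv(group_into_rows(extract_blocks(image_path)), csv_path)
```

With OpenCV and pytesseract installed, `tablify_sketch("path/to/your/image.jpg")` would run the full sequence. The row tolerance is the parameter most likely to need tuning: too small and a slightly skewed row splits in two, too large and adjacent rows merge.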
This project is licensed under the MIT License.