WaunBroderick/Batch-OCR-Engine

Batch-OCR-Engine

Created by Waun Broderick

Written in Python 3.6.6

Batch-OCR-Engine uses OCR (Optical Character Recognition) to read all the words in a document, then uses NLP (Natural Language Processing) to pick out the information needed for downstream actions. It is currently being used to automate Request For Information processes, in which TD Bank sends detailed information to government agencies. This is the first use case for Batch-OCR-Engine, but it could potentially be used wherever a document needs to be read as part of a business process.

Possible future applications of Batch-OCR-Engine include reading return mail addresses, automatically creating summaries from credit applications, and checking the accuracy and details of uploaded documents. The tool has a simple, intuitive user interface suited to users with a range of digital literacy.

The first step is to read clean, high-quality text data from a document using OCR, which relies on Python libraries such as OpenCV and Tesseract. The tool corrects for skew, brightness, sharpness, contrast, and color levels. Images that do not meet internal quality-control requirements are rechecked and modified using hyperparameter tuning. The tool can understand 22 languages, from English to Finnish, and is automatically trained on new fonts by testing a limited amount of data. Finally, the Natural Language Toolkit (NLTK) and regular expressions are used to pull information into a structured, reusable format. This information can then be used in the next step of the process.
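The final extraction step described above can be illustrated with a minimal sketch. It assumes the OCR stage has already produced plain text; the field names, patterns, and the extract_fields helper are hypothetical and are not taken from the tool's source.

```python
import re

# Hypothetical field patterns for illustration only; the patterns the
# tool actually uses are not documented in this README.
FIELD_PATTERNS = {
    "reference_number": re.compile(r"Reference\s+No\.?\s*:\s*([A-Z0-9-]+)"),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "postal_code": re.compile(r"\b([A-Z]\d[A-Z]\s?\d[A-Z]\d)\b"),
}


def extract_fields(ocr_text):
    """Map each field name to its first match in the text, or None."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        record[name] = match.group(1) if match else None
    return record


# Made-up sample text standing in for OCR output.
sample = "Reference No: RFI-2018-0042\nDate received 2018-07-15\nToronto ON M5K 1A2"
record = extract_fields(sample)
```

Returning a plain dictionary keeps the result easy to write out with the csv module listed under Dependencies, and a missing field is reported as None rather than raising an error.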

Batch-OCR-Engine has met its short-term objective of demonstrating the value of an AI-based document reader. It will incorporate more advanced and varied requirements from different processes to become more effective in the future.

Dependencies

Graphical Components

  • Tkinter

Conversions & Computer Vision

  • Tesseract 4.0
  • Pillow 1.1.7
  • wand 0.4.4
  • PyPDF2 1.26.0
  • scipy 1.1.0
  • numpy 1.13.3
  • GhostScript 9.23 (32bit)
  • OpenCV 3.4.1

Deriving Understanding

  • nltk 3.3
  • csv 1.0
  • re 2.2.1

Algorithm Building

  • Threading
  • pooling
  • queueing
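This README does not show how threading, pooling, and queueing are combined, so the following is only a sketch of one common arrangement: a fixed pool of worker threads draining a shared queue of document paths. The process_batch function and the worker_fn callback are hypothetical names, and worker_fn stands in for whatever per-document OCR routine is used.

```python
import queue
import threading


def process_batch(paths, worker_fn, num_workers=4):
    """Run worker_fn on every path using a small pool of threads.

    Returns a dict mapping each path to worker_fn's result, or to the
    exception it raised, so one bad document does not stop the batch.
    """
    tasks = queue.Queue()
    for path in paths:
        tasks.put(path)

    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                path = tasks.get_nowait()
            except queue.Empty:
                return  # No work left for this thread.
            try:
                outcome = worker_fn(path)
            except Exception as exc:
                outcome = exc
            with results_lock:
                results[path] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Usage with a stand-in worker; a real worker would run OCR on the file.
batch_results = process_batch(["a.pdf", "b.pdf", "c.pdf"], str.upper, num_workers=2)
```

All paths are queued before the workers start, so an empty queue reliably means the batch is finished and each thread can exit.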

System Utilities

  • time
  • os
  • io
  • shutil
  • sys
  • math

Interaction Diagram

UML Diagram

About

Intelligent Processor of Documents
