zen_document_parser is a utility for extracting data from various official documents. It uses PDFQuery behind the scenes.
Currently, there is out-of-the-box support for parsing Indian Government ITR-V PDF documents.
The library also supports parsing arbitrary PDF documents by letting you specify a 'schema' for the document. It allows for multiple 'variants' of a document. For example, the Indian ITR-V document has slightly different fields and layout depending on whether it was generated in 2013, 2014, 2015, etc.
Check out the examples below.
Install using pip like so:
$ pip install zen_document_parser
from zen_document_parser.itr.itr import ITRVDocument
# You can pass in a path or a file-like object during instantiation.
doc = ITRVDocument('/path/to/itrv.pdf')
# Will load the file, auto-detect the variant and perform extraction of all
# fields and store results internally.
doc.extract()
# Extracted fields are available in the `data` property.
print(doc.data.company_name)
print(doc.data.gross_total_income)
To parse your own document type, you basically follow these steps:
- Define one or more 'schemas', i.e. `DocVariant` subclasses, to go with each variant of the doc.
  - In each of these variants, define a `check_for_match()` method that returns `True` if the file was successfully parsed as that variant.
  - Make sure to define `test_fields` as an attribute on each class that is a list of all field names used inside `check_for_match()`. (This is required at present for optimization purposes, but will not be a requirement in an upcoming version.)
- Define a `Document` subclass that represents your document. In the `variants` attribute, specify possible variants.
from zen_document_parser.base import DocField, DocVariant, Document

class Variant1(DocVariant):
    # The fields that are used inside `check_for_match()`. (for optimization)
    test_fields = ['form_title']

    form_title = DocField((30, 300, 500, 380))
    name = DocField((100, 120, 400, 140.5))
    address = DocField((150, 90, 650, 110))

    def check_for_match(self):
        if self.form_title == 'Application Form For 2014':
            return True
        return False

class Variant2(DocVariant):
    test_fields = ['form_title']

    form_title = DocField((30, 290, 500, 380))
    name = DocField((70, 140, 350, 160))
    address = DocField((150, 120, 650, 140))
    pan_no = DocField((150, 80, 650, 100))

    def check_for_match(self):
        if self.form_title == 'Application Form For 2015-16':
            return True
        return False

class MyForm(Document):
    variants = [Variant1, Variant2]

def main():
    doc = MyForm('/path/to/form.pdf')
    doc.extract()
    print(doc.data.to_dict())
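
To make the variant auto-detection step a bit more concrete, here is a rough, self-contained sketch of how such a selection loop could work. It does not use zen_document_parser at all and is not the library's actual implementation: `FakeVariant`, `Form2014`, `Form2015`, `load()` and `detect_variant()` are hypothetical stand-ins, and a plain dict plays the role of the PDF. The idea it illustrates is that only the `test_fields` need to be loaded before `check_for_match()` is called on each candidate.

```python
class FakeVariant:
    """Hypothetical stand-in for a DocVariant (not the library's class)."""
    test_fields = []

    def __init__(self, raw_fields):
        # raw_fields simulates the text that would be pulled out of the PDF.
        self._raw = raw_fields
        self.loaded = {}

    def load(self, names):
        # Pretend to extract only the requested fields.
        for name in names:
            self.loaded[name] = self._raw.get(name)

    def check_for_match(self):
        raise NotImplementedError


class Form2014(FakeVariant):
    test_fields = ['form_title']

    def check_for_match(self):
        return self.loaded.get('form_title') == 'Application Form For 2014'


class Form2015(FakeVariant):
    test_fields = ['form_title']

    def check_for_match(self):
        return self.loaded.get('form_title') == 'Application Form For 2015-16'


def detect_variant(variant_classes, raw_fields):
    """Return the first variant whose check_for_match() passes, else None."""
    for cls in variant_classes:
        candidate = cls(raw_fields)
        candidate.load(cls.test_fields)  # cheap: only the test fields
        if candidate.check_for_match():
            return candidate
    return None


page_text = {'form_title': 'Application Form For 2015-16', 'name': 'A. Sample'}
match = detect_variant([Form2014, Form2015], page_text)
print(type(match).__name__)  # Form2015
```

Under this reading, listing a field in `test_fields` is what keeps detection cheap: the remaining fields only need to be extracted once a variant has matched.
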
- Handle data-type specification.
- Handle fields being mandatory/non-mandatory.
- Right now the user has to explicitly specify `test_fields` for optimization purposes. Find a way where this isn't needed.
  - Automatically load them the first time they're referred to? `extract()` can still be there as a way to bulk-load all fields in one go.
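
One possible way to approach the last item is a Python descriptor that extracts a field the first time it is read and caches the result. The sketch below is only an illustration of that idea, not something the library provides: `LazyField`, `LazyVariant`, the `reader` callback and the `reads` log are all hypothetical names, and a dict stands in for the PDF.

```python
class LazyField:
    """Hypothetical descriptor: extracts a field on first access, then caches it."""

    def __init__(self, bbox):
        self.bbox = bbox

    def __set_name__(self, owner, name):
        # Remember the attribute name this field was assigned to.
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        cache = instance._cache
        if self.name not in cache:
            cache[self.name] = instance._read(self.name, self.bbox)
        return cache[self.name]


class LazyVariant:
    form_title = LazyField((30, 300, 500, 380))
    name = LazyField((100, 120, 400, 140.5))

    def __init__(self, reader):
        # reader(name, bbox) -> str simulates the real PDF extraction call.
        self._reader = reader
        self._cache = {}
        self.reads = []  # log of fields actually extracted, for illustration

    def _read(self, name, bbox):
        self.reads.append(name)
        return self._reader(name, bbox)

    def check_for_match(self):
        # Touching form_title triggers its extraction; nothing else is loaded.
        return self.form_title == 'Application Form For 2014'

    def extract(self):
        # Bulk-load every declared field in one go.
        for attr, value in vars(type(self)).items():
            if isinstance(value, LazyField):
                getattr(self, attr)
        return dict(self._cache)


fake_pdf = {'form_title': 'Application Form For 2014', 'name': 'A. Sample'}
variant = LazyVariant(lambda name, bbox: fake_pdf[name])
print(variant.check_for_match())  # True
print(variant.reads)              # ['form_title']
variant.extract()
print(variant.reads)              # ['form_title', 'name']
```

With this shape, `check_for_match()` pulls in only the fields it touches, so an explicit `test_fields` list would no longer be needed, while `extract()` remains available for bulk loading.
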