This is a fully functioning sample repository showing how to:
- use an external OCR provider (in this case, Amazon Textract);
- upload the resulting PDFs to tagtog.
The code is written in Java (11).
This code builds on an Amazon Textract tutorial (original code) that OCRs input files (PDFs or images) and converts them into "searchable PDFs" (i.e., PDFs with embedded text). These searchable PDFs are exactly what we want to upload to tagtog, to then annotate them using tagtog Native PDF.
This repository adds further utilities (e.g., recursively traversing and processing the given directories) and uses the tagtog Documents API to upload the results to a given tagtog project. HTTP requests are made in Java with Apache HttpClient (4.5).
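As an illustration of the recursive traversal mentioned above, here is a minimal sketch that expands a mix of files and directories into the list of documents to OCR. The class name `InputCollector`, its methods, and the set of accepted extensions are assumptions made for this example; they are not taken from the repository's code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of the repository.
final class InputCollector {

    // Assumed extensions; adjust to whatever your OCR provider accepts.
    private static final Set<String> SUPPORTED = Set.of("pdf", "png", "jpg", "jpeg");

    // True if the file name ends with one of the assumed extensions (case-insensitive).
    static boolean isSupported(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        int dot = name.lastIndexOf('.');
        return dot >= 0 && SUPPORTED.contains(name.substring(dot + 1));
    }

    // Expands each input (a file or a directory) into its supported regular files.
    // Directories are walked recursively; a plain file yields itself if supported.
    static List<Path> collect(List<Path> inputs) {
        List<Path> result = new ArrayList<>();
        for (Path input : inputs) {
            try (Stream<Path> walk = Files.walk(input)) {
                result.addAll(walk.filter(Files::isRegularFile)
                                  .filter(InputCollector::isSupported)
                                  .sorted()
                                  .collect(Collectors.toList()));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return result;
    }
}
```

For example, `InputCollector.collect(List.of(Path.of("scans")))` would return every PDF or image found under a `scans` directory, in sorted order.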
The main entry point is DemoTagtogOcr.java. The code has three main ingredients:
- Calling the Amazon Textract API
- Translating the JSON output from Amazon Textract into a "searchable PDF" (with Java PDFBox)
- Calling the tagtog API to upload documents
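The second ingredient involves a coordinate conversion that is easy to get wrong. Amazon Textract reports each bounding box as ratios of the page size, measured from the top-left corner, whereas PDF user space is measured in points from the bottom-left corner by default. The sketch below shows that conversion in isolation; `BoxToPdf`, `Rect`, and `toPdfRect` are hypothetical names for this example and do not reflect how the repository implements it.

```java
// Hypothetical helper for illustration; not taken from the repository.
final class BoxToPdf {

    // A rectangle in PDF user space: points, origin at the bottom-left corner.
    static final class Rect {
        final double x;
        final double y;
        final double width;
        final double height;

        Rect(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // Converts a normalized box (left, top, width, height as ratios in [0, 1],
    // origin at the top-left) into PDF points for a page of the given size.
    static Rect toPdfRect(double left, double top, double width, double height,
                          double pageWidth, double pageHeight) {
        double x = left * pageWidth;
        double h = height * pageHeight;
        // Flip the vertical axis: the box's bottom edge becomes its PDF y coordinate.
        double y = pageHeight - top * pageHeight - h;
        return new Rect(x, y, width * pageWidth, h);
    }
}
```

On a US Letter page (612 x 792 points), a box with left 0.25, top 0.5, width 0.5, and height 0.125 maps to x = 153, y = 297, width = 306, height = 99.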
git clone https://github.com/tagtog/java-ocr-amazon-textract-searchable-pdf.git
cd java-ocr-amazon-textract-searchable-pdf/src/SearchablePDF/
./compile.sh
# Set your tagtog credentials
export TAGTOG_USERNAME=???
export TAGTOG_PASSWORD=???
# export TAGTOG_DOMAIN=??? # optionally, override the tagtog domain, for example if you are running tagtog OnPremises
time ./run.sh MY_TAGTOG_OWNERNAME MY_TAGTOG_PROJECT MY_TAGTOG_FOLDER ...inputFilesOrDirectories
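To make the role of the environment variables and command-line arguments above more concrete, the following sketch shows how they might be turned into the pieces of an upload request. Everything here is an assumption to verify against the tagtog API documentation: the default domain, the `/-api/documents/v1` path, the `owner`, `project`, and `folder` query parameters, and the use of HTTP Basic authentication. The class `TagtogRequestParts` is hypothetical, and the multipart file body (which the repository sends with Apache HttpClient) is left out.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration; not part of the repository.
final class TagtogRequestParts {

    // Assumed default domain; TAGTOG_DOMAIN overrides it (e.g. for tagtog OnPremises).
    static final String DEFAULT_DOMAIN = "https://www.tagtog.com";

    // Returns the environment variable's value, or the fallback if unset or empty.
    static String envOrDefault(String name, String fallback) {
        String value = System.getenv(name);
        return (value == null || value.isEmpty()) ? fallback : value;
    }

    // Builds the upload URI. The path and parameter names are assumptions.
    static URI uploadUri(String domain, String owner, String project, String folder) {
        return URI.create(domain + "/-api/documents/v1"
                + "?owner=" + encode(owner)
                + "&project=" + encode(project)
                + "&folder=" + encode(folder));
    }

    // Builds the value of an HTTP Basic "Authorization" header.
    static String basicAuth(String username, String password) {
        byte[] raw = (username + ":" + password).getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

A caller could combine these as `uploadUri(envOrDefault("TAGTOG_DOMAIN", DEFAULT_DOMAIN), owner, project, folder)` and `basicAuth(System.getenv("TAGTOG_USERNAME"), System.getenv("TAGTOG_PASSWORD"))`, then attach the searchable PDF as the request body with the HTTP client of choice.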
If you are new to AWS or unsure about the details, this is the complete AWS guide to get started with Amazon Textract.
In short, you need to:
- Make sure you have an IAM user with AmazonTextractFullAccess permissions and an access key.
- Configure your local AWS credentials, with the [default] profile pointing to that IAM user, and set your desired region.
Using this very same code, we OCR'ed the FUNSD dataset and uploaded the results to the public tagtog project tagtog/FUNSD-OCRed.
These are the exact commands we ran (last update on 2021-04-20):
time ./run.sh tagtog FUNSD-OCRed testing_data ~/Downloads/dataset/testing_data/ # took ~2m; 50 docs in total
time ./run.sh tagtog FUNSD-OCRed training_data ~/Downloads/dataset/training_data/ # took ~6m; 149 docs in total
These are some sample annotated documents in tagtog.
The original demo code tends to create oversized PDFs and to write the embedded character offsets slightly below their actual (visual) positions. These details can be tweaked and, of course, depend on the OCR software used.