1. Using Tika and Tesseract Outside of Solr
Extracting content from file formats using Tika as a standalone service is the traditional approach, and it is what this demo project is built around. You can try it out online at http://pdf-discovery-demo.dev.o19s.com:8080/.
Follow the Quickstart in the README to run this locally. To see all the steps, follow the Text Extraction instructions. In a nutshell, there is an extraction script that calls Tika to extract the information we need from a PDF.
A couple of things that are interesting:
- It's super simple to swap between Tika the CLI app and Tika the server process. The nice thing about using `tika-app.jar` is that all of your parsing dependencies are packaged up into one 78 MB file, which is very easy to include in your project. However, if you are going for scale, then you might want to run a cluster of tika-server processes with a load balancer in front, and swap to making a `curl` request against a deployed tika-server.
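
  A rough sketch of the swap (assuming Tika 1.23 jars and a tika-server listening on the default port 9998):

  ```sh
  # CLI app: every parsing dependency lives inside the one jar.
  java -jar tika-app-1.23.jar --text ./path/mypdf.pdf

  # Server: start tika-server once (or a load-balanced cluster of them)...
  java -jar tika-server-1.23.jar

  # ...and extraction becomes an HTTP call instead of a JVM launch.
  curl -T ./path/mypdf.pdf http://localhost:9998/tika
  ```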
- Deploying Tika server in a dockerized world is super simple: https://github.com/o19s/pdf-discovery-demo/blob/master/docker-compose.yml#L47. However, I do wish the Apache Tika project had an official image that was released every time Tika was released. ;-)
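
  In the meantime, a community image does the job; a minimal sketch (assuming the community `logicalspark/docker-tikaserver` image):

  ```sh
  # Run a community tika-server image and check that it responds.
  docker run -d -p 9998:9998 logicalspark/docker-tikaserver
  curl http://localhost:9998/version
  ```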
- I'm very happy to report that in Tika 1.23 you can now configure the PDF and OCR parsers via a single `tika-config.xml` file (sketched below). In 1.22 and earlier, you needed to have a `./tika-properties/` directory on the filesystem that was included in the classpath. You can see an old commit where this was done: https://github.com/o19s/pdf-discovery-demo/tree/6f5b37305dd863a73af4617db64cbe853c5ecd2a/ocr/tika-properties/org/apache/tika/parser. It was awkward! Now you can use your `tika-config.xml` to set everything.
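
  A sketch of what that single file can look like; the parser classes are real, but the param names are my reading of the 1.23 setters, so verify them against the Tika documentation:

  ```xml
  <?xml version="1.0" encoding="UTF-8"?>
  <properties>
    <parsers>
      <!-- Let DefaultParser handle everything except the two we configure. -->
      <parser class="org.apache.tika.parser.DefaultParser">
        <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
        <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      </parser>
      <parser class="org.apache.tika.parser.pdf.PDFParser">
        <params>
          <param name="ocrStrategy" type="string">ocr_and_text_extraction</param>
        </params>
      </parser>
      <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
        <params>
          <param name="language" type="string">eng</param>
          <param name="outputType" type="string">hocr</param>
        </params>
      </parser>
    </parsers>
  </properties>
  ```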
- I learned about the magic header parameters that you can send to the Tika server to configure your parser. This is an alternative to both the properties-file configuration and the `tika-config.xml` configuration. It's cool, but it's also some more magic; for example, the parameter names don't follow any consistent pattern:

  ```sh
  curl -T ./path/mypdf.pdf http://pdf-discovery-demo.dev.o19s.com:9998/rmeta \
    --header "X-Tika-OCRLanguage: eng" \
    --header "X-Tika-PDFOcrStrategy: ocr_and_text_extraction" \
    --header "X-Tika-OCRoutputType: hocr"
  ```
- Parsing out the HOCR output looks daunting at first, until you realize you just care about the `<span class="ocrx_word">` tagged content. Check out https://github.com/o19s/pdf-discovery-demo/blob/master/ocr/extract.ps1#L63 to see both the HOCR pulled out of the XML and the raw text pulled out.
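
  If you just want to eyeball those spans outside of the extract script, something like this works (a sketch assuming a standalone `.hocr` file and `xmllint` on the path):

  ```sh
  # Pull just the OCR'd words (with their bbox coordinates in the
  # title attribute) out of the HOCR document.
  xmllint --html --xpath "//span[@class='ocrx_word']" page.hocr
  ```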
- You can store lots of different data in your payloads! We have the bounding box from HOCR, but we also store the page number in the payload, and base64 encode it all to store in Solr.
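
  As a sketch, the payload in the snippet in the next bullet decodes to `6 351 502 405 527`, which reads as a page number followed by bbox coordinates (my reading of the layout, not a documented format):

  ```sh
  # Token text plus its location data, delimited and base64 encoded.
  printf '6 351 502 405 527' | base64   # -> NiAzNTEgNTAyIDQwNSA1Mjc=
  echo "HELOCs|$(printf '6 351 502 405 527' | base64)"
  ```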
- We did a crazy thing to allow us to do traditional highlighting of snippets of text on our SERP page, but then link each snippet to the highlights in the PDF document, even though there was no explicit connection... We track the offset of our highlights and pass that along in the front end, in order to give the front end additional data to narrow down the payload highlighting. We did this via a custom formatter which injects additional data into the response:

  ```html
  <em data-num-tokens="1" data-score="1.0" data-end-offset="2110" data-start-offset="2079">HELOCs|NiAzNTEgNTAyIDQwNSA1Mjc=</em>
  ```

  Learn more by looking at the Solr Payload Component from https://github.com/o19s/payload-component and the Offset Highlighter Component from https://github.com/o19s/offset-hl-formatter.