2. Using Tika and Tesseract as an API exposed by Solr via ExtractingRequestHandler
Don't want to deploy a separate Tika server, but need Tika-server-like capabilities and already run Solr? This is the solution for you!
First, we figured out the magic incantation for configuring Tika from inside Solr: a parseContext.config parameter that points to a file in a specific XML format:
<entries>
<entry class="org.apache.tika.parser.pdf.PDFParserConfig" impl="org.apache.tika.parser.pdf.PDFParserConfig">
<property name="extractInlineImages" value="true"/>
<property name="ocrStrategy" value="OCR_AND_TEXT_EXTRACTION"/>
</entry>
<entry class="org.apache.tika.parser.ocr.TesseractOCRConfig" impl="org.apache.tika.parser.ocr.TesseractOCRConfig">
<property name="outputType" value="HOCR"/>
<property name="language" value="eng"/>
<property name="pageSegMode" value="1"/>
</entry>
</entries>
You might be tempted to think that this is the same file format as a tika-config.xml, and you'd be wrong ;-). While visually very similar, this file is loaded by ParseContextConfig, which is part of the Solr extraction contrib module. So yes, there are many different ways to specify configuration settings for PDF extraction and Tesseract OCR!
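Since the two formats are easy to confuse, a quick sanity check before deploying the file can help. The following is a minimal sketch, not part of Solr or Tika: the helper name summarize_parse_context is hypothetical, and it only checks that the document is well-formed and has an entries root (a tika-config.xml uses a properties root instead). It does not verify that Solr will accept the property names.

```python
import xml.etree.ElementTree as ET

# The parseContext configuration shown above, embedded for a self-contained check.
PARSE_CONTEXT_XML = """\
<entries>
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig" impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <property name="extractInlineImages" value="true"/>
    <property name="ocrStrategy" value="OCR_AND_TEXT_EXTRACTION"/>
  </entry>
  <entry class="org.apache.tika.parser.ocr.TesseractOCRConfig" impl="org.apache.tika.parser.ocr.TesseractOCRConfig">
    <property name="outputType" value="HOCR"/>
    <property name="language" value="eng"/>
    <property name="pageSegMode" value="1"/>
  </entry>
</entries>
"""


def summarize_parse_context(xml_text):
    """Return {impl_class: {property_name: value}} for each <entry> element."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    if root.tag != "entries":
        # A tika-config.xml would have a <properties> root and land here.
        raise ValueError(f"expected <entries> root, found <{root.tag}>")
    summary = {}
    for entry in root.findall("entry"):
        impl = entry.get("impl") or entry.get("class")
        if impl is None:
            raise ValueError("<entry> is missing both impl and class attributes")
        summary[impl] = {
            prop.get("name"): prop.get("value") for prop in entry.findall("property")
        }
    return summary


for impl, props in summarize_parse_context(PARSE_CONTEXT_XML).items():
    print(impl)
    for name, value in props.items():
        print(f"  {name} = {value}")
```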
We then tweaked the default /update/extract request handler to refer to parseContext.xml. We want any field that isn't already defined in the schema to be prefixed with attr_ so that it is picked up by a dynamic field. So if the field from Tika is Creator, it becomes a text field in Solr called attr_creator (the lowernames setting takes care of the lowercasing).
<requestHandler name="/update/extract"
class="solr.extraction.ExtractingRequestHandler" >
<str name="parseContext.config">parseContext.xml</str>
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">attr_</str>
<str name="multipartUploadLimitInKB">20480</str> <!-- Limit to 20 MB PDF -->
</lst>
</requestHandler>
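To make the effect of lowernames and uprefix concrete, here is a rough Python approximation of the renaming. It is a sketch of the behavior as we understand it, not Solr's actual code: the function name solr_field_name and the known_fields set (a stand-in for the fields declared in your schema) are hypothetical, and it ignores fmap mappings and dynamic field matching.

```python
def solr_field_name(tika_name, known_fields, uprefix="attr_", lowernames=True):
    """Approximate the field name Solr would index a Tika metadata field under."""
    name = tika_name
    if lowernames:
        # Lowercase letters and digits; replace everything else with "_".
        name = "".join(ch.lower() if ch.isalnum() else "_" for ch in name)
    if name not in known_fields:
        # Unknown fields get the prefix so a dynamic field can catch them.
        name = uprefix + name
    return name


# Assumed schema fields, for illustration only.
schema_fields = {"id", "title"}

print(solr_field_name("Creator", schema_fields))       # attr_creator
print(solr_field_name("Content-Type", schema_fields))  # attr_content_type
print(solr_field_name("Title", schema_fields))         # title
```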
Because PDFs can be big, we also needed to bump the upload size limits on the requestDispatcher:
<requestDispatcher handleSelect="true" >
<requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="20480" formdataUploadLimitInKB="20480" />
</requestDispatcher>
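The value 20480 is in KB, so it works out to 20 × 1024 KB = 20 MB. A small pre-flight check like the one below can catch oversized files before they are uploaded. This is only a sketch: fits_upload_limit is a hypothetical helper, and because a multipart request adds boundary and header bytes on top of the file, a file sitting exactly at the limit may still be rejected.

```python
UPLOAD_LIMIT_KB = 20480  # matches multipartUploadLimitInKB in the config above


def fits_upload_limit(size_bytes, limit_kb=UPLOAD_LIMIT_KB):
    """Return True if a payload of size_bytes is within the configured KB limit."""
    return size_bytes <= limit_kb * 1024


print(UPLOAD_LIMIT_KB / 1024)                # 20.0 (MB)
print(fits_upload_limit(5 * 1024 * 1024))    # a 5 MB file fits
print(fits_upload_limit(25 * 1024 * 1024))   # a 25 MB file does not
```

For a file on disk, os.path.getsize(path) gives the byte count to pass in.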
You can now hit Solr via curl:
curl 'http://localhost:8983/solr/documents/update/extract?literal.id=doc2&commit=true&extractOnly=true' -F "myfile=@files/alvarez20140715a.pdf"
and get back from Solr the Tika processed content in a relatively easy to process structure!
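If you would rather call the endpoint from a script than from curl, the same request can be sketched with the Python standard library alone. The function names here (build_extract_url, encode_multipart, extract) are hypothetical, the host, collection, and form field name are copied from the curl command, and the response is returned as raw text because its exact layout depends on your Solr version and the wt parameter.

```python
import os
import urllib.parse
import urllib.request
import uuid


def build_extract_url(base_url, collection, doc_id):
    """Build the /update/extract URL with the same parameters as the curl call."""
    params = {"literal.id": doc_id, "commit": "true", "extractOnly": "true"}
    query = urllib.parse.urlencode(params)
    return f"{base_url}/solr/{collection}/update/extract?{query}"


def encode_multipart(field_name, filename, payload, content_type="application/pdf"):
    """Encode one file as multipart/form-data; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract(pdf_path, base_url="http://localhost:8983", collection="documents", doc_id="doc2"):
    """POST a PDF to Solr's extract handler and return the raw response text."""
    with open(pdf_path, "rb") as fh:
        body, content_type = encode_multipart("myfile", os.path.basename(pdf_path), fh.read())
    request = urllib.request.Request(
        build_extract_url(base_url, collection, doc_id),
        data=body,  # supplying data makes this a POST
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Usage, assuming Solr is running locally with a "documents" collection:
# print(extract("files/alvarez20140715a.pdf"))
```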