Support for multiple namespaces #488
Comments
To that: I completely agree, and it seems that at least the ALTO parser already tolerates multiple namespace versions. It would be great if the planned changes upheld that as a principle. (Strategically, I think it is also the best choice for data providers and workflows. We wouldn't want to discourage anyone from ingesting the best possible version of their fulltext, even if the current Presentation cannot make full use of it yet. Features like polygonal regions, angle/orientation, layout tags that differentiate text region types and image region types, baseline curves, and glyphs could all prove very valuable to future versions of Presentation. Moreover, they can be immediately useful for downstream applications of researchers who simply download the fulltext.)
This supports only ALTO 2.x. A solution is needed to support different namespaces. See #488
@sebastian-meyer notified me (via other communication) that one necessary ingredient is v4 support in the Solr indexer. So as long as this is not integrated/configured/tested, we still have to ingest v2, otherwise there will be full text highlighting, but no search.
To be precise: the problem here is not the indexer itself, because that just takes whatever text it gets. But we have parsers for every supported fulltext format (for ALTO it's https://github.com/kitodo/kitodo-presentation/blob/master/Classes/Format/Alto.php) and those are currently using hard-coded namespace URIs instead of the actual URIs configured in the format table. Also, we have to take into account the solr_ocrhighlighting plugin we are using, because that interprets ALTO as well in order to get the word coordinates. I am not sure which ALTO versions are supported by this plugin.
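To illustrate why a hard-coded namespace URI matters, here is a minimal Python sketch (the actual parser is PHP, and this is not its code). The sample document and the function name `words_with_hard_coded_ns` are made up for illustration; only the ALTO namespace URIs are real. A namespace-aware query bound to the v2 URI finds nothing in a v4 file, even though the element names are identical.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal ALTO v4 document, for illustration only.
ALTO_V4 = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Hello" HPOS="10" VPOS="20" WIDTH="50" HEIGHT="12"/>
    <SP/>
    <String CONTENT="world" HPOS="65" VPOS="20" WIDTH="50" HEIGHT="12"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

# The prefix is bound to the ALTO v2 namespace only (the hard-coded case).
HARD_CODED_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}


def words_with_hard_coded_ns(xml_text):
    """Collect word contents using a query fixed to the v2 namespace."""
    root = ET.fromstring(xml_text)
    strings = root.findall(".//alto:String", HARD_CODED_NS)
    return [s.get("CONTENT") for s in strings]


# The v4 elements live in a different namespace, so nothing matches.
print(words_with_hard_coded_ns(ALTO_V4))  # prints []
```

The same document with the v2 namespace URI in its root element would yield both words, which matches the reported symptom of fulltext being shown but not indexed.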
Looks like it is version-agnostic as well.
UB Braunschweig creates ALTO 4.2 in their Kitodo-OCR-D workflow, so that seems to work fine.
A quick solution in the ALTO case would be to change the way the namespace is registered in
This way, we could at least quickly allow the use of all ALTO file versions instead of waiting until the multiple-namespace issue is solved in general, which, from what I can see, is a far more complicated task. After studying the changes in the ALTO schema versions, the new schema versions after v2.1 do not change anything for
For testing: We are currently using Solr 9.3.0 with the 0.8.3 version of the
Maybe others would like to test as well?
We just noticed that only ALTO v2 is fully supported! While Kitodo.Presentation is able to show the fulltext for v2 and newer ALTO versions (JavaScript code), it fails to add that fulltext to its search index (PHP code), as we noticed this week.
I made a PR a while ago that allows indexing of ALTO 2, 3 and 4; see #1117. It was tested with Solr 9.3.0 and the ocr-module in version 0.8.3, and we use it this way in our current production environment. It is just a small change to the alto-parser and mini-ocr that registers the individual namespace of the ALTO file.
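The idea of registering the namespace that the ALTO file itself declares could look roughly like the following. This is a Python sketch for illustration only, not the code from #1117 (the actual parser is PHP); the helper names `sample_alto`, `root_namespace` and `extract_words` are hypothetical.

```python
import xml.etree.ElementTree as ET


def sample_alto(version):
    """Build a tiny, hypothetical ALTO document for the given major version."""
    ns = f"http://www.loc.gov/standards/alto/ns-v{version}#"
    return (
        f'<alto xmlns="{ns}"><Layout><Page><PrintSpace><TextBlock><TextLine>'
        '<String CONTENT="Hello"/><SP/><String CONTENT="world"/>'
        "</TextLine></TextBlock></PrintSpace></Page></Layout></alto>"
    )


def root_namespace(root):
    """Return the namespace URI of the root element, or "" if it has none."""
    # ElementTree spells qualified tags as "{namespace-uri}localname".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return ""


def extract_words(xml_text):
    """Collect word contents using whatever namespace the file declares."""
    root = ET.fromstring(xml_text)
    ns = root_namespace(root)
    # Build the qualified tag from the document's own namespace.
    tag = f"{{{ns}}}String" if ns else "String"
    return [s.get("CONTENT") for s in root.iter(tag)]


for version in (2, 3, 4):
    print(version, extract_words(sample_alto(version)))
```

Because the namespace is read from the file rather than fixed in the code, the same extraction works for v2, v3 and v4 documents, as long as the elements being read are unchanged between schema versions.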
Many formats used in the context of Kitodo reflect their versioning in the namespace (e.g. MODS, ALTO, IIIF, TEI). We therefore need to add default support for more than one version (although multiple versions could use the same parser).
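One way to picture several namespace versions sharing a single parser is a lookup from namespace URI to parser. The sketch below is in Python and purely illustrative; the table `PARSERS_BY_NAMESPACE`, the parser labels and the function `parser_for` are assumptions, not the Kitodo.Presentation format table or its API.

```python
# Hypothetical registry: several namespace URIs map to one parser label.
PARSERS_BY_NAMESPACE = {
    "http://www.loc.gov/standards/alto/ns-v2#": "alto",
    "http://www.loc.gov/standards/alto/ns-v3#": "alto",
    "http://www.loc.gov/standards/alto/ns-v4#": "alto",
    "http://www.loc.gov/mods/v3": "mods",
}


def parser_for(namespace_uri):
    """Return the parser label registered for a namespace, or None."""
    return PARSERS_BY_NAMESPACE.get(namespace_uri)


# All three ALTO versions resolve to the same parser.
print(parser_for("http://www.loc.gov/standards/alto/ns-v4#"))  # prints alto
print(parser_for("urn:example:unknown"))  # prints None
```

Supporting a new version then means adding one entry to the table rather than writing a new parser, provided the parser itself does not depend on a fixed namespace.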
This issue was raised by tesseract-ocr/tesseract#2815