Support for multiple namespaces #488

sebastian-meyer opened this issue Mar 4, 2020 · 8 comments
Labels
⚙ feature A new feature or enhancement.

Comments

@sebastian-meyer
Member

Many formats used in the context of Kitodo reflect their versioning in the namespace (e.g. MODS, ALTO, IIIF, TEI). We therefore need to add default support for more than one version (although multiple versions could use the same parser).

This issue was raised by tesseract-ocr/tesseract#2815

@bertsky

bertsky commented Jul 2, 2021

> although multiple versions could use the same parser

To that: I completely agree, and it seems at least the ALTO parser currently does already tolerate multiple namespace versions. It would be great if the planned changes would hold that up as a principle.

(And strategically, I think it is also the best choice for data providers and workflows. We wouldn't want to discourage anyone from ingesting the best possible version of their fulltext – even if the current Presentation cannot make full use of it yet. Features like polygonal regions, angle/orientation, layout tags to distinguish different text region types and image region types, baseline curves, glyphs could all prove very valuable to future versions of Presentation. Moreover, they can be immediately useful for downstream applications of researchers who simply download the fulltext.)

albig pushed a commit that referenced this issue Sep 21, 2021
This supports only ALTO 2.x.

There should be found a solution to support different namespaces. See #488
@bertsky

bertsky commented Apr 24, 2023

@sebastian-meyer notified me (via other communication) that one necessary ingredient is v4 support in the Solr indexer.

So as long as this is not integrated/configured/tested, we still have to ingest v2, otherwise there will be full text highlighting, but no search.

@sebastian-meyer
Member Author

sebastian-meyer commented Apr 24, 2023

To be precise: the problem here is not the indexer itself, because that just takes whatever text it gets. But we have parsers for every supported fulltext format (for ALTO it's https://github.com/kitodo/kitodo-presentation/blob/master/Classes/Format/Alto.php) and those are currently using hard-coded namespace URIs instead of the actual URIs configured in the format table.
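
A minimal sketch of what registering configured URIs could look like, assuming the URIs arrive as a plain array. The function name `registerConfiguredNamespace` and the `$configuredUris` list are hypothetical and do not reflect how Kitodo.Presentation actually reads its format table:

```php
<?php

/**
 * Hypothetical helper: register the first configured namespace URI that the
 * document actually declares, and return it (or null if none matches).
 */
function registerConfiguredNamespace(SimpleXMLElement $xml, string $prefix, array $configuredUris): ?string
{
    // getDocNamespaces(true) also collects namespaces declared below the root.
    $declared = $xml->getDocNamespaces(true);
    foreach ($configuredUris as $uri) {
        if (in_array($uri, $declared, true)) {
            $xml->registerXPathNamespace($prefix, $uri);
            return $uri;
        }
    }
    return null;
}

// Assumption: in a real setup these URIs would come from the format table.
$configuredUris = [
    'http://www.loc.gov/standards/alto/ns-v2#',
    'http://www.loc.gov/standards/alto/ns-v3#',
    'http://www.loc.gov/standards/alto/ns-v4#',
];

// Tiny illustrative ALTO v4 document (not a schema-complete file).
$xml = simplexml_load_string(
    '<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">'
    . '<Layout><Page><PrintSpace><TextBlock><TextLine>'
    . '<String CONTENT="Hello"/><SP/><String CONTENT="world"/>'
    . '</TextLine></TextBlock></PrintSpace></Page></Layout></alto>'
);

$matchedUri = registerConfiguredNamespace($xml, 'alto', $configuredUris);

// Collect the CONTENT attribute of every String element.
$words = [];
foreach ($xml->xpath('//alto:String') as $string) {
    $words[] = (string) $string['CONTENT'];
}
$text = implode(' ', $words);

echo $matchedUri, ' => ', $text, PHP_EOL;
```

Returning the matched URI (or null) lets a caller decide what to do with documents whose namespace is not configured, instead of silently querying with the wrong one.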

Also, we have to take into account the solr_ocrhighlighting plugin we are using, because that interprets ALTO as well in order to get the word coordinates. I am not sure which ALTO versions are supported by this plugin.

@bertsky

bertsky commented Apr 24, 2023

> Also, we have to take into account the solr_ocrhighlighting plugin we are using, because that interprets ALTO as well in order to get the word coordinates. I am not sure which ALTO versions are supported by this plugin.

looks like it is version-agnostic as well.

@stweil
Member

stweil commented Sep 12, 2023

UB Braunschweig creates ALTO 4.2 in their Kitodo-OCR-D workflow, so that seems to work fine.

@michaelkubina
Collaborator

michaelkubina commented Dec 12, 2023

A quick solution in the ALTO case would be to change how the namespace is registered in getRawText() or getTextAsMiniOcr() in Classes/Format/Alto.php. We could simply check which namespace the ALTO file uses and register it accordingly, like this:

        // instead of this...
        //$xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v2#');

        // ...we could use this: register whichever ALTO namespace the file declares
        $namespaces = $xml->getDocNamespaces();

        $altoNamespaces = [
            'http://www.loc.gov/standards/alto/ns-v2#',
            'http://www.loc.gov/standards/alto/ns-v3#',
            'http://www.loc.gov/standards/alto/ns-v4#',
        ];

        foreach ($altoNamespaces as $altoNamespace) {
            if (in_array($altoNamespace, $namespaces, true)) {
                $xml->registerXPathNamespace('alto', $altoNamespace);
            }
        }

This way, we could at least quickly allow the use of all ALTO file versions instead of waiting until the multiple namespace issue is solved in general, which, from what I see, is a much more complicated task.

After studying the changes in the ALTO schema versions, the new schema versions after v2.1 do not change anything for getRawText(), since we only extract the text from the @CONTENT attribute.

For getTextAsMiniOcr(), a possible issue could arise from the type change of HPOS, VPOS, WIDTH and HEIGHT after schema version 3.0, where they become xsd:float instead of xsd:int. Other than that, we are still only interested in TextLine and String and simply do not use other features for now. So far I have not been able to find ALTO files in which actual fractions are used for position attributes. Tesseract exports ALTO v3 with whole integers, and Kraken exports ALTO v4 with whole integers as well (https://kraken.re/4.0/ketos.html).
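
To illustrate the xsd:float concern, here is a sketch of reading position attributes so that both whole and fractional values are accepted. The helper name `readCoordinate` is hypothetical, and rounding to whole pixels is an assumption made for this example only; whether the MiniOCR consumer actually needs integers has not been established in this thread.

```php
<?php

/**
 * Hypothetical helper: read an ALTO position attribute (HPOS, VPOS, WIDTH,
 * HEIGHT) that may be written as an integer or as a float.
 * Rounding to whole pixels is an assumption made for this sketch.
 */
function readCoordinate(SimpleXMLElement $element, string $attribute): int
{
    // A missing attribute casts to an empty string, which is_numeric() rejects.
    $raw = trim((string) $element[$attribute]);
    if (!is_numeric($raw)) {
        throw new InvalidArgumentException("Missing or non-numeric $attribute");
    }
    return (int) round((float) $raw);
}

// Illustrative String element mixing integer and fractional values.
$string = simplexml_load_string(
    '<String CONTENT="word" HPOS="101.6" VPOS="20" WIDTH="54.25" HEIGHT="12.0"/>'
);

$box = [];
foreach (['HPOS', 'VPOS', 'WIDTH', 'HEIGHT'] as $name) {
    $box[$name] = readCoordinate($string, $name);
}

print_r($box);
```

Casting through float first means a value such as "12.0" is handled the same way as "12", so integer-only files from Tesseract or Kraken would pass through unchanged.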

Testing

We are currently using Solr 9.3.0 with version 0.8.3 of the dbmdz.solr-ocrhighlighting module. Indexing the OCR from both ALTO v2 and v3 works fine with the proposed solution. I was not able to test it for ALTO v4, since we don't produce any such files at the moment. Highlighting (snippets) in the Listview works as well, no issues here.

Maybe others would like to test as well?

@stweil
Member

stweil commented May 15, 2024

> it seems at least the ALTO parser currently does already tolerate (OCR-D/core#544 (comment)) multiple namespace versions

> UB Braunschweig creates ALTO 4.2 in their Kitodo-OCR-D workflow, so that seems to work fine.

We just noticed that only ALTO v2 is fully supported!

While Kitodo.Presentation is able to show the fulltext for v2 and newer ALTO versions (JavaScript code), it fails to add that fulltext to its search index (PHP code) as we noticed this week.
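
One plausible explanation for this kind of silent failure, consistent with the hard-coded namespace URIs mentioned earlier in the thread: an XPath query bound to the v2 namespace simply returns no nodes for a v4 document, without raising an error. A self-contained sketch (the sample document and the helper name `extractText` are illustrative, not the actual parser code):

```php
<?php

// Illustrative helper: collect CONTENT attributes using a fixed namespace URI.
function extractText(string $altoXml, string $namespaceUri): string
{
    $xml = simplexml_load_string($altoXml);
    $xml->registerXPathNamespace('alto', $namespaceUri);
    $words = [];
    foreach ($xml->xpath('//alto:String') as $string) {
        $words[] = (string) $string['CONTENT'];
    }
    return implode(' ', $words);
}

// Tiny illustrative ALTO v4 document (not a schema-complete file).
$altoV4 = '<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">'
    . '<Layout><Page><TextLine><String CONTENT="Kitodo"/></TextLine></Page></Layout>'
    . '</alto>';

// Hard-coded v2 URI: the query matches nothing, so the text is empty.
$withV2 = extractText($altoV4, 'http://www.loc.gov/standards/alto/ns-v2#');
// URI that the document actually declares: the text is found.
$withV4 = extractText($altoV4, 'http://www.loc.gov/standards/alto/ns-v4#');

var_dump($withV2, $withV4);
```

Because the mismatch produces an empty string rather than an exception, such a document would be indexed without any fulltext unless the parser checks for an empty result.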

@michaelkubina
Collaborator

I made a PR a while ago that allows indexing of ALTO 2, 3 and 4; see #1117. It was tested with Solr 9.3.0 and version 0.8.3 of the OCR module.

We use it this way in our current production environment. It's just a small change to the ALTO parser and MiniOCR that registers the individual namespace of the ALTO file.
