[FEATURE] Add support for ALTO v3 and v4 #1117

michaelkubina
Collaborator

This PR makes the ALTO parser agnostic of the ALTO schema version used in the provided file. It checks the document's namespaces and registers the matching one. This way we can index ALTO files in schema versions 2, 3 and 4.
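
To illustrate the idea, here is a minimal sketch in Python. It is not the code from this PR (which changes the PHP class in Classes/Format/Alto.php); the function names `detect_alto_namespace` and `extract_words` are hypothetical. It assumes the ALTO version can be read from the namespace URI of the root element and binds a fixed `alto` prefix to whichever URI is found, so the same queries work for all three versions.

```python
import xml.etree.ElementTree as ET

# Published ALTO namespace URIs, keyed by major schema version.
ALTO_NAMESPACES = {
    2: "http://www.loc.gov/standards/alto/ns-v2#",
    3: "http://www.loc.gov/standards/alto/ns-v3#",
    4: "http://www.loc.gov/standards/alto/ns-v4#",
}


def detect_alto_namespace(xml_text):
    """Return the ALTO namespace URI of the root element, or None if unknown."""
    root = ET.fromstring(xml_text)
    # ElementTree encodes a namespaced tag as "{uri}localname".
    if root.tag.startswith("{"):
        uri = root.tag[1:].split("}", 1)[0]
        if uri in ALTO_NAMESPACES.values():
            return uri
    return None


def extract_words(xml_text):
    """Collect the CONTENT of every String element, whatever the ALTO version."""
    uri = detect_alto_namespace(xml_text)
    if uri is None:
        return []
    # Bind the fixed prefix "alto" to the detected URI (hypothetical prefix name).
    namespaces = {"alto": uri}
    root = ET.fromstring(xml_text)
    return [s.get("CONTENT") for s in root.findall(".//alto:String", namespaces)]
```

Because the prefix is bound at runtime, the query string `.//alto:String` does not need to change when a file uses a different ALTO version.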

I have tested it in our development environment with Solr 9.4.0 and the ocr-highlighting module in version 0.8.3 and have not encountered any issues. It has been rolled out in our production environment as well.

While it does not resolve the general namespace issue, it does at least resolve the ALTO issue.

This is related to #488 .

@csidirop
Contributor

csidirop commented May 16, 2024

I would suggest adding feedback for future "unsupported" versions. This could be a log entry or output written directly to the command line (via SymfonyStyle). That way we at least know that something is off.

Proposal: Add an else block that catches all other cases and provides some form of feedback.
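
A rough sketch of what such a fallback could look like, written in Python for illustration only (the PR itself is PHP and does not include this). The function name `resolve_alto_namespace`, the logger name `alto_parser` and the message text are assumptions, and a logger stands in for the command-line output mentioned above.

```python
import logging
import xml.etree.ElementTree as ET

logger = logging.getLogger("alto_parser")  # hypothetical logger name

# ALTO namespace URIs the parser is assumed to support.
SUPPORTED_NAMESPACES = {
    "http://www.loc.gov/standards/alto/ns-v2#",
    "http://www.loc.gov/standards/alto/ns-v3#",
    "http://www.loc.gov/standards/alto/ns-v4#",
}


def resolve_alto_namespace(xml_text):
    """Return the root namespace URI if supported; otherwise log a warning."""
    root = ET.fromstring(xml_text)
    # ElementTree encodes a namespaced tag as "{uri}localname".
    uri = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else ""
    if uri in SUPPORTED_NAMESPACES:
        return uri
    else:
        # Catch-all branch: report the unexpected namespace instead of
        # skipping the file silently.
        logger.warning("Unsupported or missing ALTO namespace: %r", uri)
        return None
```

With this shape, a file in a future schema version would still be skipped, but the skip leaves a trace in the log.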

Classes/Format/Alto.php (3 outdated review threads, resolved)
@BFallert
Collaborator

BFallert commented May 16, 2024

I have tested it in our development environment with Solr 8.11.2.
I have tested documents with ALTO 2.x / 3.x / 4.x.
All tests were successful. The data arrived in the Solr dataset in all cases.

michaelkubina and others added 2 commits May 17, 2024 08:53
Co-authored-by: Stefan Weil <sw@weilnetz.de>
Co-authored-by: Christos Sidiropoulos <csidirop@runbox.com>
@michaelkubina
Collaborator Author

As proposed by @stweil, I have moved the namespace registration to a private function to reduce code duplication.

...

> I would suggest adding feedback for future "unsupported" versions. This could be a log entry or output written directly to the command line (via SymfonyStyle). That way we at least know that something is off.
>
> Proposal: Add an else block that catches all other cases and provides some form of feedback.

While I like this idea, I currently have no time to look into how to achieve this, but we should keep it in mind for the future. Looking at what OCR engines currently output, we cover the necessary schemas.

add Alto formats 3 and 4 in documentation
Member

@stweil left a comment

Thanks!

@sebastian-meyer changed the title from "[FEATURE] Make ALTO parser register namespaces individually" to "[FEATURE] Add support for ALTO v3 and v4" on May 21, 2024
@sebastian-meyer merged commit cfb46e8 into kitodo:master on May 21, 2024
6 checks passed
Labels
⚙ feature A new feature or enhancement.
6 participants