[FEATURE] Add support for ALTO v3 and v4 #1117

Merged
26 changes: 24 additions & 2 deletions Classes/Format/Alto.php
@@ -36,7 +36,19 @@
     public function getRawText(\SimpleXMLElement $xml): string
     {
         $rawText = '';
-        $xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v2#');
+
Check notice on line 39 in Classes/Format/Alto.php (Codacy Production / Codacy Static Code Analysis): Whitespace found at end of line
michaelkubina marked this conversation as resolved.
+        // register ALTO namespace depending on document
+        $namespace = $xml->getDocNamespaces();
+        if (in_array('http://www.loc.gov/standards/alto/ns-v2#', $namespace, true)) {
+            $xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v2#');
+        }
+        if (in_array('http://www.loc.gov/standards/alto/ns-v3#', $namespace, true)) {
+            $xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v3#');
+        }
+        if (in_array('http://www.loc.gov/standards/alto/ns-v4#', $namespace, true)) {
+            $xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v4#');
+        }
+
michaelkubina marked this conversation as resolved.
         // Get all (presumed) words of the text.
         $strings = $xml->xpath('./alto:Layout/alto:Page/alto:PrintSpace//alto:TextBlock/alto:TextLine/alto:String');
         $words = [];
@@ -68,7 +80,17 @@
      */
     public function getTextAsMiniOcr(\SimpleXMLElement $xml): string
     {
-        $xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v2#');
+        // register ALTO namespace depending on document
+        $namespace = $xml->getDocNamespaces();
+        if (in_array('http://www.loc.gov/standards/alto/ns-v2#', $namespace, true)) {
+            $xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v2#');
+        }
+        if (in_array('http://www.loc.gov/standards/alto/ns-v3#', $namespace, true)) {
+            $xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v3#');
+        }
+        if (in_array('http://www.loc.gov/standards/alto/ns-v4#', $namespace, true)) {
+            $xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v4#');
+        }
michaelkubina marked this conversation as resolved.

         // get all text blocks
         $blocks = $xml->xpath('./alto:Layout/alto:Page/alto:PrintSpace//alto:TextBlock');
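
Both hunks repeat the same three namespace checks. As an illustration only, the detection could be factored into one helper. The sketch below is not part of this PR: the function name `registerAltoNamespace` and the sample ALTO v3 document are hypothetical. It relies only on `SimpleXMLElement::getDocNamespaces()`, `registerXPathNamespace()` and `xpath()`, and it keeps the behaviour of the diff, where a later match overrides an earlier one if a document declares several ALTO namespaces.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the PR): binds the 'alto' XPath prefix to
 * whichever supported ALTO namespace the document declares on its root element.
 *
 * @return string|null The namespace URI that ended up registered, or null if none matched.
 */
function registerAltoNamespace(\SimpleXMLElement $xml): ?string
{
    $supported = [
        'http://www.loc.gov/standards/alto/ns-v2#',
        'http://www.loc.gov/standards/alto/ns-v3#',
        'http://www.loc.gov/standards/alto/ns-v4#',
    ];

    // Namespaces declared on the root element, as prefix => URI.
    $declared = $xml->getDocNamespaces();

    $registered = null;
    foreach ($supported as $uri) {
        if (in_array($uri, $declared, true)) {
            // As in the diff, a later match replaces an earlier registration.
            $xml->registerXPathNamespace('alto', $uri);
            $registered = $uri;
        }
    }

    return $registered;
}

// Usage with a made-up, minimal ALTO v3 document.
$altoV3 = <<<XML
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Hello"/><SP/><String CONTENT="ALTO"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>
XML;

$xml = new \SimpleXMLElement($altoV3);
$uri = registerAltoNamespace($xml);

// Same XPath expression as in getRawText().
$strings = $xml->xpath('./alto:Layout/alto:Page/alto:PrintSpace//alto:TextBlock/alto:TextLine/alto:String');
$words = array_map(
    static fn (\SimpleXMLElement $s): string => (string) $s['CONTENT'],
    $strings
);

echo $uri, PHP_EOL, implode(' ', $words), PHP_EOL;
```

With such a helper, `getRawText()` and `getTextAsMiniOcr()` would each need a single call, and the `null` return gives the caller a place to handle documents that declare none of the three namespaces, a case the diff leaves without any registered prefix.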