Line height parameters broke hOCR output #225

Closed
tfmorris opened this issue Feb 15, 2016 · 0 comments

Commit 438edd6 from PR #27 has a couple of problems with it.

Most seriously, the new information is inserted into the middle of the element ID, which produces many duplicate IDs on the page and corrupts the ascender height information.
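One way to confirm the duplicate ID problem on a given output file is to collect every `id` attribute and report the ones that repeat. This is a minimal sketch using only the Python standard library; the function name `duplicate_ids` and the sample markup are made up for illustration and are not taken from Tesseract's actual output.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute value seen while parsing."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value is not None:
                self.ids.append(value)


def duplicate_ids(hocr_text):
    """Return the ids that occur more than once, sorted alphabetically."""
    collector = IdCollector()
    collector.feed(hocr_text)
    counts = Counter(collector.ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Hypothetical markup: two lines share an id, a third is unique.
sample = (
    "<div>"
    "<span class='ocr_line' id='line_1'>a</span>"
    "<span class='ocr_line' id='line_1'>b</span>"
    "<span class='ocr_line' id='line_2'>c</span>"
    "</div>"
)
print(duplicate_ids(sample))
```

An empty result means every ID in the file is unique.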

The other issue is that it uses non-standard attributes, which won't validate. The hOCR spec places all its information in the title attribute, and I believe it makes the most sense to use that mechanism for the extended information, with the x_ prefix to avoid collisions with future extensions to the spec.
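To illustrate why the title attribute is convenient for consumers: it holds semicolon-separated properties, each a name followed by space-separated values, so one small parser can handle both standard and `x_`-prefixed entries. The sketch below assumes that layout and keeps values as strings; the property names in the sample (`x_size`, `x_descenders`, `x_ascenders`) are illustrative and are not stated in this thread, and the helper names are hypothetical.

```python
def parse_hocr_title(title):
    """Split an hOCR title attribute into {property name: [values]}."""
    props = {}
    for field in title.split(";"):
        parts = field.split()
        if not parts:
            continue  # skip empty fields, e.g. after a trailing semicolon
        props[parts[0]] = parts[1:]
    return props


def extension_props(props):
    """Keep only the properties that use the x_ extension prefix."""
    return {k: v for k, v in props.items() if k.startswith("x_")}


# Illustrative title value; the x_ names are assumptions for this example.
title = "bbox 36 92 580 122; baseline 0 -6; x_size 31; x_descenders 6; x_ascenders 7"
props = parse_hocr_title(title)
print(props["bbox"])
print(sorted(extension_props(props)))
```

A consumer that ignores unknown names keeps working when new `x_` properties appear, which is the point of the prefix.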

@zdenop zdenop closed this as completed in 4317862 Feb 16, 2016
zdenop added a commit that referenced this issue Feb 16, 2016
INCOMPATIBLE fix to hOCR line height information - fixes #225.
zdenop pushed a commit that referenced this issue Feb 16, 2016
This fixes the duplicate line IDs caused by inserting height information
into the middle of the ID and it moves the line height info into
the title attribute like everything else, rather than using non-standard
HTML attributes (which won't validate).

This change may break consumers of the HTML output, but 3.04 has only
been in the wild for 6 months and the current HTML is invalid, so I
believe the benefit outweighs the cost for the fix.
@amitdo amitdo added the bug label May 26, 2016
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021
…r#225.

This fixes the duplicate line IDs caused by inserting height information
into the middle of the ID and it moves the line height info into
the title attribute like everything else, rather than using non-standard
HTML attributes (which won't validate).

This change may break consumers of the HTML output, but 3.04 has only
been in the wild for 6 months and the current HTML is invalid, so I 
believe the benefit outweighs the cost for the fix.
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021
INCOMPATIBLE fix to hOCR line height information - fixes tesseract-ocr#225.