Textline Box Files Tesseract 4.0 bad wording? #2357
Comments
LSTM training does not really need individual character coordinates. For each character, you can give the coordinates of its entire line. |
Well, it's not wrong. Tesseract will accept it. |
Tesseract 4 accepts multiple box file formats for LSTM training, though they are different from the one used by Tesseract 3.
Please note that box files generated using
The attached zip file has a sample tif file and the different types of box files for it, so it is easy to see the additional line with a TAB character that marks EOL in the box files for Tesseract 4. |
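To make the point about line-level coordinates concrete, here is a minimal Python sketch of what such box lines could look like. It repeats the text line's bounding box for every character and then adds one extra line that starts with a TAB to mark the end of the line. The function name `lstm_box_lines` and the coordinate values are made up for illustration, and reusing the line's coordinates on the TAB line is an assumption of this sketch, not necessarily what Tesseract itself writes.

```python
def lstm_box_lines(text, left, bottom, right, top, page=0):
    """Return box-file lines for one text line.

    Every character gets the bounding box of the whole text line,
    so no per-character coordinates are needed.
    """
    coords = f"{left} {bottom} {right} {top} {page}"
    lines = [f"{ch} {coords}" for ch in text]
    # Assumption: the end-of-line marker is a TAB in place of the character.
    # Reusing the line's coordinates here is a simplification for illustration.
    lines.append(f"\t {coords}")
    return lines


if __name__ == "__main__":
    # Hypothetical line text and coordinates.
    print("\n".join(lstm_box_lines("Hi you", 36, 92, 618, 122)))
```

Comparing this output with the box files in the attached zip should show whether the TAB line in real files uses different coordinates.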
WordStr box files
Create phototest-wordstr.box from phototest.tif.
Review and edit phototest-wordstr.box to have the correct text for each line. In case a ground-truth text file is available for the image, you can try to automate the edit process.
Remove the OCRed text from the box file.
Review phototest.box to make sure that the lines match. |
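The "automate the edit process" step could be sketched roughly as below in Python. The sketch assumes each WordStr line has the form `WordStr <left> <bottom> <right> <top> <page> #<text>` and that the ground-truth file has exactly one line of text per WordStr line, in the same order. Both are assumptions to check against your own files, and `apply_ground_truth` is a hypothetical helper, not part of Tesseract.

```python
def apply_ground_truth(box_lines, truth_lines):
    """Replace the OCRed text after '#' on each WordStr line
    with the corresponding ground-truth line.

    Lines that do not start with 'WordStr ' (such as the TAB
    end-of-line lines) are passed through unchanged.
    """
    truth = iter(truth_lines)
    out = []
    for line in box_lines:
        if line.startswith("WordStr ") and "#" in line:
            prefix = line.split("#", 1)[0]  # keep the coordinates, drop the OCRed text
            try:
                text = next(truth)
            except StopIteration:
                raise ValueError("fewer ground-truth lines than WordStr lines")
            out.append(prefix + "#" + text.strip())
        else:
            out.append(line)
    if next(truth, None) is not None:
        raise ValueError("more ground-truth lines than WordStr lines")
    return out


if __name__ == "__main__":
    # Hypothetical box content with an OCR error ("Thls").
    box = ["WordStr 36 92 618 122 0 #Thls is a test", "\t 619 92 623 122 0"]
    print("\n".join(apply_ground_truth(box, ["This is a test\n"])))
```

The mismatch errors are deliberate: if the line counts differ, the segmentation and the ground truth do not line up, and the box file still needs a manual review.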
@Shreeshrii many thanks for the great and detailed explanation! I'll put some changes into the wiki to clarify this question. |
The lstmbox and wordstrbox options have been added recently. Please try them out with your image files. Thank you for changing the wiki to clarify this. |
»WordStr« format: the lines beginning with tab have 1 space character before the first digit. |
Please use the forum for asking questions. |
Thanks! |
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 states that:
https://github.com/tesseract-ocr/tesseract/wiki/Making-Box-Files---4.0 states pretty much the same:
But the example box file has individual character bboxes:
Does this mean that we need to create character bboxes AND textline bboxes?
If so, the wording should be something like:
"the required format is still the tiff/box file pair, except that the boxes need to cover a textline in addition to individual characters."
Or is the example box file wrong?