Patterns in --user-patterns appear to be ignored. #2940
Comments
It's not a bug. The patterns are used as hints.
The number in the image is If you don't want the comma, you should remove it yourself by post-processing Tesseract's output.
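The post-processing suggested above could look something like the following minimal sketch. It assumes the goal is the five-digit number from the issue, and it simply discards every non-digit character before checking the length. The function name `clean_number` and the sample string are hypothetical, chosen only for illustration; they are not part of Tesseract.

```python
import re
from typing import Optional

# The five-digit shape the issue's patterns file (\d\d\d\d\d) was meant to express.
FIVE_DIGITS = re.compile(r"[0-9]{5}")


def clean_number(ocr_text: str) -> Optional[str]:
    """Drop every non-digit character from OCR output.

    Returns the digits if exactly five remain, otherwise None so the
    caller can decide how to handle an unexpected result.
    """
    digits = re.sub(r"[^0-9]", "", ocr_text)
    return digits if FIVE_DIGITS.fullmatch(digits) else None


if __name__ == "__main__":
    # Made-up sample output; the real value from the issue's image is not known here.
    print(clean_number("12,345\n"))
```

Returning `None` instead of raising keeps the decision about bad reads with the caller, which may want to retry with different settings or flag the image for review.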
Thanks for clarifying that they are meant to be just hints. If I created a PR with changes to the documentation to clarify the behavior, do you think it would be accepted? I have seen several related questions around the web.
I'd like to document the order of preference between wordlists, whitelists, and user patterns. I'd also like to document the "hint strength" Tesseract gives to each and whether that strength is configurable.
About the PR, nothing can be guaranteed in advance. Sorry, I don't have answers to your other questions.
About the strength thing: here are two configurable parameters:
Note that these parameters are used only by the legacy engine. In general, unlike the legacy engine, the neural-network-based engine has very few configurable parameters.
This is either a bug, or the documentation needs to be clarified to say that patterns are merely suggestions to Tesseract and that the patterns may not be respected (if that is the case).
Environment
Tesseract Version:
tesseract 5.0.0-alpha-647-g4a00b
leptonica-1.80.0
Platform:
Linux 4.19.113-1-MANJARO x86_64 GNU/Linux
Current Behavior:
I have a patterns file containing a single pattern of 5 digits.
\d\d\d\d\d
tesseract 14.png - --oem 3 --psm 7 -l eng --user-patterns /tmp/2294/patterns.txt
Tesseract recognizes the comma.
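For anyone scripting this reproduction, the command above can be wrapped roughly as follows. This is only a sketch: the helper names `build_command` and `run_ocr` are hypothetical, the flags are copied from the command in this report, and it assumes the `tesseract` binary is on `PATH`.

```python
import subprocess
from typing import List


def build_command(image: str, patterns: str) -> List[str]:
    """Assemble the same invocation used in this report.

    "-" as the output base makes Tesseract write the text to stdout.
    """
    return [
        "tesseract", image, "-",
        "--oem", "3",
        "--psm", "7",
        "-l", "eng",
        "--user-patterns", patterns,
    ]


def run_ocr(image: str, patterns: str) -> str:
    """Run Tesseract and return its stdout with surrounding whitespace removed."""
    result = subprocess.run(
        build_command(image, patterns),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits non-zero
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Paths taken from the report; adjust to your own files.
    print(run_ocr("14.png", "/tmp/2294/patterns.txt"))
```

Keeping the argument list in its own function makes it easy to log or compare the exact command when reporting behavior like this.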
Expected Behavior:
I expect the pattern to be respected.
I cannot find anything in the manpage or other documentation that says how strictly patterns are respected.
https://tesseract-ocr.github.io/tessdoc/ImproveQuality#dictionaries-word-lists-and-patterns
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#CONFIGFILE
Suggested Fix:
I don't know. If this is expected behavior and patterns are mere suggestions that are not necessarily respected, then updating the documentation is the fix. If someone can confirm this, I'll see if I can understand the code well enough to write accurate documentation around it.