Patterns in --user-patterns appear to be ignored. #2940

Open
eihli opened this issue Apr 2, 2020 · 4 comments

eihli commented Apr 2, 2020


This is either a bug, or the documentation needs to be clarified to say that patterns are merely suggestions to Tesseract and that the patterns may not be respected (if that is the case).

Environment

  • Tesseract Version:
    tesseract 5.0.0-alpha-647-g4a00b
    leptonica-1.80.0

  • Platform:

Linux 4.19.113-1-MANJARO x86_64 GNU/Linux

Current Behavior:

14

I have a patterns file of 5 digits.

\d\d\d\d\d

tesseract 14.png - --oem 3 --psm 7 -l eng --user-patterns /tmp/2294/patterns.txt

Tesseract recognizes the comma.

Expected Behavior:

I expect the pattern to be respected.

I cannot find anything in the manpage or other documentation that says how strictly patterns are respected.

https://tesseract-ocr.github.io/tessdoc/ImproveQuality#dictionaries-word-lists-and-patterns
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#CONFIGFILE

Suggested Fix:

I don't know. If this is expected behavior and patterns are mere suggestions that are not necessarily respected, then updating the documentation is the fix. If someone can confirm this, I'll see if I can understand the code well enough to write accurate documentation around it.

Collaborator

amitdo commented Apr 6, 2020

It's not a bug. The patterns are used as hints.

\d\d\d\d\d

The number in the image is 30,480
so the pattern should be
\d\d,\d\d\d

If you don't want the comma, you should remove it yourself by post-processing Tesseract's output.
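The post-processing suggested above could look something like the following minimal sketch. It assumes the OCR text has already been captured as a string; the function name `normalize_five_digits` is hypothetical and not part of Tesseract.

```python
import re
from typing import Optional


def normalize_five_digits(ocr_text: str) -> Optional[str]:
    """Remove commas from OCR output and accept it only if five digits remain.

    Hypothetical helper for illustration; Tesseract itself does not provide it.
    """
    # Drop surrounding whitespace (Tesseract output usually ends with a newline)
    # and remove any thousands separators.
    cleaned = ocr_text.strip().replace(",", "")
    # Keep the result only if it is exactly five ASCII digits.
    if re.fullmatch(r"[0-9]{5}", cleaned):
        return cleaned
    return None


if __name__ == "__main__":
    print(normalize_five_digits("30,480\n"))
```

Returning `None` for anything that is not five digits after cleanup lets the caller decide how to handle a misread instead of silently passing it along.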

Author

eihli commented Apr 6, 2020

Thanks for clarifying that they are meant to be just hints. If I created a PR with changes to the documentation to clarify the behavior, do you think it would be accepted? I have seen several related questions around the web.

How to apply user patterns
https://groups.google.com/forum/#!topic/tesseract-ocr/vvnIBl7V3Q8

Tesseract bazaar option: how to make tesseract match the words in user-words first?
https://groups.google.com/forum/#!topic/tesseract-ocr/5vFqVcJmHnM

Disclosure: I don't know the order of preference - which DAWG does tesseract check first AND which DAWG over-rides the others. ex. "thls" is not in #1 but, say, is in #4 - will tesseract NOT jiggle the 'l' into an 'i' (which then matches in #1) or will it go with #4? Ray?
https://tesseract-ocr.repairfaq.org/allaboutdawg.html

I'd like to document the order of preference between wordlists, whitelists, and user patterns. I'd also like to document the "hint strength" tesseract gives to each and if that strength is configurable.

Collaborator

amitdo commented Apr 6, 2020

About the PR, nothing can be guaranteed in advance.

Sorry, I don't have answers to your other questions.

Collaborator

amitdo commented Apr 6, 2020

About the strength thing.

Here are two configurable parameters:

language_model_penalty_non_dict_word
language_model_penalty_non_freq_dict_word

Note that these parameters are used only by the legacy engine.

In general, unlike the legacy engine, the neural-network-based engine has very few configurable parameters.
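As a rough sketch of how these two parameters might be passed on the command line, the snippet below builds an argument list that reuses the invocation from the original report, switches to `--oem 0` (the legacy engine), and sets each parameter with `-c`. The helper name `build_legacy_command` and the penalty values are assumptions chosen for illustration, not recommended settings. The legacy engine also needs a traineddata file that includes legacy models, and whether changing these penalties alters the result for this image has not been verified here.

```python
import shlex
from typing import List


def build_legacy_command(
    image: str,
    patterns_file: str,
    non_dict_penalty: float,
    non_freq_penalty: float,
) -> List[str]:
    """Build a tesseract argument list for the legacy engine (hypothetical helper)."""
    return [
        "tesseract", image, "-",          # "-" writes the result to stdout
        "--oem", "0",                      # legacy engine only
        "--psm", "7",                      # treat the image as a single text line
        "-l", "eng",
        "--user-patterns", patterns_file,
        # Parameter names come from the comment above; the values are illustrative.
        "-c", f"language_model_penalty_non_dict_word={non_dict_penalty}",
        "-c", f"language_model_penalty_non_freq_dict_word={non_freq_penalty}",
    ]


if __name__ == "__main__":
    cmd = build_legacy_command("14.png", "/tmp/2294/patterns.txt", 0.3, 0.2)
    # Print the shell-quoted command; run it with subprocess.run(cmd) if desired.
    print(shlex.join(cmd))
```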
