trying to add user words/patterns again: #2328
- pass in ParamsVectors from Tesseract (carrying values from langdata/config/api) into LSTMRecognizer::Load and LoadDictionary
- after LSTMRecognizer's Dict is initialised (with default values), reset the variables user_{words,patterns}_{suffix,file} from the corresponding entries in the passed vector
Replaces previous attempt #2324. The goal is the same: This fixes #403 and #960 (if If more parameters need to be shared, this can easily be extended. |
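The idea in the description above (initialise with defaults, then overwrite a fixed set of variables from the passed-in values) can be sketched roughly as follows. This is only an illustration in Python, not the C++ code of this PR: the function name `init_dict_params`, the dictionary representation and the empty-string defaults are assumptions; only the four variable names come from the description.

```python
# The four variables named in the PR description.
SHARED_PARAMS = (
    "user_words_file",
    "user_words_suffix",
    "user_patterns_file",
    "user_patterns_suffix",
)


def init_dict_params(passed):
    """Hypothetical helper: start from defaults, then copy shared values over.

    `passed` stands in for the values coming from langdata/config/api.
    """
    # Step 1: the dictionary is initialised with default values
    # (empty strings are an assumption made for this sketch).
    params = {name: "" for name in SHARED_PARAMS}
    # Step 2: only the shared variables are reset from the passed values;
    # anything else in `passed` is ignored.
    for name in SHARED_PARAMS:
        if name in passed:
            params[name] = passed[name]
    return params


result = init_dict_params(
    {"user_patterns_file": "my.patterns", "some_other_param": "1"}
)
```

Sharing more parameters would then amount to adding names to `SHARED_PARAMS`.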
@bertsky I am trying to test but not having much success so far. Can you share the use case you tested? Please try out the following issue posted in the forum, which has a test image and results too. https://groups.google.com/forum/#!topic/tesseract-ocr/iLeK3Ny6fcM |
@Shreeshrii I will give that user words example a try. As stated above, so far I have tested on your user patterns example from the #403 thread. With the change I get fewer results violating the user pattern (two capital letters, three digits, two capital letters). I was also going to try changing the weight constants to boost dict results over free symbol paths, but did not have time to test that yet. |
Regarding the weight of these new dawgs against others and free symbol beam path, I believe that |
Yes, I am aware of that and some more Stackoverflow threads on this. If I get the time, I will have a look. In principle these should be solved now (modulo the issue of proper weights or exclusiveness). |
Testing revealed that hypothesis to be false. On the other hand, I found that exclusiveness is in fact easy to achieve: One merely needs to skip the non-dawg beams in But there is a downside that probably does not make this a good solution either: It is perfectly possible to get no results at all then (in place of user word/pattern violations), because the beam is generally too narrow. And it does not suffice to increase I am still trying to understand how this could be overcome. It is not enough to simply allow more than 2 top candidates to enter the beam. Maybe the user pattern should actually be used as a hard constraint when extending the beam (instead of a soft constraint based on path score), not merely when extracting/back-tracking from it. (But it does not work to not pursue non-dawg beams and eliminate the other dawgs from |
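To make the distinction between filtering at extraction time and constraining during extension concrete, here is a toy beam search in Python. It is a sketch under strong assumptions and does not model Tesseract's actual decoder: the costs in `STEPS` are made up, the pattern notation (`A` for an upper-case letter, `d` for a digit) is simplified, and `fits` and `beam_search` are hypothetical names.

```python
import heapq

# Toy character classes: "A" = upper-case letter, "d" = digit.
CLASSES = {"A": str.isupper, "d": str.isdigit}


def fits(text, pattern):
    """True if each character of `text` matches the class at its position."""
    return len(text) <= len(pattern) and all(
        CLASSES[cls](ch) for ch, cls in zip(text, pattern)
    )


def beam_search(steps, width, pattern, hard):
    """Keep the `width` cheapest paths per step; lower cost is better.

    hard=False: the pattern is only applied when extracting results.
    hard=True:  violating extensions are pruned before the beam is cut.
    """
    beam = [(0, "")]
    for candidates in steps:
        extended = [
            (cost + c, text + ch)
            for cost, text in beam
            for ch, c in candidates.items()
        ]
        if hard:
            extended = [p for p in extended if fits(p[1], pattern)]
        beam = heapq.nsmallest(width, extended)
    return [text for _, text in beam if fits(text, pattern)]


# Made-up per-position candidates with costs.
STEPS = [{"0": 1, "O": 2, "Q": 9}, {"X": 1, "K": 5}, {"S": 1, "5": 4}]

soft_result = beam_search(STEPS, width=2, pattern="AAd", hard=False)
hard_result = beam_search(STEPS, width=2, pattern="AAd", hard=True)
```

In this toy setup the two cheapest paths of the unconstrained search both violate the pattern, so filtering afterwards leaves nothing, whereas pruning during extension keeps conforming paths alive. Whether the same holds in the real implementation is exactly the open question of the comment above.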
@bertsky Please see #960 (comment) https://groups.google.com/forum/#!searchin/tesseract-ocr/user-patterns%7Csort:date/tesseract-ocr/S9CIK3jOMWw/MnXAws3qdfMJ had a suggestion about a different approach for this using whitelist - see post by dm. |
@Shreeshrii I am not sure what you are trying to say. As stated above, I am aware of #403 and its duplicate, #960. This PR fixes both for LSTMs, and in the proper way (I think). The idea sketched on the ML by dm is a very crude approach that (I believe) cannot be done right behind an API, and is not general enough for arbitrary user words/patterns. The only remaining problem here is the weight or exclusiveness of these constraints (as discussed by Ray), and how this can be accomplished. My last comment takes on these – I welcome any suggestions. |
@bertsky Please see the test files in attached zip file. There is no difference in the output with or without use of Here are the commands used to create first set of files. Hopefully the syntax for using
|
unzip user-patterns-test.zip
sed -i 's,~/TEST/,,g;s/wdiff --no-common/diff -u/' USER/patterns-test.sh
bash USER/patterns-test.sh
yields:
--- USER/AN1/eng.Arial_Unicode_MS.txt 2019-03-20 18:52:57.603430451 +0100
+++ USER/AN1/eng.Arial_Unicode_MS-UP.txt 2019-03-20 18:52:58.199440833 +0100
@@ -1,12 +1,12 @@
DQ2679M
-LO62171
+LO6217I
QK2101G
JB0363H
KN2873M
ZB0929J
JF3829W
YNO584J
-$V8400Q
+SV8400Q
FY4523X
KS0016J
OB3016R
@@ -14,7 +14,7 @@
QH0205V
UH2093Z
GW3760Y
-$O2306T
+SO2306T
XT8204F
MR6804I
OX5866M |
Also, on your testdata from #403: With the version visible here I get...
--- lstm-user-pattern-example-shreeshrii.tess-without-patterns.txt 2019-03-11 23:36:12.290519581 +0100
+++ lstm-user-pattern-example-shreeshrii.tess-with-patterns.txt 2019-03-19 22:02:51.265794082 +0100
@@ -1,7 +1,7 @@
OOLI9T7J
OX345PT
PT789SM
-BA4090T
+BA409QT
OMOOKM
WE432LM
When decoding with user patterns exclusively, I get...
--- lstm-user-pattern-example-shreeshrii.tess-without-patterns.txt 2019-03-11 23:36:12.290519581 +0100
+++ lstm-user-pattern-example-shreeshrii.tess-with-patterns.txt 2019-03-20 09:12:02.214497123 +0100
@@ -1,21 +1,15 @@
-OOLI9T7J
OX345PT
PT789SM
-BA4090T
-OMOOKM
+BA409QT
WE432LM
-OOLI9T7
OX345PT
PT789SM
BA409QT
-OMOOKMI
WE432LM
-OOLI9T7
OX345PT
PT789SM
BA409QT
-OMOOKMI
WE432LM
So clearly, the user patterns make a difference, and exclusivity does not coerce more results, but simply causes violating lines to not produce results any more. |
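The effect seen in the diffs above can be checked mechanically by listing which output lines violate the pattern. The Python sketch below assumes the pattern for this test data is `\A\A\d\d\d\A\A` (inferred from the description "two capital letters, three digits, two capital letters") and translates only the two classes that appear in this thread into a regular expression; it is not Tesseract's own pattern parser, and `pattern_to_regex` and `violations` are hypothetical helpers.

```python
import re

# Assumed meaning of the two classes used in this thread:
# \A = upper-case letter, \d = digit (ASCII only in this sketch).
CLASS_TO_REGEX = {"A": "[A-Z]", "d": "[0-9]"}


def pattern_to_regex(pattern):
    """Translate a user pattern such as \\A\\A\\d into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        nxt = pattern[i + 1] if i + 1 < len(pattern) else ""
        if pattern[i] == "\\" and nxt in CLASS_TO_REGEX:
            parts.append(CLASS_TO_REGEX[nxt])
            i += 2
        else:
            # Any other character is taken literally.
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def violations(lines, pattern):
    """Return the lines that do not match the pattern as a whole."""
    regex = pattern_to_regex(pattern)
    return [line for line in lines if not regex.fullmatch(line)]


PATTERN = r"\A\A\d\d\d\A\A"
without_patterns = ["OOLI9T7J", "OX345PT", "PT789SM", "BA4090T", "OMOOKM", "WE432LM"]
with_patterns = ["OOLI9T7J", "OX345PT", "PT789SM", "BA409QT", "OMOOKM", "WE432LM"]

before = violations(without_patterns, PATTERN)
after = violations(with_patterns, PATTERN)
```

On the first six lines of the #403 output quoted above, this reports three violating lines without user patterns and two with them; the two that remain are the lines that disappear in the exclusive run.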
So, looks like my version doesn't have all your changes. I had used the following way to get both the PRs, maybe that is not correct. Please let me know the best way to get all the required commits. git checkout master |
You are lacking the actual merge:
git checkout pr-2328
git merge pr-2294
This should leave the workdir with changes from both branches. |
@bertsky Thank you! That worked and now I can see the differences in output
when using user-patterns.
|
As a further test, I put all the different patterns in one patterns file and used it with the different test images. The accuracy of the recognition went down in one case (AN2). \A\A\d\d\d\d\A So would the recommended option be to use only a single pattern, or to include only dissimilar patterns in the user-patterns file? Also, Should It would also be helpful to add a unittest for this. @bertsky Thanks for fixing this long-pending issue. |
From my understanding, re-using a unified list of patterns over different kinds of input can indeed decrease its effectiveness, because the patterns would all compete for the (limited) positions in the (dawg) beams. If at all possible, I would advise keeping them separate.
No, as this setting is the default. IMHO, that variable name is not aptly chosen and should be renamed to
I fully agree.
You are welcome. As long as the above issues of weight/exclusiveness and beam scarcity remain, I consider this solution not quite complete, though. |
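Following the advice above to keep pattern lists separate, one way to organise a batch run is to pair each image with the patterns file meant for its kind of input. The Python sketch below only builds the command lines and does not run anything. The file names are hypothetical, `commands_per_pattern_file` is an illustrative helper, and the `--user-patterns` option should be checked against `tesseract --help-extra` for the build in use.

```python
def commands_per_pattern_file(jobs):
    """Build one tesseract command per image, each with its own patterns file.

    `jobs` maps an image path to the patterns file for that kind of input.
    """
    commands = []
    for image, patterns_file in jobs.items():
        # Use the image name without its extension as the output base.
        outputbase = image.rsplit(".", 1)[0]
        commands.append(
            ["tesseract", image, outputbase, "--user-patterns", patterns_file]
        )
    return commands


# Hypothetical file names, one patterns file per image set.
commands = commands_per_pattern_file(
    {"AN1.png": "AN1.patterns", "AN2.png": "AN2.patterns"}
)
```

Each resulting list could be passed to `subprocess.run` as is.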
@bertsky Should this PR be merged now as a partial fix? I remember reading somewhere that user words and patterns are only supposed to be a OK, found the reference - #297 |
Oh, I see. Well, then attaining exclusiveness would probably be a separate task, like with But the weights of the LM dawgs are all fixed in the LSTM implementation, while they used to be user-configurable before. Having just built a bridge for such parameters into
Interesting, thanks for bringing this to attention! There used to be several parameters/variables involved:
So:
If nobody has an objection, yes. |
@stweil : What is your opinion? |
Merged now. Thank you, Robert! |
Added a wiki example for API usage at https://github.com/tesseract-ocr/tesseract/wiki/APIExample-user_patterns |
@Shreeshrii splendid, thanks! BTW, @noahmetzger nearly finished a patch that will bring us good beam search alternatives. This will yield much better user words/patterns. We could even consider adding a restrictive mode (like |
@noahmetzger Thanks for the PR. @bertsky Please make the required changes for better user words/patterns. See a request from user just today - https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/-DxOrS1f4sc/Nnr7ZV6jBwAJ |