trying to add user words/patterns again: #2328

bertsky · 2019-03-15T15:13:51Z

- Pass in ParamsVectors from Tesseract (carrying values from langdata/config/api) into LSTMRecognizer::Load and LoadDictionary.
- After LSTMRecognizer's Dict is initialised (with default values), reset the variables user_{words,patterns}_{suffix,file} from the corresponding entries in the passed vector.

bertsky · 2019-03-15T15:18:22Z

Replaces previous attempt #2324. The goal is the same:

This fixes #403 and #960 (if lstm_use_matrix=1, so that the LSTMRecognizer loads an LM). On the user patterns example, the output still does not consist exclusively of pattern-conforming results, only of more such results than before (but that is another story).

If more parameters need to be shared, this can easily be extended.

Shreeshrii · 2019-03-16T16:07:39Z

@bertsky I am trying to test but not having much success so far. Can you share the use case you tested? Please also try out the following issue posted in the forum, which has a test image and results too.

https://groups.google.com/forum/#!topic/tesseract-ocr/iLeK3Ny6fcM

bertsky · 2019-03-16T16:28:50Z

@Shreeshrii I will give that user words example a try. As stated above, so far I have tested on your user patterns example from the #403 thread. With the change I get fewer results violating the user pattern (two capital letters, three digits, two capital letters). I was also going to try changing the weight constants to boost dict results over free symbol paths, but did not have time to test that yet.

bertsky · 2019-03-16T16:31:09Z

Regarding the weight of these new dawgs against others and free symbol beam path, I believe that kWorstDictCertainty in linerec.cpp and kDictRatio in lstmrecognizer.cpp are relevant (both have been living as fixed translation unit constants so far). To make them exclusive would be more difficult, though.

Shreeshrii · 2019-03-16T16:50:48Z

Also see
https://stackoverflow.com/questions/17209919/tesseract-user-patterns/27159887

bertsky · 2019-03-16T16:56:20Z

Yes, I am aware of that and some more Stack Overflow threads on this. If I get the time, I will have a look. In principle these should be solved now (modulo the issue of proper weights or exclusiveness).

bertsky · 2019-03-20T07:57:20Z

> Regarding the weight of these new dawgs against others and free symbol beam path, I believe that kWorstDictCertainty in linerec.cpp and kDictRatio in lstmrecognizer.cpp are relevant

Testing revealed that hypothesis to be false. On the other hand, I found that exclusiveness is in fact easy to achieve: One merely needs to skip the non-dawg beams in ExtractBestPaths.

But there is a downside that probably does not make this a good solution either: It is perfectly possible to get no results at all then (in place of user word/pattern violations), because the beam is generally too narrow. And it does not suffice to increase kBeamWidths to alleviate that: the issue is narrowness in the input hypotheses, not in the path alternatives. (Input hypotheses other than the 2 top scoring LSTM outputs enter the beam only if the 2 top candidates are invalid continuations of previous UnicharCompress codes.)

I am still trying to understand how this could be overcome. It is not enough to simply allow more than 2 top candidates to enter the beam. Maybe the user pattern should actually be used as a hard constraint when extending the beam (instead of a soft constraint based on path score), not merely when extracting/back-tracking from it. (But simply not pursuing non-dawg beams and eliminating the other dawgs from dict_ does not work.)
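To illustrate why a wider beam does not help when the required symbol never enters it, here is a minimal toy sketch in Python. It is not Tesseract's beam search: the function names, the scores, and the regex standing in for a user pattern are all hypothetical, and the only behaviour borrowed from the comment above is that just the top 2 candidates per step may extend a path.

```python
import re

def beam_decode(steps, beam_width, top_n=2):
    """Toy beam search over a list of {symbol: log_prob} dicts.

    Only the top_n best symbols of each step may extend a path, which
    loosely mimics the limited input hypotheses described above.
    """
    beam = [("", 0.0)]
    for scores in steps:
        best = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        extended = [(text + sym, total + lp)
                    for text, total in beam
                    for sym, lp in best]
        beam = sorted(extended, key=lambda p: p[1], reverse=True)[:beam_width]
    return beam

def extract_exclusive(beam, pattern):
    """Keep only paths that fully match the pattern (hard filter at extraction)."""
    return [text for text, _ in beam if re.fullmatch(pattern, text)]

# Hypothetical scores for a 3-symbol line. The true text is "S09", but at
# the first step "S" is ranked only third, behind "$" and "5".
steps = [
    {"$": -0.2, "5": -1.9, "S": -2.5},
    {"0": -0.1, "O": -2.4},
    {"9": -0.1, "g": -3.0},
]
pattern = r"[A-Z][0-9][0-9]"  # stand-in for a user pattern, not Tesseract syntax

narrow = extract_exclusive(beam_decode(steps, beam_width=2), pattern)
wide = extract_exclusive(beam_decode(steps, beam_width=100), pattern)
print(narrow, wide)  # both lists are empty
```

Both calls return an empty list: since "S" never enters the beam, no beam width can produce a conforming path, and exclusive extraction yields nothing. In this toy, admitting a third candidate would recover "S09", but as noted above, that alone is reported not to be enough in Tesseract.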

Shreeshrii · 2019-03-20T14:37:56Z

@bertsky Please see #960 (comment)
My earlier tests had indicated that these features have NOT been working for a while.

https://groups.google.com/forum/#!searchin/tesseract-ocr/user-patterns%7Csort:date/tesseract-ocr/S9CIK3jOMWw/MnXAws3qdfMJ had a suggestion about a different approach for this using whitelist - see post by dm.

bertsky · 2019-03-20T15:01:07Z

@Shreeshrii I am not sure what you are trying to say. As stated above, I am aware of #403 and its duplicate, #960. This PR fixes both for LSTMs, and in the proper way (I think). The idea sketched on the ML by dm is a very crude approach that (I believe) cannot be done right behind an API, and is not general enough for arbitrary user words/patterns.

The only remaining problem here is the weight or exclusiveness of these constraints (as discussed by Ray), and how this can be accomplished. My last comment takes on these – I welcome any suggestions.

Shreeshrii · 2019-03-20T17:19:31Z

@bertsky Please see the test files in attached zip file. There is no difference in the output with or without use of user-patterns, so it seems to me that the option does not work.

Here are the commands used to create first set of files. Hopefully the syntax for using lstm_use_matrix is correct.

tesseract ~/TEST/USER/AN1/eng.Arial_Unicode_MS.exp0.tif ~/TEST/USER/AN1/eng.Arial_Unicode_MS  
tesseract ~/TEST/USER/AN1/eng.Arial_Unicode_MS.exp0.tif ~/TEST/USER/AN1/eng.Arial_Unicode_MS-UP  --user-patterns ~/TEST/USER/AN1-patterns.txt -c lstm_use_matrix=1

ubuntu@tesseract-ocr:~/TEST$ wdiff   --no-common ~/TEST/USER/AN1/eng.Arial_Unicode_MS.txt ~/TEST/USER/AN1/eng.Arial_Unicode_MS-UP.txt

======================================================================
ubuntu@tesseract-ocr:~/TEST$ wdiff  --no-common  ~/TEST/USER/AN2/eng.Arial_Unicode_MS.txt ~/TEST/USER/AN2/eng.Arial_Unicode_MS-UP.txt

======================================================================
ubuntu@tesseract-ocr:~/TEST$ wdiff  --no-common  ~/TEST/USER/AN3/eng.Arial_Unicode_MS.txt ~/TEST/USER/AN3/eng.Arial_Unicode_MS-UP.txt

======================================================================
ubuntu@tesseract-ocr:~/TEST$ wdiff  --no-common  ~/TEST/USER/TEL/eng.Arial_Unicode_MS.txt ~/TEST/USER/TEL/eng.Arial_Unicode_MS-UP.txt

======================================================================

user-patterns-test.zip

bertsky · 2019-03-20T17:56:16Z

unzip user-patterns-test.zip
sed -i 's,~/TEST/,,g;s/wdiff --no-common/diff -u/' USER/patterns-test.sh
bash USER/patterns-test.sh

yields:

--- USER/AN1/eng.Arial_Unicode_MS.txt	2019-03-20 18:52:57.603430451 +0100
+++ USER/AN1/eng.Arial_Unicode_MS-UP.txt	2019-03-20 18:52:58.199440833 +0100
@@ -1,12 +1,12 @@
 DQ2679M
-LO62171
+LO6217I
 QK2101G
 JB0363H
 KN2873M
 ZB0929J
 JF3829W
 YNO584J
-$V8400Q
+SV8400Q
 FY4523X
 KS0016J
 OB3016R
@@ -14,7 +14,7 @@
 QH0205V
 UH2093Z
 GW3760Y
-$O2306T
+SO2306T
 XT8204F
 MR6804I
 OX5866M

bertsky · 2019-03-20T18:05:42Z

Also, on your testdata from #403: With the version visible here I get...

--- lstm-user-pattern-example-shreeshrii.tess-without-patterns.txt	2019-03-11 23:36:12.290519581 +0100
+++ lstm-user-pattern-example-shreeshrii.tess-with-patterns.txt	2019-03-19 22:02:51.265794082 +0100
@@ -1,7 +1,7 @@
 OOLI9T7J
 OX345PT
 PT789SM
-BA4090T
+BA409QT
 OMOOKM
 WE432LM

When decoding with user patterns exclusively, I get...

--- lstm-user-pattern-example-shreeshrii.tess-without-patterns.txt	2019-03-11 23:36:12.290519581 +0100
+++ lstm-user-pattern-example-shreeshrii.tess-with-patterns.txt	2019-03-20 09:12:02.214497123 +0100
@@ -1,21 +1,15 @@
-OOLI9T7J
 OX345PT
 PT789SM
-BA4090T
-OMOOKM
+BA409QT
 WE432LM
 
-OOLI9T7
 OX345PT
 PT789SM
 BA409QT
-OMOOKMI
 WE432LM
 
-OOLI9T7
 OX345PT
 PT789SM
 BA409QT
-OMOOKMI
 WE432LM

So clearly, the user patterns make a difference, and exclusivity does not coerce more results, but simply causes violating lines to not produce results any more.
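The diffs above can also be checked mechanically. The following is a small Python sketch that translates a user pattern into a regular expression and lists the lines that violate it. The pattern `\A\A\d\d\d\A\A` is assumed from the description "two capital letters, three digits, two capital letters" (the actual patterns file is not shown here), and mapping `\A` and `\d` to ASCII classes is a simplification: Tesseract resolves these classes through the unicharset. The helper names are hypothetical.

```python
import re

# Assumed ASCII approximation of the two pattern classes used in this thread.
CLASS_TO_REGEX = {"A": "[A-Z]", "d": "[0-9]"}

def user_pattern_to_regex(pattern):
    """Translate a user pattern (backslash classes plus literals) into a regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            cls = pattern[i + 1]
            if cls not in CLASS_TO_REGEX:
                raise ValueError(f"unsupported pattern class: \\{cls}")
            parts.append(CLASS_TO_REGEX[cls])
            i += 2
        else:
            # Anything else is taken literally.
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))

def violations(lines, pattern):
    """Return the lines that do not fully match the user pattern."""
    regex = user_pattern_to_regex(pattern)
    return [line for line in lines if not regex.fullmatch(line)]

# First block of the #403 example output, copied from the diffs above.
without_patterns = ["OOLI9T7J", "OX345PT", "PT789SM", "BA4090T", "OMOOKM", "WE432LM"]
with_patterns = ["OOLI9T7J", "OX345PT", "PT789SM", "BA409QT", "OMOOKM", "WE432LM"]
pattern = r"\A\A\d\d\d\A\A"

print(violations(without_patterns, pattern))  # ['OOLI9T7J', 'BA4090T', 'OMOOKM']
print(violations(with_patterns, pattern))     # ['OOLI9T7J', 'OMOOKM']
```

Under these assumptions, the two lines still reported as violations with user patterns enabled are exactly the ones that disappear from the first block when decoding exclusively.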

Shreeshrii · 2019-03-20T18:51:07Z

So, looks like my version doesn't have all your changes.

I had used the following way to get both the PRs, maybe that is not correct. Please let me know the best way to get all the required commits.

git checkout master
git pull origin master
git fetch origin pull/2294/head:pr-2294
git checkout pr-2294
git fetch origin pull/2328/head:pr-2328
git checkout pr-2328
make
make training

bertsky · 2019-03-20T19:15:38Z

You are lacking the actual merge:

git checkout pr-2328
git merge pr-2294

This should leave the workdir with changes from both branches.

Shreeshrii · 2019-03-20T19:54:23Z

@bertsky Thank you! That worked and now I can see the differences in output when using user-patterns.

Shreeshrii · 2019-03-20T20:26:43Z

As a further test, I put all the different patterns in one patterns file and used it with the different test images. The accuracy of the recognition went down in one case (AN2).

\A\A\d\d\d\d\A
\A\A\d\d\d\A
\A\A\A\A\A\d\d\d\A
1-\d\d\d-\d\d\d-\d\d\d\d

So would the recommended option be to use only a single pattern, or to include only dissimilar patterns in the user-patterns file?

Also, should -c lstm_use_matrix=1 be put in a config file to make it more user-friendly?

It would also be helpful to add a unit test for this.

@bertsky Thanks for fixing this long pending issue.

bertsky · 2019-03-20T21:14:19Z

> So would the recommended option be to use only a single pattern or only dissimilar patterns?

From my understanding, re-using a unified list of patterns over different kinds of input can indeed decrease their effectiveness, because they would all compete over the (limited) positions in the (dawg) beams. If at all possible, I would advise keeping them separate. ~~Naturally, this will become a requirement, should we be able to find a solution to increase the patterns' weight or raising them to exclusiveness.~~

> Also, should -c lstm_use_matrix=1 be put in a config file to make it more user-friendly?

No, as this setting is the default. IMHO, that variable name is not aptly chosen and should be renamed to lstm_use_lm.

> It would also be helpful to add a unit test for this.

I fully agree.

> Thanks for fixing this long pending issue.

You are welcome. As long as the above issues of weight/exclusiveness and beam scarcity remain, I consider this solution not quite complete, though.

Shreeshrii · 2019-03-21T11:59:38Z

> As long as the above issues of weight/exclusiveness and beam scarcity remain, I consider this solution not quite complete, though.

@bertsky Should this PR be merged now as a partial fix?

I remember reading somewhere that user words and patterns are only supposed to be a hint, so that functionality has certainly been restored.

OK, found the reference - #297

bertsky · 2019-03-21T17:17:30Z

> I remember reading somewhere that user words and patterns are only supposed to be a hint

Oh, I see. Well, then attaining exclusiveness would probably be a separate task, like with tessedit_enable_dict_correction.

But the weights of the LM dawgs are all fixed in the LSTM implementation, while they used to be user-configurable before. Having just built a bridge for such parameters into LSTMRecognizer, I think I should make those weights accessible, too.

> OK, found the reference - #297

Interesting, thanks for bringing this to attention!

There used to be several parameters/variables involved:

language_model_penalty_non_dict_word with calltree:

LanguageModel::ComputeAdjustedPathCost
LanguageModel::AddViterbiStateEntry
LanguageModel::UpdateState
Wordrec::UpdateSegSearchNodes
Wordrec::InitialSegSearch
Wordrec::SegSearch
Wordrec::chop_word_main
Wordrec::cc_recog
Tesseract::recog_word_recursive
Tesseract::recog_word
Tesseract::tess_segment_pass_n
Tesseract::match_word_pass_n
Tesseract::classify_word_pass1
Tesseract::classify_word_and_language
Tesseract::RecogAllWordsPassN
Tesseract::recog_all_words
TessBaseAPI::Recognize

stopper_nondict_certainty_base with calltree:

Dict::AcceptableChoice                  Dict::AcceptableResult
LanguageModel::UpdateBestChoice         │
LanguageModel::AddViterbiStateEntry     │
LanguageModel::UpdateState              │
Wordrec::UpdateSegSearchNodes           │
Wordrec::InitialSegSearch               │
Wordrec::SegSearch                      │
Wordrec::chop_word_main                 │
Wordrec::cc_recog                       │
Tesseract::recog_word_recursive         Tesseract::SearchWords
Tesseract::recog_word                   Tesseract::LSTMRecognizeWord
Tesseract::tess_segment_pass_n          │
Tesseract::match_word_pass_n            │
Tesseract::classify_word_pass1          ┘
Tesseract::classify_word_and_language
Tesseract::RecogAllWordsPassN
Tesseract::recog_all_words
TessBaseAPI::Recognize

max_permuter_attempts with calltree:

Dict::dawg_permute_and_select
Dict::NoDangerousAmbig
LanguageModel::ConstructWord
LanguageModel::UpdateBestChoice
LanguageModel::AddViterbiStateEntry
LanguageModel::UpdateState
Wordrec::UpdateSegSearchNodes
Wordrec::InitialSegSearch
Wordrec::SegSearch
Wordrec::chop_word_main
Wordrec::cc_recog
Tesseract::recog_word_recursive
Tesseract::recog_word
Tesseract::tess_segment_pass_n
Tesseract::match_word_pass_n
Tesseract::classify_word_pass1
Tesseract::classify_word_and_language
Tesseract::RecogAllWordsPassN
Tesseract::recog_all_words
TessBaseAPI::Recognize

tessedit_enable_dict_correction with calltree:

Tesseract::dictionary_correction_pass
WERD_RES::best_choices
WERD_CHOICE_IT::data
Dict::valid_word

So: stopper_nondict_certainty_base and max_permuter_attempts could be bridged as well, tessedit_enable_dict_correction would still need to find the choice iterator populated with alternatives (which I am not certain #2310 already accomplished), but language_model_penalty_non_dict_word has no direct LSTM correspondence (and I am not sure how to make a useful new parameter out of kWorstDictCertainty and kDictRatio).

> Should this PR be merged now as a partial fix?

If nobody has an objection, yes.

zdenop · 2019-03-24T17:22:34Z

@stweil : What is your opinion?

stweil · 2019-03-24T18:39:12Z

Merged now. Thank you, Robert!

Shreeshrii · 2019-07-05T04:26:53Z

Added a wiki example for API usage at https://github.com/tesseract-ocr/tesseract/wiki/APIExample-user_patterns

bertsky · 2019-07-05T19:22:28Z

@Shreeshrii splendid, thanks!

BTW, @noahmetzger has nearly finished a patch that will bring us good beam search alternatives. This will yield much better results for user words/patterns. We could even consider adding a restrictive mode (like tessedit_enable_dict_correction) then, because we would not risk getting empty results.

Shreeshrii · 2019-07-10T15:02:36Z

@noahmetzger Thanks for the PR.

@bertsky Please make the required changes for better user words/patterns.

See a request from user just today - https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/-DxOrS1f4sc/Nnr7ZV6jBwAJ

bertsky mentioned this pull request Mar 15, 2019

trying to add user words/patterns again: #2324

Closed

bertsky mentioned this pull request Mar 15, 2019

LSTM: User patterns do not work #403

Closed

bertsky mentioned this pull request Mar 23, 2019

trying to add tessedit_char_whitelist etc. again: #2294

Merged

stweil merged commit 58423d2 into tesseract-ocr:master Mar 24, 2019

bertsky deleted the lstm-with-user-patterns2 branch March 24, 2019 23:36

bertsky mentioned this pull request Apr 17, 2019

also fill PAGE's glyphs and its variants and confidences via GetIterator() in recognize.py OCR-D/ocrd_tesserocr#7

Closed

bertsky mentioned this pull request Sep 19, 2019

fix langdata (user words/patterns) file suffixes for LSTMs: #2663

Merged

bertsky mentioned this pull request Oct 9, 2019

Change Tesseract output with words coming from an external dictionary #2391

Closed

Shreeshrii mentioned this pull request Dec 10, 2019

Training parts lists tesseract-ocr/tesstrain#131

Closed

amitdo added the enhancement label Mar 23, 2021
