trying to add user words/patterns again: #2328

Merged: 1 commit into tesseract-ocr:master, Mar 24, 2019


bertsky
Contributor

@bertsky bertsky commented Mar 15, 2019

  • pass in ParamsVectors from Tesseract
    (carrying values from langdata/config/api)
    into LSTMRecognizer::Load and LoadDictionary
  • after LSTMRecognizer's Dict is initialised
    (with default values), reset the variables
    user_{words,patterns}_{suffix,file} from the
    corresponding entries in the passed vector

@bertsky
Contributor Author

bertsky commented Mar 15, 2019

Replaces previous attempt #2324. The goal is the same:

This fixes #403 and #960 (if lstm_use_matrix=1 so that the LSTMRecognizer loads a LM). The user patterns example does not give me those patterns exclusively, only more than before (but this is another story).

If more parameters need to be shared, this can easily be extended.

@Shreeshrii
Collaborator

@bertsky I am trying to test but not having much success so far. Can you share the user case you tested? Please try out the following issue posted in the forum that has a test image and results too.

https://groups.google.com/forum/#!topic/tesseract-ocr/iLeK3Ny6fcM

@bertsky
Contributor Author

bertsky commented Mar 16, 2019

@Shreeshrii I will give that user words example a try. As stated above, so far I have tested on your user patterns example from the #403 thread. With the change I get fewer results violating the user pattern (two capital letters, three digits, two capital letters). I was also going to try changing the weight constants to boost dict results over free symbol paths, but have not had time to test that yet.

@bertsky
Contributor Author

bertsky commented Mar 16, 2019

Regarding the weight of these new dawgs against others and free symbol beam path, I believe that kWorstDictCertainty in linerec.cpp and kDictRatio in lstmrecognizer.cpp are relevant (both have been living as fixed translation unit constants so far). To make them exclusive would be more difficult, though.

@Shreeshrii
Collaborator

@bertsky
Contributor Author

bertsky commented Mar 16, 2019

Yes, I am aware of that and of some more Stack Overflow threads on this. If I get the time, I will have a look. In principle these should be solved now (modulo the issue of proper weights or exclusiveness).

@bertsky
Contributor Author

bertsky commented Mar 20, 2019

> Regarding the weight of these new dawgs against others and free symbol beam path, I believe that kWorstDictCertainty in linerec.cpp and kDictRatio in lstmrecognizer.cpp are relevant

Testing revealed that hypothesis to be false. On the other hand, I found that exclusiveness is in fact easy to achieve: One merely needs to skip the non-dawg beams in ExtractBestPaths.

But there is a downside that probably does not make this a good solution either: it is perfectly possible to get no results at all then (in place of user word/pattern violations), because the beam is generally too narrow. Increasing kBeamWidths does not suffice to alleviate that: the issue is narrowness in the input hypotheses, not in the path alternatives. (Input hypotheses other than the 2 top-scoring LSTM outputs enter the beam only if the 2 top candidates are invalid continuations of previous UnicharCompress codes.)
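
The effect can be reproduced in a toy model. The sketch below is not Tesseract's beam search: the scores, the function names and the regex standing in for a user pattern are all made up for illustration, and UnicharCompress codes are not modelled at all. It only shows that when just the 2 top-scoring symbols per step enter the beam, filtering at extraction time can leave nothing, and that a wider beam does not change this.

```python
import re


def toy_beam_search(steps, beam_width, top_n=2):
    """Extend each path with the top_n symbols of every step; keep beam_width paths."""
    beam = [("", 0.0)]  # (text, accumulated log-score)
    for scores in steps:
        best_symbols = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        extended = [(text + sym, total + s)
                    for text, total in beam
                    for sym, s in best_symbols]
        beam = sorted(extended, key=lambda p: p[1], reverse=True)[:beam_width]
    return beam


def extract_best(beam, pattern, exclusive):
    """Best-scoring text; in exclusive mode only paths matching the pattern count."""
    candidates = [p for p in beam if re.fullmatch(pattern, p[0])] if exclusive else beam
    return max(candidates, key=lambda p: p[1])[0] if candidates else None


# Invented per-step log-scores; the digit in the last step is only third-best.
steps = [
    {"B": -0.1, "8": -2.5, "R": -4.0},
    {"A": -0.2, "4": -3.0, "R": -4.0},
    {"O": -0.3, "Q": -1.0, "0": -1.5},
]
pattern = r"[A-Z][A-Z][0-9]"  # two capital letters, then one digit

narrow = toy_beam_search(steps, beam_width=2)
wide = toy_beam_search(steps, beam_width=100)

print(extract_best(narrow, pattern, exclusive=False))  # BAO
print(extract_best(narrow, pattern, exclusive=True))   # None
print(extract_best(wide, pattern, exclusive=True))     # None
```

With `beam_width=100` the beam holds all 8 combinations of the top-2 symbols, yet none of them ends in a digit, so the exclusive extraction still comes back empty.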

I am still trying to understand how this could be overcome. It is not enough to simply allow more than 2 top candidates to enter the beam. Maybe the user pattern should actually be used as a hard constraint when extending the beam (instead of a soft constraint based on path score), not merely when extracting/back-tracking from it. (But simply not pursuing non-dawg beams and eliminating the other dawgs from dict_ does not work.)

@Shreeshrii
Collaborator

@bertsky Please see #960 (comment)
My earlier tests had indicated that these features have NOT been working for a while.

https://groups.google.com/forum/#!searchin/tesseract-ocr/user-patterns%7Csort:date/tesseract-ocr/S9CIK3jOMWw/MnXAws3qdfMJ had a suggestion about a different approach for this using whitelist - see post by dm.

@bertsky
Contributor Author

bertsky commented Mar 20, 2019

@Shreeshrii I am not sure what you are trying to say. As stated above, I am aware of #403 and its duplicate, #960. This PR fixes both for LSTMs, and in the proper way (I think). The idea sketched on the ML by dm is a very crude approach that (I believe) cannot be done right behind an API, and is not general enough for arbitrary user words/patterns.

The only remaining problem here is the weight or exclusiveness of these constraints (as discussed by Ray), and how this can be accomplished. My last comment takes on these – I welcome any suggestions.

@Shreeshrii
Collaborator

Shreeshrii commented Mar 20, 2019

@bertsky Please see the test files in attached zip file. There is no difference in the output with or without use of user-patterns, so it seems to me that the option does not work.

Here are the commands used to create first set of files. Hopefully the syntax for using lstm_use_matrix is correct.

tesseract ~/TEST/USER/AN1/eng.Arial_Unicode_MS.exp0.tif ~/TEST/USER/AN1/eng.Arial_Unicode_MS  
tesseract ~/TEST/USER/AN1/eng.Arial_Unicode_MS.exp0.tif ~/TEST/USER/AN1/eng.Arial_Unicode_MS-UP  --user-patterns ~/TEST/USER/AN1-patterns.txt -c lstm_use_matrix=1

ubuntu@tesseract-ocr:~/TEST$ wdiff   --no-common ~/TEST/USER/AN1/eng.Arial_Unicode_MS.txt ~/TEST/USER/AN1/eng.Arial_Unicode_MS-UP.txt

======================================================================
ubuntu@tesseract-ocr:~/TEST$ wdiff  --no-common  ~/TEST/USER/AN2/eng.Arial_Unicode_MS.txt ~/TEST/USER/AN2/eng.Arial_Unicode_MS-UP.txt

======================================================================
ubuntu@tesseract-ocr:~/TEST$ wdiff  --no-common  ~/TEST/USER/AN3/eng.Arial_Unicode_MS.txt ~/TEST/USER/AN3/eng.Arial_Unicode_MS-UP.txt

======================================================================
ubuntu@tesseract-ocr:~/TEST$ wdiff  --no-common  ~/TEST/USER/TEL/eng.Arial_Unicode_MS.txt ~/TEST/USER/TEL/eng.Arial_Unicode_MS-UP.txt

======================================================================

user-patterns-test.zip

@bertsky
Contributor Author

bertsky commented Mar 20, 2019

unzip user-patterns-test.zip
sed -i 's,~/TEST/,,g;s/wdiff --no-common/diff -u/' USER/patterns-test.sh
bash USER/patterns-test.sh

yields:

--- USER/AN1/eng.Arial_Unicode_MS.txt	2019-03-20 18:52:57.603430451 +0100
+++ USER/AN1/eng.Arial_Unicode_MS-UP.txt	2019-03-20 18:52:58.199440833 +0100
@@ -1,12 +1,12 @@
 DQ2679M
-LO62171
+LO6217I
 QK2101G
 JB0363H
 KN2873M
 ZB0929J
 JF3829W
 YNO584J
-$V8400Q
+SV8400Q
 FY4523X
 KS0016J
 OB3016R
@@ -14,7 +14,7 @@
 QH0205V
 UH2093Z
 GW3760Y
-$O2306T
+SO2306T
 XT8204F
 MR6804I
 OX5866M

@bertsky
Contributor Author

bertsky commented Mar 20, 2019

Also, on your testdata from #403: With the version visible here I get...

--- lstm-user-pattern-example-shreeshrii.tess-without-patterns.txt	2019-03-11 23:36:12.290519581 +0100
+++ lstm-user-pattern-example-shreeshrii.tess-with-patterns.txt	2019-03-19 22:02:51.265794082 +0100
@@ -1,7 +1,7 @@
 OOLI9T7J
 OX345PT
 PT789SM
-BA4090T
+BA409QT
 OMOOKM
 WE432LM

When decoding with user patterns exclusively, I get...

--- lstm-user-pattern-example-shreeshrii.tess-without-patterns.txt	2019-03-11 23:36:12.290519581 +0100
+++ lstm-user-pattern-example-shreeshrii.tess-with-patterns.txt	2019-03-20 09:12:02.214497123 +0100
@@ -1,21 +1,15 @@
-OOLI9T7J
 OX345PT
 PT789SM
-BA4090T
-OMOOKM
+BA409QT
 WE432LM
 
-OOLI9T7
 OX345PT
 PT789SM
 BA409QT
-OMOOKMI
 WE432LM
 
-OOLI9T7
 OX345PT
 PT789SM
 BA409QT
-OMOOKMI
 WE432LM

So clearly, the user patterns make a difference, and exclusivity does not coerce more results, but simply causes violating lines to not produce results any more.
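
To make the comparison concrete, the following sketch counts pattern violations in the first block of each of the three outputs quoted above. The pattern is written as `\A\A\d\d\d\A\A`, following the earlier description (two capital letters, three digits, two capital letters) and the notation of the patterns quoted elsewhere in this thread; the actual patterns file from #403 is not shown here, so that string is an assumption. The translator handles only `\A` and `\d` and treats everything else as a literal, which is a simplification.

```python
import re

# Assumed meaning of the two pattern tokens that appear in this thread.
TOKEN_TO_REGEX = {r"\A": "[A-Z]", r"\d": "[0-9]"}


def pattern_to_regex(pattern):
    """Translate a pattern line into a regex; unknown characters are literals."""
    parts = []
    i = 0
    while i < len(pattern):
        token = pattern[i:i + 2]
        if token in TOKEN_TO_REGEX:
            parts.append(TOKEN_TO_REGEX[token])
            i += 2
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def violations(lines, pattern):
    """Return the lines that do not match the pattern as a whole."""
    regex = pattern_to_regex(pattern)
    return [line for line in lines if not regex.fullmatch(line)]


PATTERN = r"\A\A\d\d\d\A\A"

# First block of each output, copied from the diffs above.
without_patterns = ["OOLI9T7J", "OX345PT", "PT789SM", "BA4090T", "OMOOKM", "WE432LM"]
with_patterns = ["OOLI9T7J", "OX345PT", "PT789SM", "BA409QT", "OMOOKM", "WE432LM"]
exclusive = ["OX345PT", "PT789SM", "BA409QT", "WE432LM"]

for name, lines in [("without", without_patterns),
                    ("with", with_patterns),
                    ("exclusive", exclusive)]:
    print(name, violations(lines, PATTERN))
```

This gives three violating lines without patterns, two with patterns, and none in the exclusive run, where the two remaining offenders are simply absent from the output.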

@Shreeshrii
Collaborator

So, looks like my version doesn't have all your changes.

I had used the following way to get both the PRs, maybe that is not correct. Please let me know the best way to get all the required commits.

git checkout master
git pull origin master
git fetch origin pull/2294/head:pr-2294
git checkout pr-2294
git fetch origin pull/2328/head:pr-2328
git checkout pr-2328
make
make training

@bertsky
Contributor Author

bertsky commented Mar 20, 2019

You are lacking the actual merge:

git checkout pr-2328
git merge pr-2294

This should leave the workdir with changes from both branches.

@Shreeshrii
Collaborator

Shreeshrii commented Mar 20, 2019 via email

@Shreeshrii
Collaborator

Shreeshrii commented Mar 20, 2019

As a further test, I put all the different patterns in one patterns file and used it with the different test images. The accuracy of the recognition went down in one case (AN2).

\A\A\d\d\d\d\A
\A\A\d\d\d\A
\A\A\A\A\A\d\d\d\A
1-\d\d\d-\d\d\d-\d\d\d\d

So would the recommended option be to use only a single pattern or include only dissimilar patterns in the user-patterns file?

Also, should -c lstm_use_matrix=1 be put in a config file to make it more user-friendly?

It would also be helpful to add a unittest for this.

@bertsky Thanks for fixing this long pending issue.

@bertsky
Contributor Author

bertsky commented Mar 20, 2019

> So would the recommended option be to use only a single pattern or only dissimilar patterns?

From my understanding, re-using a unified list of patterns over different kinds of input can indeed decrease its effectiveness, because the patterns would all compete for the (limited) positions in the (dawg) beams. If at all possible, I would advise keeping them separate. Naturally, this will become a requirement, should we be able to find a solution to increase the patterns' weight or raise them to exclusiveness.
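
One rough way to make "similar" concrete for the four patterns quoted above is to count how many leading symbols two patterns share. This is only a heuristic invented for this illustration; it says nothing about how Tesseract actually scores competing patterns, and `tokenize` and `shared_prefix_len` are hypothetical helper names.

```python
def tokenize(pattern):
    """Split a pattern line into two-character escapes (backslash + letter) and literals."""
    tokens = []
    i = 0
    while i < len(pattern):
        step = 2 if pattern[i] == "\\" and i + 1 < len(pattern) else 1
        tokens.append(pattern[i:i + step])
        i += step
    return tokens


def shared_prefix_len(a, b):
    """Number of leading symbols on which two patterns agree."""
    count = 0
    for x, y in zip(tokenize(a), tokenize(b)):
        if x != y:
            break
        count += 1
    return count


# The four patterns from the combined patterns file quoted above.
patterns = [
    r"\A\A\d\d\d\d\A",
    r"\A\A\d\d\d\A",
    r"\A\A\A\A\A\d\d\d\A",
    r"1-\d\d\d-\d\d\d-\d\d\d\d",
]

for i, a in enumerate(patterns):
    for b in patterns[i + 1:]:
        print(f"{a}  vs  {b}: {shared_prefix_len(a, b)} shared leading symbols")
```

The first two patterns agree on their first five symbols and diverge only at the sixth, while the telephone-style pattern shares nothing with the others.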

> Also, should -c lstm_use_matrix=1 be put in a config file to make it more user-friendly?

No, as this setting is the default. IMHO, that variable name is not aptly chosen and should be renamed to lstm_use_lm.

> It would also be helpful to add a unittest for this.

I fully agree.

> Thanks for fixing this long pending issue.

You are welcome. As long as the above issues of weight/exclusiveness and beam scarcity remain, I consider this solution not quite complete, though.

@Shreeshrii
Collaborator

> As long as the above issues of weight/exclusiveness and beam scarcity remain, I consider this solution not quite complete, though.

@bertsky Should this PR be merged now as a partial fix?

I remember reading somewhere that user words and patterns are only supposed to be a hint, so that functionality has certainly been restored.

OK, found the reference - #297

@bertsky
Contributor Author

bertsky commented Mar 21, 2019

> I remember reading somewhere that user words and patterns are only supposed to be a hint

Oh, I see. Well, then attaining exclusiveness would probably be a separate task, like with tessedit_enable_dict_correction.

But the weights of the LM dawgs are all fixed in the LSTM implementation, while they used to be user-configurable before. Having just built a bridge for such parameters into LSTMRecognizer, I think I should make those weights accessible, too.

> OK, found the reference - #297

Interesting, thanks for bringing this to attention!

There used to be several parameters/variables involved:

  • language_model_penalty_non_dict_word with calltree:
LanguageModel::ComputeAdjustedPathCost
LanguageModel::AddViterbiStateEntry
LanguageModel::UpdateState
Wordrec::UpdateSegSearchNodes
Wordrec::InitialSegSearch
Wordrec::SegSearch
Wordrec::chop_word_main
Wordrec::cc_recog
Tesseract::recog_word_recursive
Tesseract::recog_word
Tesseract::tess_segment_pass_n
Tesseract::match_word_pass_n
Tesseract::classify_word_pass1
Tesseract::classify_word_and_language
Tesseract::RecogAllWordsPassN
Tesseract::recog_all_words
TessBaseAPI::Recognize
  • stopper_nondict_certainty_base with calltree:
Dict::AcceptableChoice                  Dict::AcceptableResult
LanguageModel::UpdateBestChoice         │
LanguageModel::AddViterbiStateEntry     │
LanguageModel::UpdateState              │
Wordrec::UpdateSegSearchNodes           │
Wordrec::InitialSegSearch               │
Wordrec::SegSearch                      │
Wordrec::chop_word_main                 │
Wordrec::cc_recog                       │
Tesseract::recog_word_recursive         Tesseract::SearchWords
Tesseract::recog_word                   Tesseract::LSTMRecognizeWord
Tesseract::tess_segment_pass_n          │
Tesseract::match_word_pass_n            │
Tesseract::classify_word_pass1          ┘
Tesseract::classify_word_and_language
Tesseract::RecogAllWordsPassN
Tesseract::recog_all_words
TessBaseAPI::Recognize
  • max_permuter_attempts with calltree:
Dict::dawg_permute_and_select
Dict::NoDangerousAmbig
LanguageModel::ConstructWord
LanguageModel::UpdateBestChoice
LanguageModel::AddViterbiStateEntry
LanguageModel::UpdateState
Wordrec::UpdateSegSearchNodes
Wordrec::InitialSegSearch
Wordrec::SegSearch
Wordrec::chop_word_main
Wordrec::cc_recog
Tesseract::recog_word_recursive
Tesseract::recog_word
Tesseract::tess_segment_pass_n
Tesseract::match_word_pass_n
Tesseract::classify_word_pass1
Tesseract::classify_word_and_language
Tesseract::RecogAllWordsPassN
Tesseract::recog_all_words
TessBaseAPI::Recognize
  • tessedit_enable_dict_correction with calltree:
Tesseract::dictionary_correction_pass
WERD_RES::best_choices
WERD_CHOICE_IT::data
Dict::valid_word

So: stopper_nondict_certainty_base and max_permuter_attempts could be bridged as well, tessedit_enable_dict_correction would still need to find the choice iterator populated with alternatives (which I am not certain #2310 already accomplished), but language_model_penalty_non_dict_word has no direct LSTM correspondence (and I am not sure how to make a useful new parameter out of kWorstDictCertainty and kDictRatio).

> Should this PR be merged now as a partial fix?

If nobody has an objection, yes.

@zdenop
Contributor

zdenop commented Mar 24, 2019

@stweil : What is your opinion?

@stweil stweil merged commit 58423d2 into tesseract-ocr:master Mar 24, 2019
@stweil
Member

stweil commented Mar 24, 2019

Merged now. Thank you, Robert!

@Shreeshrii
Collaborator

Shreeshrii commented Jul 5, 2019

Added a wiki example for API usage at https://github.com/tesseract-ocr/tesseract/wiki/APIExample-user_patterns

@bertsky
Contributor Author

bertsky commented Jul 5, 2019

@Shreeshrii splendid, thanks!

BTW, @noahmetzger has nearly finished a patch that will bring us good beam search alternatives. This will yield much better results for user words/patterns. We could even consider adding a restrictive mode (like tessedit_enable_dict_correction) then, because we would not risk getting empty results.

@Shreeshrii
Collaborator

@noahmetzger Thanks for the PR.

@bertsky Please make the required changes for better user words/patterns.

See a request from user just today - https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/-DxOrS1f4sc/Nnr7ZV6jBwAJ
