trying to add tessedit_char_whitelist etc. again: #2294

Merged: 8 commits, Apr 6, 2019

Conversation

@bertsky
Contributor

bertsky commented Mar 7, 2019

  • ignore matrix outputs in ComputeTopN if they
    belong to a disabled unichar_id
  • pass UNICHARSET refs to check that
  • in SetBlackAndWhitelist, also update the unicharset
    of the lstm_recognizer_ instance, if any
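
The first bullet can be pictured with a small sketch. It assumes, for simplicity, exactly one network output per unichar_id (the recoder discussion later in this thread shows that this does not hold in general). `TopNEnabled` and its parameters are hypothetical names for illustration, not the signature of `ComputeTopN`:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: return the indices of the n best-scoring outputs,
// skipping every output whose unichar_id is disabled.
// Simplifying assumption: output index == unichar_id (no recoder).
std::vector<int> TopNEnabled(const std::vector<float>& scores,
                             const std::vector<bool>& enabled,
                             std::size_t n) {
  assert(scores.size() == enabled.size());
  std::vector<int> ids;
  for (std::size_t i = 0; i < scores.size(); ++i) {
    if (enabled[i]) ids.push_back(static_cast<int>(i));  // whitelist filter
  }
  // Keep the n best of what survived the filter, best first.
  const std::size_t keep = std::min(n, ids.size());
  std::partial_sort(ids.begin(), ids.begin() + keep, ids.end(),
                    [&scores](int a, int b) { return scores[a] > scores[b]; });
  ids.resize(keep);
  return ids;
}
```

Filtering before ranking means a disabled character can never occupy one of the top-n slots, which is the effect the bullet describes.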

@zdenop
Contributor

zdenop commented Mar 7, 2019

Will sub_langs_ be handled within this PR? Should it be merged now, or should we wait for an additional commit?

@bertsky
Contributor Author

bertsky commented Mar 7, 2019

@zdenop, I will add that to the PR of course. It should be ready by tomorrow.

@bertsky
Contributor Author

bertsky commented Mar 7, 2019

So are there no objections to the way I use RecodedCharID (label mapping) and UnicharCompress (recoder)? I feel a little uncomfortable here.

@bertsky
Contributor Author

bertsky commented Mar 7, 2019

It works with sublangs now. But I just saw it does not work with Chinese characters yet. So I guess my RecodedCharID attempt is indeed not enough...

@Shreeshrii
Collaborator

Shreeshrii commented Mar 8, 2019

Please also test that lstmtraining, lstmeval and other training programs work with the changes.

@bertsky
Contributor Author

bertsky commented Mar 8, 2019

Thanks, @Shreeshrii, I will.

@amitdo
Collaborator

amitdo commented Mar 8, 2019

Maybe you should remove the recoding part until you figure out how to make it work?

@bertsky
Contributor Author

bertsky commented Mar 8, 2019

Well, it does work for simple characters, but not for things like CJK. And I cannot avoid using it – this is the component mediating between unicharset ids and output channels. But I am positive I will understand it by looking at the unit tests and training tools.

@Shreeshrii
Collaborator

Shreeshrii commented Mar 8, 2019

The output of unicharcompress/recoder can be seen by the following:

rm -rf ~/tesstutorial/kortest
bash  ~/tesseract/src/training/tesstrain.sh \
  --fonts_dir /usr/share/fonts \
  --lang kor \
  --linedata_only \
  --save_box_tiff \
  --workspace_dir ~/tmp \
  --exposures "0" \
  --maxpages 1 \
  --noextract_font_properties \
  --langdata_dir ~/langdata_lstm \
  --tessdata_dir ~/tessdata_best  \
  --fontlist "Arial Unicode MS" \
  --training_text ~/langdata_lstm/kor/kor.training_text \
  --output_dir ~/tesstutorial/kortest

Using 1 page of training_text input, it creates a unicharset of 659 entries, which is compressed to 111 codes using radical_stroke.txt.

659
NULL 0 NULL 0
Joined 7 0,69,188,255,486,1218,0,30,486,1188 Latin 1 0 1 Joined	# Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,69,186,255,892,2138,0,80,892,2058 Common 216 10 216 |Broken|0|1	# Broken
~ 0 111,154,139,180,90,124,4,19,100,155 Common 3 10 3 ~	# ~ [7e ]
에 1 28,63,223,255,160,206,0,31,218,230 Hangul 4 0 4 에	# 에 [c5d0 ]x
서 1 28,63,224,255,169,196,0,24,218,230 Hangul 5 0 5 서	# 서 [c11c ]x
는 1 46,75,213,255,185,204,6,22,218,230 Hangul 6 0 6 는	# 는 [b294 ]x
지 1 28,63,223,255,169,194,3,29,218,230 Hangul 7 0 7 지	# 지 [c9c0 ]x
내 1 30,63,226,255,160,186,8,34,218,230 Hangul 8 0 8 내	# 내 [b0b4 ]x
고 1 58,87,198,253,179,203,6,24,218,230 Hangul 9 0 9 고	# 고 [ace0 ]x
났 1 30,63,221,255,190,211,13,17,218,230 Hangul 10 0 10 났	# 났 [b0ac ]x
습 1 30,67,219,255,185,204,6,22,218,230 Hangul 11 0 11 습	# 습 [c2b5 ]x
니 1 30,65,223,255,152,182,13,39,218,230 Hangul 12 0 12 니	# 니 [b2c8 ]x
다 1 27,63,221,255,186,203,13,25,218,230 Hangul 13 0 13 다	# 다 [b2e4 ]x
후 1 27,64,223,255,181,203,5,24,218,230 Hangul 14 0 14 후	# 후 [d6c4 ]x
촉 1 28,63,224,255,179,204,6,25,218,230 Hangul 15 0 15 촉	# 촉 [cd09 ]x
진 1 40,69,223,255,171,203,0,27,218,230 Hangul 16 0 16 진	# 진 [c9c4 ]x
- 10 103,154,114,173,55,157,14,19,86,195 Common 17 3 17 -	# - [2d ]p
0	 
110	<nul>
1	~
62,74,90	에
60,73,90	서
53,87,93	는
63,89,90	지
53,71,90	내
51,77,90	고
53,70,103	났
60,87,100	습
53,89,90	니
54,70,90	다
69,82,90	후
65,77,91	촉
63,89,93	진
2	-

- move decision from ComputeTopN to ContinueContext, where
  it belongs: block context continuations which emit final
  codes translating to disabled unichar_ids.
  (The normal logic for fallback from top2 > topN > rest
   will apply.)
- pass UNICHARSET refs appropriately
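
A rough sketch of why the check fits better at the continuation step: with the recoder, one unichar can span several codes (에 is 62,74,90 in the listing above), so only the final code says which unichar was emitted. The names `Code` and `AllowContinuation` are hypothetical, and the decoder map is a stand-in for the recoder, not its real interface:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using Code = std::vector<int>;  // sequence of recoder output codes

// Hypothetical check: may `prefix` be extended by `next`?
// `decoder` maps complete code sequences to unichar_ids.
// A partial sequence is always allowed; the whitelist is consulted only
// when the extension completes a sequence, i.e. on the final code.
bool AllowContinuation(const Code& prefix, int next,
                       const std::map<Code, int>& decoder,
                       const std::vector<bool>& enabled) {
  Code full = prefix;
  full.push_back(next);
  const auto it = decoder.find(full);
  if (it == decoder.end()) return true;  // not a final code yet
  const int unichar_id = it->second;
  assert(unichar_id >= 0 &&
         static_cast<std::size_t>(unichar_id) < enabled.size());
  return enabled[unichar_id];
}
```

With 는 (53,87,93, id 6) enabled and 내 (53,71,90, id 8) disabled, the shared first code 53 stays allowed and only the last step of 내 is blocked. In this sketch a prefix that can only lead to disabled characters is still extended; whether the real beam search prunes such prefixes earlier is not something this thread states.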
@bertsky
Contributor Author

bertsky commented Mar 8, 2019

Found it. I believe I do understand now. It does work for Chinese characters, too. I am pretty sure it works for all other cases as well.

@Shreeshrii, I very much doubt that training will be affected, as this is a runtime variable. Also, I have never done training before. What would be required? The lstmtraining tutorial of the training page in the wiki? How do I tell it still works after the process?

@bertsky
Contributor Author

bertsky commented Mar 8, 2019

@stweil, do you want me to squash the three commits into one (with a better changelog) on this PR?

@Shreeshrii
Collaborator

Shreeshrii commented Mar 8, 2019

@bertsky I applied the PR to my local installation and I am getting the following error with lstmeval. Recode beam search is also used by the training programs, so some more changes may be required there.

(gdb) run
Starting program: /usr/local/bin/lstmeval --verbosity -1 --model /home/ubuntu/tesstutorial/IAST_PLUS/plusminus_checkpoint --traineddata /home/ubuntu/tesstutorial/IAST_PLUS/iast/iast.traineddata --eval_listfile /home/ubuntu/tesstutorial/IAST_eval_1/iast.training_files.txt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
/home/ubuntu/tesstutorial/IAST_PLUS/plusminus_checkpoint is not a recognition model, trying training checkpoint...
[New Thread 0x3fffb6e5f100 (LWP 30849)]
Loaded 116/116 pages (1-116) of document /home/ubuntu/tesstutorial/IAST_eval_1/iast.Gentium_Basic_Italic.exp0.lstmf
[Thread 0x3fffb6e5f100 (LWP 30849) exited]
[New Thread 0x3fffb6e5f100 (LWP 30850)]
Loaded 116/116 pages (1-116) of document /home/ubuntu/tesstutorial/IAST_eval_1/iast.Gentium_Book_Basic_Italic.exp0.lstmf
[Thread 0x3fffb6e5f100 (LWP 30850) exited]
[New Thread 0x3fffb6e5f100 (LWP 30851)]
[New Thread 0x3fffb665f100 (LWP 30852)]
[New Thread 0x3fffb5e5f100 (LWP 30853)]

Thread 1 "lstmeval" received signal SIGSEGV, Segmentation fault.
0x00003fffb7dc76b4 in UNICHARSET::get_enabled (this=0x0, unichar_id=<optimized out>) at ../../src/ccutil/unicharset.h:874
874         return unichars[unichar_id].properties.enabled;
(gdb) backtrace
#0  0x00003fffb7dc76b4 in UNICHARSET::get_enabled (this=0x0, unichar_id=<optimized out>) at ../../src/ccutil/unicharset.h:874
#1  tesseract::RecodeBeamSearch::ContinueContext (this=0x11078440, prev=0x0, index=<optimized out>, outputs=0x181d1eb0, top_n_flag=<optimized out>, charset=0x0,
    dict_ratio=1, cert_offset=0, worst_dict_cert=-20, step=0x164c63c0) at recodebeam.cpp:629
#2  0x00003fffb7dc7ec4 in tesseract::RecodeBeamSearch::DecodeStep (this=0x11078440, outputs=0x181d1eb0, t=<optimized out>, dict_ratio=1, cert_offset=0,
    worst_dict_cert=-20, charset=0x0, debug=<optimized out>) at recodebeam.cpp:501
#3  0x00003fffb7dc849c in tesseract::RecodeBeamSearch::Decode (this=0x11078440, output=..., dict_ratio=1, cert_offset=0, worst_dict_cert=-20, charset=0x0,
    lstm_choice_mode=<optimized out>) at recodebeam.cpp:92
#4  0x00003fffb7da32dc in tesseract::LSTMRecognizer::LabelsViaReEncode (this=0x3fffffffdc80, output=..., labels=0x3fffffffd928, xcoords=0x3fffffffd948)
    at lstmrecognizer.cpp:436
#5  0x00003fffb7da4084 in tesseract::LSTMRecognizer::LabelsFromOutputs (this=0x3fffffffdc80, outputs=..., labels=0x3fffffffd928, xcoords=<optimized out>)
    at lstmrecognizer.cpp:423
#6  0x00003fffb7dabbd4 in tesseract::LSTMTrainer::PrepareForBackward (this=0x3fffffffdc78, trainingdata=0x3fffb0000910, fwd_outputs=0x3fffffffdb58,
    targets=0x3fffffffdbe8) at lstmtrainer.cpp:867
#7  0x000000001000a198 in tesseract::LSTMTester::RunEvalSync (this=0x3fffffffebc0, iteration=<optimized out>, training_errors=<optimized out>, model_mgr=...,
    training_stage=<optimized out>, verbosity=<optimized out>) at lstmtester.cpp:101
#8  0x0000000010004914 in main (argc=<optimized out>, argv=<optimized out>) at lstmeval.cpp:79
(gdb)
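
The backtrace shows `charset=0x0` being passed from `Decode` down to `ContinueContext`, where `get_enabled` is then called on a null pointer. One possible way to tolerate callers that have no charset (such as the training tools) is a null guard that treats a missing charset as "everything enabled". This is only a sketch with a stand-in `Charset` type and a hypothetical `IsEnabled` helper; it is not necessarily how the PR resolved the crash:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for UNICHARSET, reduced to the enabled flags (an assumption).
struct Charset {
  std::vector<bool> enabled;
  bool get_enabled(int unichar_id) const {
    assert(unichar_id >= 0 &&
           static_cast<std::size_t>(unichar_id) < enabled.size());
    return enabled[unichar_id];
  }
};

// Hypothetical guard: callers without a charset (charset == nullptr)
// get the old behaviour, where no character is filtered out.
bool IsEnabled(const Charset* charset, int unichar_id) {
  return charset == nullptr || charset->get_enabled(unichar_id);
}
```

The short-circuit `||` guarantees that `get_enabled` is never reached when the pointer is null.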

@Shreeshrii
Collaborator

make check also fails on certain unit tests.

PASS: qrsequence_test
../config/test-driver: line 107: 31583 Segmentation fault (core dumped) "$@" > $log_file 2>&1
FAIL: recodebeam_test
PASS: rect_test
PASS: resultiterator_test
PASS: shapetable_test
PASS: stats_test
PASS: stringrenderer_test
PASS: tablefind_test
PASS: tablerecog_test
PASS: tabvector_test
PASS: tfile_test
PASS: commandlineflags_test
../config/test-driver: line 107: 31801 Segmentation fault (core dumped) "$@" > $log_file 2>&1
FAIL: lstm_recode_test
../config/test-driver: line 107: 31825 Segmentation fault (core dumped) "$@" > $log_file 2>&1
FAIL: lstm_squashed_test
../config/test-driver: line 107: 31849 Segmentation fault (core dumped) "$@" > $log_file 2>&1
FAIL: lstm_test
PASS: lstmtrainer_test

@stweil
Member

stweil commented Mar 8, 2019

> @stweil, do you want me to squash the three commits into one (with a better changelog) on this PR?

@bertsky, first of all thank you for addressing this issue with your pull request. We can decide about squashing as soon as the changes are ready for the production code. Then squashing can optionally be done either by you or by the maintainer who merges the pull request.

I hope that I'll have finished some other work on Tesseract soon, then I can help with fixing the problems reported above by @Shreeshrii.

@bertsky
Contributor Author

bertsky commented Mar 8, 2019

@Shreeshrii thanks a lot! I will look into it. As for the lstmeval test, will I have to go through tesstutorial on the wiki for that?

I was not aware that make check is supposed to work at all. (Also I think it should be named make test as for most automake projects.) The problem is, it does not find the googletest submodule and thus fails to build for me. Is this a missing dependency? (googletest/googletest/src is empty.) And is this related to https://github.com/tesseract-ocr/test perhaps?

@Shreeshrii
Collaborator

Try git submodule update --init --recursive to include googletest, test and abseil submodules when you build tesseract.

make check works, but it has many dependencies: it needs all three model repos (tessdata, tessdata_best and tessdata_fast) plus the test repo for the test data files.

Regarding LSTM training - yes, tesstutorial is the way to go, though most people complain about it being difficult to follow. As you go through it, please note/make any changes that can be made to improve the wiki.

@zdenop It has been suggested to have a separate repo for tesstutorial with all required dependencies in it. I can upload the files for it if you create a new repo.

@Shreeshrii
Collaborator

tesstutorial-eng.sh.txt

@bertsky Please see the attached file (bash script) which has the commands for tesstutorial (for the first few types of training). You can use it as a base, modify for your environment and test.

@stweil
Member

stweil commented Mar 8, 2019

@bertsky, make check also expects clones of ~~traineddata~~ tessdata, ~~traineddata_best~~ tessdata_best and ~~traineddata_fast~~ tessdata_fast parallel to the clone of tesseract. Make sure that you have some GB of disk space available for that.

@bertsky
Contributor Author

bertsky commented Mar 8, 2019

@stweil, are you sure check depends on them (not the tesstutorial)? And didn't you mean tessdata, tessdata_best and tessdata_fast? And where would they have to be placed?

@bertsky
Contributor Author

bertsky commented Mar 8, 2019

@Shreeshrii, thanks, that worked (after removing the existing directories, which had been filled with dependencies by configure already, and then re-instating them after cloning). I can see the segfaults now. I will address them first, maybe it's actually the same problem for lstmeval.

I think I should create another PR with some minimal notes in the dev section of CONTRIBUTING.md after this.

@bertsky
Contributor Author

bertsky commented Mar 8, 2019

Oh, right, just found where check expects the other repos. That's not such a nice way of maintaining dependencies! Wouldn't it be better to add them as submodules, too? Or at least provide some option to configure for where they are located?

@stweil
Member

stweil commented Mar 8, 2019

Yes, sorry, the names were wrong. I fixed my comment now. The directories with the model files are required for the tests run with make check. The directories tessdata, ... are expected at the same directory level as the tesseract source tree, as in this example:

/home/debian/src/github/tesseract-ocr/langdata
/home/debian/src/github/tesseract-ocr/langdata_lstm
/home/debian/src/github/tesseract-ocr/tessdata
/home/debian/src/github/tesseract-ocr/tessdata_best
/home/debian/src/github/tesseract-ocr/tessdata_fast
/home/debian/src/github/tesseract-ocr/tesseract # (with submodules abseil, googletest and test)

Maybe langdata and langdata_lstm are needed, too.

@bertsky
Contributor Author

bertsky commented Mar 8, 2019

Why not integrate them as submodules overtly, just like test, googletest, and abseil?

@stweil
Member

stweil commented Mar 8, 2019

The answer is simple: 8 GB. That's a lot of disk space on my notebook which I cannot afford. The model files are really large, and their git repository is even larger.

@Shreeshrii
Collaborator

Shreeshrii commented Mar 21, 2019

> It thus appears that README, tesstutorial and wiki are all failing to mention that lstm.train must be put there.

Won't the makefile here take care of that?
https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/configs/Makefile.am

I agree that the documentation can be further improved.

@bertsky
Contributor Author

bertsky commented Mar 21, 2019

> Won't the makefile here take care of that?

You are right, make install should copy the file along with the others into TESSDATA_PREFIX.

The problem is that tesstutorial uses a clone of tessdata_best via --tessdata_dir, and README a clone of tessdata, which both obviously have no config files. And the wiki does not mention lstm.train either.

@Shreeshrii
Collaborator

> The problem is that tesstutorial uses a clone of tessdata_best via --tessdata_dir, and README a clone of tessdata, which both obviously have no config files. And the wiki does not mention lstm.train either.

In my local clones of the tessdata_best, tessdata_fast and tessdata repos, I have copied the configs and other files from tesseract/tessdata, hence the problems were not obvious to me. Thanks for bringing it to the fore.

@Shreeshrii
Collaborator

Shreeshrii commented Mar 21, 2019

I think that the PR can be merged.
cc: @stweil @zdenop @amitdo

EDIT: Seems it is not working in all cases. See next comment.

@Shreeshrii
Collaborator

@bertsky Please see issue #2159

With tesseract digits.png stdout --psm 6 --dpi 300 -c lstm_choice_mode=2 hocr I get the correct value 1 as one of the choices - choice_1_1_4.

       <span class='ocrx_cinfo' id='lstm_choices_1_1_1'>
       <span class='ocr_glyph' id='choice_1_1_1' title='x_confs 53'>]</span>
       <span class='ocr_glyph' id='choice_1_1_2' title='x_confs 14'>i</span>
       <span class='ocr_glyph' id='choice_1_1_3' title='x_confs 11'>j</span>
       <span class='ocr_glyph' id='choice_1_1_4' title='x_confs 5'>1</span>
       <span class='ocr_glyph' id='choice_1_1_5' title='x_confs 4'>l</span>
       <span class='ocr_glyph' id='choice_1_1_6' title='x_confs 3'>J</span>
       <span class='ocr_glyph' id='choice_1_1_7' title='x_confs 2'>|</span>
       <span class='ocr_glyph' id='choice_1_1_8' title='x_confs 1'>}</span>
       <span class='ocr_glyph' id='choice_1_1_9' title='x_confs 1'>)</span>
       <span class='ocr_glyph' id='choice_1_1_10' title='x_confs 0'>.</span>
       <span class='ocr_glyph' id='choice_1_1_11' title='x_confs 0'>I</span>

But using tesseract digits.png stdout --psm 6 --dpi 300 -c lstm_choice_mode=2 -c tessedit_char_whitelist="0123456789" hocr OR tesseract digits.png stdout --psm 6 --dpi 300 -c lstm_choice_mode=2 hocr digits OR tesseract digits.png stdout --psm 6 --dpi 300 -c tessedit_char_whitelist="0123456789" hocr gives NO result whatsoever.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 4.1.0-rc1-131-g632a2' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "digits.png"; bbox 0 0 9 13; ppageno 0'>
  </div>
 </body>
</html>

@bertsky
Contributor Author

bertsky commented Mar 23, 2019

> Please see issue #2159

@Shreeshrii Thanks for pointing that out; I have looked at it now. Essentially, if only digits are allowed, no alternatives besides null appear in the beam, because it is too narrow. So whitelisting itself does work correctly, but the beam narrowness issue is wider. We have seen it surface with user patterns, too. And of course it applies to lattice output.
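
The narrowness effect can be illustrated with the confidences from the hOCR output quoted above. The sketch assumes that candidates are pruned to the k best before the whitelist is applied; `PruneThenFilter` and `k` are illustrative names, and the real beam search is considerably more involved:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Candidate = std::pair<char, int>;  // (glyph, confidence)

// Hypothetical illustration: prune to the k best candidates first,
// then drop everything that is not on the whitelist.
std::vector<Candidate> PruneThenFilter(std::vector<Candidate> cands,
                                       std::size_t k,
                                       const std::string& whitelist) {
  assert(k > 0);
  std::stable_sort(cands.begin(), cands.end(),
                   [](const Candidate& a, const Candidate& b) {
                     return a.second > b.second;
                   });
  if (cands.size() > k) cands.resize(k);  // the narrow beam
  std::vector<Candidate> kept;
  for (const Candidate& c : cands) {
    if (whitelist.find(c.first) != std::string::npos) kept.push_back(c);
  }
  return kept;
}
```

With the candidates `]` (53), `i` (14), `j` (11), `1` (5) and `l` (4) and a digits-only whitelist, k = 3 leaves nothing, while k = 5 keeps `1`.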

BTW, the new lstm_choice_mode you are using here together with hocr output does not at all give correct data (both symbols and scores). It ignores incremental character encoding (UnicharCompress) and duplicate coding (NodeContinuation). Moreover, in setting 1 it bypasses CTC altogether (so the user would have to do the decoding), and in settings 2 and 3 it merely yields the alternative character outputs along the best CTC path – without actually aggregating scores for the other paths (so only the best scores can be trusted).
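
For readers unfamiliar with what "doing the decoding" involves, here is a minimal sketch of greedy (best-path) CTC collapsing: merge runs of the same label, then drop the null label. `CtcCollapse` and `null_label` are names chosen for this sketch, and it deliberately ignores the recoder's multi-code characters:

```cpp
#include <vector>

// Greedy (best-path) CTC collapse: merge runs of the same label,
// then drop the null label. `null_label` is whatever index the model
// reserves for "no character" (an assumption of this sketch).
std::vector<int> CtcCollapse(const std::vector<int>& best_labels,
                             int null_label) {
  std::vector<int> out;
  int prev = null_label;
  for (int label : best_labels) {
    if (label != prev && label != null_label) out.push_back(label);
    prev = label;
  }
  return out;
}
```

This follows only the single best label per timestep. It does not sum the scores of other paths that collapse to the same text, which is the limitation pointed out above for settings 2 and 3.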

Should I open a new issue about beam input narrowness, or can we discuss that in #2339?

@Shreeshrii
Collaborator

> the new lstm_choice_mode you are using here together with hocr output does not at all give correct data (both symbols and scores).

@bertsky You should bring it to the attention of @noahmetzger, @stweil, @amitdo. I do not know enough about tesseract internals to provide useful input.

> settings 2 and 3 it merely yields the alternative character outputs along the best CTC path

For using a whitelist it may be enough to filter the alternative character outputs along the best CTC path. It might mean using a different algorithm rather than the one via the beam search.

@bertsky
Contributor Author

bertsky commented Apr 1, 2019

> You should bring it to the attention of @noahmetzger, @stweil, @amitdo. I do not know enough about tesseract internals to provide useful input.

I did that already, we have been in touch via email and phone.

I am going to prepare a special issue (to be referenced by the 4.1 planning RFC and planning wiki) focussing on the recurring big problems of the LSTMs: beam narrowness, confidence incompatibility/incommensurability with pre-LSTM models, and CTC diplopia. Hopefully, we can get some good discussion about how to go about these.

> For using a whitelist it may be enough to filter the alternative character outputs along the best CTC path. It might mean using a different algorithm rather than the one via the beam search.

Maybe you are right. But it would not work for choice iterator with correct confidences and for user patterns.

@zdenop
Contributor

zdenop commented Apr 5, 2019

@bertsky @Shreeshrii: so what is the status of this PR?

@Shreeshrii
Collaborator

@zdenop In my opinion, this can be merged because it does improve the results compared to current code. However, the related issues can still be left open since other changes are needed to fully fix the issue.

bertsky added 4 commits April 6, 2019 08:13
- with `set -e` in effect, it does not make sense
  to query `$?` indirectly
- with `set -e` in effect, looking at stdout
  to detect failure is too late
@stweil
Member

stweil commented Apr 6, 2019

The pull request includes unrelated modifications for CONTRIBUTING.md and tesstrain_utils.sh. What about those changes, especially the 2nd one (is it still needed)?

@stweil stweil force-pushed the lstm-with-char-whitelist branch from 2e5c98c to f80508b Compare April 6, 2019 06:38
@ghost ghost assigned stweil Apr 6, 2019
@ghost ghost added the review label Apr 6, 2019
@stweil
Member

stweil commented Apr 6, 2019

I just fixed the merge conflicts in this pull request.

@bertsky
Contributor Author

bertsky commented Apr 6, 2019

@stweil, thanks for wrapping up the changesets.

> The pull request includes unrelated modifications for CONTRIBUTING.md and tesstrain_utils.sh. What about those changes, especially the 2nd one (is it still needed)?

These are fallout from attempting to fix unittests and lstmtutorial along with validating this PR, which I added by request (see conversation above). Yes, they are still needed, but could also be split off.

@zdenop zdenop merged commit ab09b09 into tesseract-ocr:master Apr 6, 2019
@ghost ghost removed the review label Apr 6, 2019
@zdenop
Contributor

zdenop commented Apr 6, 2019

thanks to all participants!

5 participants