-
Notifications
You must be signed in to change notification settings - Fork 9.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
trying to add tessedit_char_whitelist etc. again: #2294
Conversation
- ignore matrix outputs in ComputeTopN if they belong to a disabled unichar_id - pass UNICHARSET refs to check that - in SetBlackAndWhitelist, also update the unicharset of the lstm_recognizer_ instance, if any
|
@zdenop, I will add that to the PR of course. It should be ready by tomorrow. |
So are there no objections to the way I use |
It works with sublangs now. But I just saw it does not work with Chinese characters yet. So I guess my |
Regarding recoder, please see the unittest This is used in https://github.com/tesseract-ocr/tesseract/blob/master/unittest/lstm_recode_test.cc#L38 |
Please also test that |
Thanks, @Shreeshrii, I will. |
Maybe you should remove the recoding part until you figure out how to make it work? |
Well it does work for simple characters, but not for things like CJK. And I cannot avoid using it – this is the component mediating between unicharset ids and output channels. But I am positive I will understand it by looking at the unittests and training tools. |
The output of unicharcompress/recoder can be seen by the following:
Using 1 page of training_text input, it creates a unicharset of 659 which is compressed to 111 using radical_stroke.txt.
|
- move decision from ComputeTopN to ContinueContext, where it belongs: block context continuations which emit final codes translating to disabled unichar_ids. (The normal logic for fallback from top2 > top2 > rest will apply.) - pass UNICHARSET refs appropriately
Found it. I believe I do understand now. It does work for Chinese characters, too. I am pretty sure it works for all other cases as well. @Shreeshrii, I very much doubt that training will be affected, as this is a runtime variable. Also, I have never done training before. What would be required? The lstmtraining tutorial of the training page in the wiki? How do I tell it still works after the process? |
@stweil, do you want me to squash the three commits into one (with a better changelog) on this PR? |
@bertsky I applied the PR to my local installation and I am getting the following error with lstmeval. recode beam search is used by training programs also and so some more changes may be required for that.
|
PASS: qrsequence_test |
@bertsky, first of all thank you for addressing this issue with your pull request. We can decide about squashing as soon as the changes are ready for the production code. Then squashing can optionally be done either by you or by the maintainer who merges the pull request. I hope that I'll have finished some other work on Tesseract soon, then I can help with fixing the problems reported above by @Shreeshrii. |
@Shreeshrii thanks a lot! I will look into it. As for the lstmeval test, will I have to go through tesstutorial on the wiki for that? I was not aware that |
Try
Regarding @zdenop It has been suggested to have a separate repo for |
@bertsky Please see the attached file (bash script) which has the commands for tesstutorial (for the first few types of training). You can use it as a base, modify for your environment and test. |
@bertsky, |
@stweil, are you sure |
@Shreeshrii, thanks, that worked (after removing the existing directories, which had been filled with dependencies by configure already, and then re-instating them after cloning). I can see the segfaults now. I will address them first, maybe it's actually the same problem for lstmeval. I think I should create another PR with some minimal notes in the dev section of |
Oh, right, just found where check expects the other repos. That's not such a nice way of maintaining dependencies! Wouldn't it be better to add them as submodules, too? Or at least provide some option to configure for where they are located? |
Yes, sorry, the names were wrong. I fixed my comment now. The directory with the model files are required for the tests run with
Maybe |
Why not integrate them as submodules overtly, just like |
The answer is simple: 8 GB. That's a lot of disk space on my notebook which I cannot afford. The model files are really large, and their git repository is even larger. |
Won't the makefile here take care of that? I agree that the documentation can be further improved. |
You are right, The problem is that tesstutorial uses a clone of |
In my local clones of tessdata_best, tessdata_fast and tessdata repos, I have copied the configs and other files from tesseract/tessdata, hence the problems are not obvious to me. Thanks for bringing it to fore. |
@bertsky Please see issue #2159 Wiith
But using
|
@Shreeshrii Thanks for pointing out, I now have. Essentially, if only digits are allowed, no alternatives besides null appear in the beam, because it is too narrow. So whitelisting itself does work correctly, but the latter issue is wider. We have seen it surface with user patterns, too. And of course it applies for lattice output. BTW, the new Should I open a new issue about beam input narrowness, or can we discuss that in #2339? |
@bertsky You should bring it to the attention of @noahmetzger, @stweil, @amitdo. I do not know enough about tesseract internals to provide useful input.
For using |
I did that already, we have been in touch via email and phone. I am going to prepare a special issue (to be referenced by the 4.1 planning RFC and planning wiki) focussing on the recurring big problems of the LSTMs: beam narrowness, confidence incompatibility/incommensurability with pre-LSTM models, and CTC diplopia. Hopefully, we can get some good discussion about how to go about these.
Maybe you are right. But it would not work for choice iterator with correct confidences and for user patterns. |
@bertsky @Shreeshrii : so what is status of this PR? |
@zdenop In my opinion, this can be merged because it does improve the results compared to current code. However, the related issues can still be left open since other changes are needed to fully fix the issue. |
- with `set -e` in effect, it does not make sense to query `$?` indirectly
- with `set -e` in effect, looking at stdout to detect failure is too late
The pull request includes unrelated modifications for |
2e5c98c
to
f80508b
Compare
I just fixed the merge conflicts in this pull request. |
@stweil, thanks for wrapping up the changesets.
These are fallout from attempting to fix unittests and lstmtutorial along with validating this PR, which I added by request (see conversation above). Yes, they are still needed, but could also be split off. |
thanks to all participants! |
belong to a disabled unichar_id
of the lstm_recognizer_ instance, if any