Failed to load any lstm-specific dictionaries for lang xxx #28
Comments
Have you placed your model data in the tessdata directory?
xxx means tes here; it's a model I trained with OCRD-train.
In addition, I find that the generated traineddata seems to be missing some dawg files. How can I add them?
In addition to the charset, traineddata files can also contain information on punctuation, word lists etc., see https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#40000alpha-lstm-only-format. We don't currently support those. Tesseract tries to load the dictionaries, fails, but still continues the recognition (https://digi.bib.uni-mannheim.de/tesseract/doc/tesseract-ocr.github.io/4.00.00dev/a01046_source.html#l00131). @wrznr wontfix or helpwanted?
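As a quick check, combine_tessdata can list which components a traineddata file actually contains, so you can see whether any dawg entries are present at all (the path below is only an example):

```sh
# List the components packed into a traineddata file.
# Entries such as lstm-word-dawg, lstm-punc-dawg and lstm-number-dawg
# only appear if the file was built with the corresponding lists.
combine_tessdata -d /usr/share/tessdata/eng.traineddata
```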
Neither! I guess we can fix this (i.e. support dictionaries). https://github.com/paalberti/tesseract-dan-fraktur/blob/master/deu_frak/buildscript.sh is a good starting point for augmenting the Makefile.
Does this affect the recognition in any way if I don't have these dawg files and the other files?
Of course, the recognition is heavily influenced by the existence (or non-existence) of dictionaries. Is this influence necessarily positive? I do not think so. Using dictionaries as hypotheses in text recognition bears the risk of introducing false positives (e.g. the German city Rust might be returned as the noun Rost if it is not in the dictionary). However, they were very important for the old, character-focused recognizer (tesseract version < 4) since they provided the necessary context for the single characters. With the line-focused (lstm) approach, context information is implicitly provided by the model. Btw., I am not aware of any systematic evaluation of dictionary usage in OCR.
While finetuning, I usually rebuild the starter traineddata file and include the wordlists at that time so that they can be used for building the dawg files. Here is the section of bash script from a recent run for Arabic.
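The script itself did not survive in this copy of the thread; a minimal sketch of the kind of rebuild described above, with all paths, directories and list-file names being assumptions, might look like this:

```sh
#!/bin/bash
# Sketch: rebuild a starter traineddata so that wordlist/numbers/punc
# files are compiled into dawgs. All paths here are illustrative.
Lang=ara
LangData=~/langdata            # checkout of tesseract-ocr/langdata
TessData=~/tessdata_best       # contains $Lang.traineddata

# Unpack the existing traineddata to obtain its lstm-unicharset.
combine_tessdata -u $TessData/$Lang.traineddata $Lang.

# Recombine, this time passing the lists so that the dawgs are built.
combine_lang_model \
  --input_unicharset $Lang.lstm-unicharset \
  --script_dir $LangData \
  --words   $LangData/$Lang/$Lang.wordlist \
  --numbers $LangData/$Lang/$Lang.numbers \
  --puncs   $LangData/$Lang/$Lang.punc \
  --lang_is_rtl \
  --output_dir ./starter \
  --lang $Lang
```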
Hi everyone, I have the same issue.
If you need word/number/punctuation lists and have them available, you could adapt https://github.com/OCR-D/ocrd-train/blob/master/Makefile#L123-L127. See @Shreeshrii's sample code above.
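For illustration only, if the Makefile were extended to accept the list files as variables (the variable names below are hypothetical, not existing options of ocrd-train), a training run could then pass them like this:

```sh
# Hypothetical invocation of an adapted Makefile that forwards the list
# files to combine_lang_model; the WORDLIST_FILE/NUMBERS_FILE/PUNC_FILE
# variable names are made up for this sketch.
make training MODEL_NAME=xxx \
  WORDLIST_FILE=data/xxx.wordlist \
  NUMBERS_FILE=data/xxx.numbers \
  PUNC_FILE=data/xxx.punc
```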
Thank you very much for your fantastic work on OCRD-train, I really appreciate it. I have been doing Tesseract OCR training recently and OCRD-train has helped a lot. For the starter traineddata (e.g. eng.traineddata in my case), I have all of these files ready (lstm-punc-dawg, lstm-word-dawg, lstm-number-dawg), but it seems they do not end up in my trained model. Any suggestions or hints would be greatly appreciated.
As mentioned in #28 (comment), you need to add the following to the above, with paths to where your wordlist, punc and numbers files are.
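The snippet from the original comment is not reproduced in this copy of the thread; the additional arguments would be along these lines (the paths are placeholders to be replaced with your own files):

```sh
# Arguments to append to the combine_lang_model call referenced above;
# point them at your own wordlist, punctuation and numbers files.
  --words   /path/to/xxx.wordlist \
  --numbers /path/to/xxx.numbers \
  --puncs   /path/to/xxx.punc
```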
The reason I had a different BaseLang and Lang was that Arabic.traineddata had better recognition for …
It's working now. Thank you very much!
I ran into this problem after training a model with OCRD-train. In the terminal I ran:

tesseract 5.2.tif output --psm 7 -l xxx

and got this message:

Failed to load any lstm-specific dictionaries for lang xxx

Can anyone help?