Failed to load any lstm-specific dictionaries for lang xxx #28
Comments
Have you placed your model data in the tessdata directory?
xxx means tes here; it's a model I trained with OCRD-train.
In addition, I find that the generated traineddata seems to be missing some dawg files. How can I add them?
In addition to the charset, traineddata files can also contain information on punctuation, word lists etc., see https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#40000alpha-lstm-only-format. We don't currently support those. Tesseract tries to load the dictionaries, fails, but still continues the recognition (https://digi.bib.uni-mannheim.de/tesseract/doc/tesseract-ocr.github.io/4.00.00dev/a01046_source.html#l00131). @wrznr wontfix or helpwanted?
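As a quick check, combine_tessdata can list which components a traineddata file actually contains, so you can see whether any dawg entries are present at all (the path below is only an example):

```sh
# List the components packed into a traineddata file.
# Entries such as lstm-word-dawg, lstm-punc-dawg and lstm-number-dawg
# only appear if the file was built with the corresponding lists.
combine_tessdata -d /usr/share/tessdata/eng.traineddata
```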
Neither! I guess we can fix this (i.e. support dictionaries). https://github.com/paalberti/tesseract-dan-fraktur/blob/master/deu_frak/buildscript.sh is a good starting point for augmenting the Makefile.
Does this affect the recognition in any way if I don't have these dawg files and the other files?
Of course, the recognition is heavily influenced by the existence (or non-existence) of dictionaries. Is this influence necessarily positive? I do not think so. Using dictionaries as hypotheses in text recognition bears the risk of introducing false positives (e.g. the German city Rust might be returned as the noun Rost if it is not in the dictionary). However, they were very important for the old, character-focused recognizer (tesseract version < 4) since they provided the necessary context for the single characters. With the line-focused (lstm) approach, context information is implicitly provided by the model. Btw., I am not aware of any systematic evaluation of dictionary usage in OCR.
While finetuning, I usually rebuild the starter traineddata file and include the wordlists at that time so that they can be used for building the dawg files. Here is the section of bash script from a recent run for Arabic.
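The script itself did not survive in this copy of the thread; a minimal sketch of the kind of rebuild described above, with all paths, directories and list-file names being assumptions, might look like this:

```sh
#!/bin/bash
# Sketch: rebuild a starter traineddata so that wordlist/numbers/punc
# files are compiled into dawgs. All paths here are illustrative.
Lang=ara
LangData=~/langdata            # checkout of tesseract-ocr/langdata
TessData=~/tessdata_best       # contains $Lang.traineddata

# Unpack the existing traineddata to obtain its lstm-unicharset.
combine_tessdata -u $TessData/$Lang.traineddata $Lang.

# Recombine, this time passing the lists so that the dawgs are built.
combine_lang_model \
  --input_unicharset $Lang.lstm-unicharset \
  --script_dir $LangData \
  --words   $LangData/$Lang/$Lang.wordlist \
  --numbers $LangData/$Lang/$Lang.numbers \
  --puncs   $LangData/$Lang/$Lang.punc \
  --lang_is_rtl \
  --output_dir ./starter \
  --lang $Lang
```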
Hi everyone, I have the same issue.
If you need word/number/punctuation lists and have them available, you could adapt https://github.com/OCR-D/ocrd-train/blob/master/Makefile#L123-L127. See @Shreeshrii's sample code above.
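For illustration only, if the Makefile were extended to accept the list files as variables (the variable names below are hypothetical, not existing options of ocrd-train), a training run could then pass them like this:

```sh
# Hypothetical invocation of an adapted Makefile that forwards the list
# files to combine_lang_model; the WORDLIST_FILE/NUMBERS_FILE/PUNC_FILE
# variable names are made up for this sketch.
make training MODEL_NAME=xxx \
  WORDLIST_FILE=data/xxx.wordlist \
  NUMBERS_FILE=data/xxx.numbers \
  PUNC_FILE=data/xxx.punc
```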
Thank you very much for your fantastic work on OCRD-train, I really appreciate it. I have been doing Tesseract OCR training recently and OCRD-train has helped a lot. For the starter traineddata (e.g. eng.traineddata in my case), I have all of these files ready (lstm-punc-dawg, lstm-word-dawg, lstm-number-dawg), but it seems they do not end up in my trained model. Any suggestions or hints would be greatly appreciated.
As mentioned in #28 (comment), you need to add the following to the above, with paths to where your wordlist, punc and numbers files are.
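The snippet from the original comment is not reproduced in this copy of the thread; the additional arguments would be along these lines (the paths are placeholders to be replaced with your own files):

```sh
# Arguments to append to the combine_lang_model call referenced above;
# point them at your own wordlist, punctuation and numbers files.
  --words   /path/to/xxx.wordlist \
  --numbers /path/to/xxx.numbers \
  --puncs   /path/to/xxx.punc
```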
The reason I had a different BaseLang and Lang was that Arabic.traineddata had better recognition for …
It's working now. Thank you very much!
I ran into this problem after training a model with OCRD-train. In the terminal I ran:

tesseract 5.2.tif output --psm 7 -l xxx

and got this message:

Failed to load any lstm-specific dictionaries for lang xxx

Can anyone help?