Failed Loading Language/Cannot find LSTM-specific dictionaries #155

Closed
TheSYNcoder opened this issue Apr 6, 2020 · 4 comments
Labels
question (Further information is requested), stale (Issues which require input by the reporter which is not provided)

Comments

@TheSYNcoder

I have been training a sample model, TESS, using tesstrain, and the training went fine.
However, after training, when I move /data/TESS/TESS.traineddata to /usr/local/share and run tesseract image.tif out -l TESS,
I get the following error:

Error: Tesseract (legacy) engine requested, but components are not present in /usr/local/share/tessdata/TESS.traineddata!!
Failed loading language 'TESS'

On the other hand, when I move /data/TESS.traineddata instead, I get the following error when running the same command:

Failed to load any lstm-specific dictionaries for lang TESS!!

Am I doing something wrong after the training? Can anyone please help? In case it helps, here is my tesseract version:

tesseract 5.0.0-alpha-648-gcdebe
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found OpenMP 201511
 Found libcurl/7.58.0 OpenSSL/1.1.1 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
@Shreeshrii
Collaborator

data/TESS/TESS.traineddata is the starter traineddata created with the unicharset from the training text. Its size should be small. It can't be used for recognition.

data/TESS.traineddata is the traineddata after training. If you didn't have a wordlist, you will get a warning about a missing dictionary.

Check the timestamps and file sizes. The larger and later file will be your traineddata file.
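The size check can be sketched in shell as below. This is only an illustration: the function name larger_file is hypothetical, and the paths in the trailing comments are the ones mentioned in this issue, so adjust them to your own checkout.

```shell
# Hypothetical helper: print whichever of two files is larger in bytes.
# The starter traineddata should be the small one, the trained model the large one.
larger_file() (
    size_a=$(wc -c < "$1" | tr -d ' ')
    size_b=$(wc -c < "$2" | tr -d ' ')
    if [ "$size_a" -ge "$size_b" ]; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "$2"
    fi
)

# Example (paths as used in this issue):
#   ls -l data/TESS/TESS.traineddata data/TESS.traineddata
#   larger_file data/TESS/TESS.traineddata data/TESS.traineddata
# combine_tessdata -d <file> should also list the components packed into a
# traineddata file, which can help tell the starter from the trained model.
```

ls -l alone is enough for a one-off check; the function is only convenient when scripting the copy step.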

@livezingy

@TheSYNcoder

  1. /data/TESS.traineddata should be the right one.
  2. The reason for the error: Failed to load any lstm-specific dictionaries for lang TESS!!
    Please refer to: Failed to load any lstm-specific dictionaries for lang xxx

Although WORDLIST_FILE/NUMBERS_FILE/PUNC_FILE are optional in the makefile, the traineddata can also contain information on punctuation, word lists, etc. from training. If these files are missing, the trained traineddata gives this error when it is loaded.

  3. You could try the following steps to solve it:

3.1 Find the WORDLIST_FILE/NUMBERS_FILE/PUNC_FILE in the makefile, and change them to:

WORDLIST_FILE := data/$(MODEL_NAME).wordlist
NUMBERS_FILE := data/$(MODEL_NAME).numbers
PUNC_FILE := data/$(MODEL_NAME).punc

3.2 Suppose your base traineddata is eng.traineddata, or your language is English.
Download the .wordlist/.numbers/.punc files from tesseract-ocr/langdata_lstm/eng, rename them to TESS.wordlist, TESS.numbers, and TESS.punc, then place them in /data/.

3.3 Run make training again.
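Steps 3.2 and 3.3 could be scripted roughly as follows. This is a sketch under assumptions: the helper name stage_dict_files is made up for illustration, it expects the langdata_lstm files to be downloaded locally already, and it assumes they are named eng.wordlist, eng.numbers and eng.punc.

```shell
# Hypothetical helper: copy <src_lang>.wordlist/.numbers/.punc from a local
# copy of langdata_lstm into the tesstrain data directory, renamed for the model.
stage_dict_files() (
    src_dir="$1"    # e.g. a local copy of tesseract-ocr/langdata_lstm/eng
    src_lang="$2"   # e.g. eng
    model="$3"      # e.g. TESS
    dest_dir="$4"   # e.g. data
    mkdir -p "$dest_dir" || exit 1
    for ext in wordlist numbers punc; do
        # Stop at the first missing file so a partial set is not used silently.
        cp "$src_dir/$src_lang.$ext" "$dest_dir/$model.$ext" || exit 1
    done
)

# Example (assumed paths, with the makefile variables changed as in step 3.1):
#   stage_dict_files langdata_lstm/eng eng TESS data
#   make training MODEL_NAME=TESS
```

The function returns a non-zero status if any of the three files is missing, so the make step can be chained with && after it.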

@Shreeshrii
I think there may be a bug concerning WORDLIST_FILE/NUMBERS_FILE/PUNC_FILE in the makefile.

In tesstrain, the default path of WORDLIST_FILE/NUMBERS_FILE/PUNC_FILE is $(OUTPUT_DIR) = data/$(MODEL_NAME), and all files in this path are generated automatically during the training process.

If the variable START_MODEL is not assigned, the makefile will not generate any related files under this path.

If START_MODEL has been assigned, foo.lstm-number-dawg, foo.lstm-punc-dawg, foo.lstm-word-dawg and so on will be produced in data/$(MODEL_NAME). But they are not the files the traineddata needs; it needs the .wordlist/.numbers/.punc files. So there may be a bug in the tesstrain makefile.

Am I right?

@wrznr
Collaborator

wrznr commented May 7, 2020

@TheSYNcoder Please move TESS.traineddata to /usr/local/share/tessdata/ (as indicated by the error message).

It is safe to ignore the message Failed to load any lstm-specific dictionaries for lang TESS!! Dictionaries are an optional addition to tesseract models. Personally, I never use them when training my own models. I do not see any benefits.
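For reference, the install step might look like the sketch below. The helper name install_traineddata is hypothetical; the tessdata path in the comments is the one from the error message in this issue, and writing there may require sudo on your system.

```shell
# Hypothetical helper: copy a traineddata file into a tessdata directory,
# creating the directory if it does not exist yet.
install_traineddata() (
    src="$1"            # e.g. data/TESS.traineddata
    tessdata_dir="$2"   # e.g. /usr/local/share/tessdata
    mkdir -p "$tessdata_dir" || exit 1
    cp "$src" "$tessdata_dir/" || exit 1
)

# Example (paths from this issue):
#   install_traineddata data/TESS.traineddata /usr/local/share/tessdata
#   tesseract --list-langs             # TESS should now be listed
#   tesseract image.tif out -l TESS
```

Note that the target is the tessdata subdirectory, not /usr/local/share itself.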

@wrznr wrznr added the question Further information is requested label May 7, 2020
@stale

stale bot commented Jun 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Issues which require input by the reporter which is not provided label Jun 6, 2020
@stale stale bot closed this as completed Jun 13, 2020