Simple monospace text not correctly interpreted #2820

Open
ahri opened this issue Dec 16, 2019 · 6 comments

ahri commented Dec 16, 2019

Environment

  • Tesseract Version: 4.1.0
  • Platform: OSX High Sierra, Linux Ubuntu Eoan

Current Behavior:

Config file:

tessedit_char_whitelist 0123456789abcdef

Image:

[image: avatar]

Command-line usage:

$ tesseract avatar.png - --psm 7 --oem 1 tess_config
eea6d7bbe81

Expected Behavior:

eea6d7bbe581

i.e.

eea6d7bbe81 (incorrect)
vs.
eea6d7bbe581 (correct)

To elaborate: I would like to detect only a single hex number rendered in monospace in a PNG.

Author

ahri commented Dec 16, 2019

It looks like it's detecting the text as eea6d7bbeS81 and then stripping the S as it doesn't match the whitelist. I expected whitelisting to constrain the problem space and therefore make it easier for Tesseract to read, but this is clearly not the case.
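The suspected behaviour can be reproduced outside Tesseract with a plain character filter. This is only a sketch of the hypothesis in the comment above, not Tesseract's actual code path; the function name `strip_non_whitelisted` is made up for illustration.

```python
# Hypothetical illustration: filter an already-recognized string against a whitelist.
WHITELIST = set("0123456789abcdef")

def strip_non_whitelisted(text, allowed=WHITELIST):
    """Drop every character that is not in the allowed set."""
    return "".join(ch for ch in text if ch in allowed)

raw_guess = "eea6d7bbeS81"  # the raw recognition suspected in the comment above
print(strip_non_whitelisted(raw_guess))  # eea6d7bbe81
```

Filtering after the fact removes the `S` instead of replacing it with `5`, which matches the output reported in this issue.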

@woodjohndavid

I am also trying to use Tesseract to OCR random strings of mixed letters and numbers, and I have the same general problem as you describe: Tesseract mixes up 'S' and '5', and also '1' and 'I'.

Tesseract is primarily designed to recognize words, and it decides which characters are present partly by what would make the word valid. So it doesn't naturally deal well with non-word strings.

The only suggestion I have is the following list of config file parameters that I am using to try to prevent Tesseract from using the word-matching method and instead just use a character by character recognition approach:

tessedit_flip_0O 0
load_system_dawg 0
load_freq_dawg 0
language_model_min_compound_length 1
language_model_penalty_increment 0.0
language_model_penalty_punc 0.0
language_model_penalty_spacing 0.0
language_model_penalty_script 0.0
language_model_penalty_non_dict_word 0.0
language_model_penalty_chartype 0.0
language_model_penalty_case 0.0
language_model_penalty_non_freq_dict_word 0.0

To be honest, I don't even know if this makes any difference, or whether the LSTM engine (which I am using) pays attention to these settings.

@ryanleonbutler

Hi Tesseract-ocr Team,

I am facing similar challenges as @woodjohndavid.

I am trying to recognise and extract a random 50+ character string (UID) from images that are uploaded in my workflow. The strings need to be 100% correct in order to find the correct UID. In my current OCR results, I get random spaces and incorrectly recognised characters, as per @woodjohndavid's explanation. Below is my baseline example image, chosen to give the best possible results:

[image: baseline_test]

My quick and dirty bash test (FYI: building a web app with Python that will be the finished product):

./bash_test.sh
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 373
OSD: Weak margin (0.83) for 52 blob text block, but using orientation anyway: 0
--------
Wrong!!!
--------
RFx1BGE6te9IoPS3Sx5GM-9WUBTwrVSzCR1IJzStRqhBwjvsZm25Kw== (Correct)
RFx1BGE6te9ToPS3Sx5GM—9WUBTwrVSzZCRIIJzStRqhBwj vsZm25Kw== (Result)

My bash script for testing:

#!/bin/bash
# Bash script to test tesseract-ocr

tesseract /path/to/image/baseline_test.png output --psm 1
# I have tried all psm options. 13 is actually the best with this type of single line image
text=$(cat output.txt)

# Remove spaces
# text=${text//[[:blank:]]/}

if [[ "$text" == "RFx1BGE6te9IoPS3Sx5GM-9WUBTwrVSzCR1IJzStRqhBwjvsZm25Kw==" ]]; then
    echo "########"
    echo "Correct."
    echo "########"
else
    echo "--------"
    echo "Wrong!!!"
    echo "--------"
    echo "RFx1BGE6te9IoPS3Sx5GM-9WUBTwrVSzCR1IJzStRqhBwjvsZm25Kw== (Correct)"
    echo "${text} (Result)"
    echo "========"
fi
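Since the finished product is meant to be in Python, the same comparison can be sketched there with the standard library's `difflib`, which also shows where the two strings diverge rather than only that they differ. The helper name `describe_differences` is an assumption for illustration; the two strings are copied from the log above.

```python
import difflib

def describe_differences(expected, actual):
    """Return (operation, expected_piece, actual_piece) for each mismatching span."""
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    return [
        (tag, expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

expected = "RFx1BGE6te9IoPS3Sx5GM-9WUBTwrVSzCR1IJzStRqhBwjvsZm25Kw=="
actual = "RFx1BGE6te9ToPS3Sx5GM—9WUBTwrVSzZCRIIJzStRqhBwj vsZm25Kw=="

# Print one line per mismatch: replace, insert, or delete.
for tag, want, got in describe_differences(expected, actual):
    print(f"{tag:8} expected={want!r} got={got!r}")
```

Reading the two strings side by side, the mismatches are `I` read as `T`, the hyphen read as a long dash, an extra `Z`, `1` read as `I`, and an inserted space.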

Do you have any thoughts or suggestions on this issue? Is it possible for Tesseract-ocr to recognise these long UIDs?

Looking forward to your response, thank you.

Collaborator

amitdo commented Jan 28, 2020

> tessedit_flip_0O 0
> load_system_dawg 0
> load_freq_dawg 0
> language_model_min_compound_length 1
> language_model_penalty_increment 0.0
> language_model_penalty_punc 0.0
> language_model_penalty_spacing 0.0
> language_model_penalty_script 0.0
> language_model_penalty_non_dict_word 0.0
> language_model_penalty_chartype 0.0
> language_model_penalty_case 0.0
> language_model_penalty_non_freq_dict_word 0.0
>
> To be honest, I don't even know if this makes any difference, or whether the LSTM engine (which I am using) pays attention to these settings.

tessedit_flip_0O and anything that starts with language_model_ are ignored by the LSTM engine.
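Applied to the config list quoted above, that rule can be sketched as a small filter. This only encodes the statement in this comment; it makes no claim that the remaining options have any effect under the LSTM engine. The function name `not_ignored_by_lstm` is hypothetical.

```python
CONFIG_LINES = """\
tessedit_flip_0O 0
load_system_dawg 0
load_freq_dawg 0
language_model_min_compound_length 1
language_model_penalty_increment 0.0
language_model_penalty_punc 0.0
language_model_penalty_spacing 0.0
language_model_penalty_script 0.0
language_model_penalty_non_dict_word 0.0
language_model_penalty_chartype 0.0
language_model_penalty_case 0.0
language_model_penalty_non_freq_dict_word 0.0
""".splitlines()

def not_ignored_by_lstm(lines):
    """Drop options that, per the comment above, the LSTM engine ignores."""
    kept = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name = line.split()[0]
        if name == "tessedit_flip_0O" or name.startswith("language_model_"):
            continue
        kept.append(line)
    return kept

print("\n".join(not_ignored_by_lstm(CONFIG_LINES)))
```

With the list from this thread, only the two `load_*_dawg` lines are left.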

Collaborator

amitdo commented Apr 24, 2020

@ahri, about the whitelist issue: see #2760 (comment).

Contributor

bertsky commented Mar 16, 2021

> about the whitelist issue.
>
> #2760 (comment)

The better reference would be here – the reason for the current behaviour of white/blacklisting – which is indeed of little practical use – is the narrowness of the default beam in the LSTM decoder. The lstm_choice_mode option (going deeper by creating different beams again and again) unfortunately does not help that. (It is only used for GetChoiceIterator, not to prevent null hypotheses when the user dict does not allow certain choices. Plus it only works for certain LSTM models.)
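A toy model may make the beam-width argument more concrete. It is not Tesseract's decoder: the per-position scores are invented, the pruning is done per position rather than over whole sequences, and the function name `decode` is hypothetical. It only illustrates how a whitelisted character can be lost when pruning happens before the whitelist is applied.

```python
def decode(steps, allowed, beam_width):
    """Keep the top `beam_width` candidates per position, then apply the whitelist."""
    out = []
    for scores in steps:
        # Prune first: only the highest-scoring candidates survive.
        kept = sorted(scores, key=scores.get, reverse=True)[:beam_width]
        # Whitelist second: it can only choose among what pruning left over.
        survivors = [ch for ch in kept if ch in allowed]
        if survivors:
            out.append(survivors[0])
        # Otherwise nothing whitelisted survived, so the position is dropped.
    return "".join(out)

# Invented scores for four character positions (assumption, for illustration only).
steps = [
    {"e": 0.97, "c": 0.03},
    {"S": 0.55, "5": 0.44, "s": 0.01},
    {"8": 0.98, "B": 0.02},
    {"1": 0.90, "l": 0.10},
]
hex_chars = set("0123456789abcdef")

print(decode(steps, hex_chars, beam_width=1))  # e81
print(decode(steps, hex_chars, beam_width=2))  # e581
```

With a width of 1 the `5` never reaches the whitelist step, so the character disappears, which mirrors the missing `5` reported at the top of this issue.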
