Simple monospace text not correctly interpretted #2820

ahri · 2019-12-16T11:56:25Z

Environment

Tesseract Version: 4.1.0
Platform: OSX High Sierra, Linux Ubuntu Eoan

Current Behavior:

Config file:

tessedit_char_whitelist 0123456789abcdef

Image:

Command-line usage:

$ tesseract avatar.png - --psm 7 --oem 1 tess_config
eea6d7bbe81

Expected Behavior:

eea6d7bbe581

i.e.

eea6d7bbe81 (incorrect)
vs.
eea6d7bbe581 (correct)

To elaborate: I would like to detect only a single hexadecimal number, rendered in a monospace font, in a PNG.

ahri · 2019-12-16T12:29:07Z

It looks like it's detecting the text as eea6d7bbeS81 and then stripping the S as it doesn't match the whitelist. I expected whitelisting to constrain the problem space and therefore make it easier for Tesseract to read, but this is clearly not the case.
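
The effect described above can be reproduced outside Tesseract with plain string filtering. The sketch below is only an illustration: `strip_non_hex` and `map_lookalikes` are hypothetical helper names, and the look-alike mapping (S to 5, O to 0, I and l to 1) is an assumption that only makes sense because the expected output is known to be lowercase hex.

```bash
#!/bin/bash
# Hypothetical helpers for illustration; they are not part of Tesseract.

# Drop every character outside the hex whitelist, mimicking the
# "recognise first, strip afterwards" behaviour described above.
strip_non_hex() {
    printf '%s\n' "$1" | tr -cd '0123456789abcdef\n'
}

# Map a few look-alike glyphs to digits before filtering.
# Assumed mapping: S->5, O->0, I->1, l->1.
map_lookalikes() {
    printf '%s\n' "$1" | tr 'SOIl' '5011'
}

raw="eea6d7bbeS81"                          # what Tesseract appears to see
strip_non_hex "$raw"                        # the S is dropped
strip_non_hex "$(map_lookalikes "$raw")"    # the S becomes 5, then passes the filter
```

Running Tesseract without the whitelist and post-processing its raw output this way is a workaround, not a fix; it cannot recover characters that were misread as another valid hex digit.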

woodjohndavid · 2019-12-19T20:24:57Z

I am also trying to use Tesseract to OCR random strings of letters and numbers mixed together, and I have the same general problem as you describe, with Tesseract mixing up 'S' and '5' and also '1' and 'I'.

Tesseract is primarily designed to recognize words and determine what characters are present by what should be there for the word to be valid. So it doesn't naturally deal well with non-word strings.

The only suggestion I have is the following list of config file parameters that I am using to try to prevent Tesseract from using the word-matching method and instead just use a character by character recognition approach:

tessedit_flip_0O 0
load_system_dawg 0
load_freq_dawg 0
language_model_min_compound_length 1
language_model_penalty_increment 0.0
language_model_penalty_punc 0.0
language_model_penalty_spacing 0.0
language_model_penalty_script 0.0
language_model_penalty_non_dict_word 0.0
language_model_penalty_chartype 0.0
language_model_penalty_case 0.0
language_model_penalty_non_freq_dict_word 0.0

To be honest, I don't even know if this makes any difference, or whether the LSTM engine (which I am using) pays attention to these settings.

ryanleonbutler · 2020-01-08T20:39:26Z

Hi Tesseract-ocr Team,

I am facing similar challenges as @woodjohndavid.

I am trying to recognise and extract a random 50+ character string (UID) from images that are uploaded in my workflow. The result needs to be 100% correct in order to find the matching UID. In my current OCR results, I get random spaces and incorrectly recognised characters, as per @woodjohndavid's explanation. Below is my example baseline image, prepared to give the best possible results:

My quick and dirty bash test (FYI: building a web app with Python that will be the finished product):

./bash_test.sh
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 373
OSD: Weak margin (0.83) for 52 blob text block, but using orientation anyway: 0
--------
Wrong!!!
--------
RFx1BGE6te9IoPS3Sx5GM-9WUBTwrVSzCR1IJzStRqhBwjvsZm25Kw== (Correct)
RFx1BGE6te9ToPS3Sx5GM—9WUBTwrVSzZCRIIJzStRqhBwj vsZm25Kw== (Result)

My bash script for testing:

#!/bin/bash
# Bash script to test tesseract-ocr

tesseract /path/to/image/baseline_test.png output --psm 1
# I have tried all psm options. 13 is actually the best with this type of single line image
text=$(cat output.txt)

# Remove spaces
# text=${text//[[:blank:]]/}

if [ "$text" == "RFx1BGE6te9IoPS3Sx5GM-9WUBTwrVSzCR1IJzStRqhBwjvsZm25Kw==" ];
then
    echo "########"
    echo "Correct."
    echo "########"
else
    echo "--------"
    echo "Wrong!!!"
    echo "--------"
    echo "RFx1BGE6te9IoPS3Sx5GM-9WUBTwrVSzCR1IJzStRqhBwjvsZm25Kw== (Correct)"
    echo "${text} (Result)"
    echo "========"
fi

Do you have any thoughts or suggestions on this issue? Is it possible for Tesseract-ocr to recognise these long UIDs?

Looking forward to your response, thank you.
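
The test script above only reports pass or fail. A minimal sketch of a position-by-position comparison, which lists where the two strings differ, might look like this; `diff_positions` is a hypothetical helper name, and character counting assumes ASCII input or a UTF-8 locale.

```bash
#!/bin/bash
# Hypothetical helper: print every 0-based index at which two strings differ.
diff_positions() {
    local expected=$1 actual=$2 i
    local max=${#expected}
    # Walk to the end of the longer string so trailing extras are reported too.
    if (( ${#actual} > max )); then
        max=${#actual}
    fi
    for (( i = 0; i < max; i++ )); do
        if [[ "${expected:i:1}" != "${actual:i:1}" ]]; then
            printf '%d: expected "%s", got "%s"\n' \
                "$i" "${expected:i:1}" "${actual:i:1}"
        fi
    done
}

# Example with short ASCII strings:
diff_positions "9IoPS3" "9ToPS3"
```

This is a plain index-wise comparison, not an alignment: once the OCR result contains an inserted or missing character, every later position is reported as different, so it is most useful for spotting the first point of divergence.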

amitdo · 2020-01-28T11:31:01Z

> tessedit_flip_0O 0
> load_system_dawg 0
> load_freq_dawg 0
> language_model_min_compound_length 1
> language_model_penalty_increment 0.0
> language_model_penalty_punc 0.0
> language_model_penalty_spacing 0.0
> language_model_penalty_script 0.0
> language_model_penalty_non_dict_word 0.0
> language_model_penalty_chartype 0.0
> language_model_penalty_case 0.0
> language_model_penalty_non_freq_dict_word 0.0
>
> To be honest, I don't even know if this makes any difference, or whether the LSTM engine (which I am using) pays attention to these settings.

tessedit_flip_0O and anything that starts with language_model_ are ignored by the LSTM engine.
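
Given that, the parameters the LSTM engine ignores can be filtered out of an existing config file. The following is a sketch only: `drop_lstm_ignored` and the file names in the usage comment are hypothetical. Whether the remaining load_system_dawg and load_freq_dawg settings change LSTM output is not established in this thread.

```bash
#!/bin/bash
# Hypothetical helper: print a config file without the parameters that,
# per the comment above, the LSTM engine ignores.
drop_lstm_ignored() {
    grep -v -e '^tessedit_flip_0O' -e '^language_model_' "$1"
}

# Usage (file names are assumptions):
#   drop_lstm_ignored tess_config_full > tess_config_lstm
#   tesseract avatar.png - --psm 7 --oem 1 tess_config_lstm
```

Applied to the twelve-line list quoted above, this leaves only the two dawg settings.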

amitdo · 2020-04-24T16:05:33Z

@ahri,

about the whitelist issue.

#2760 (comment)

bertsky · 2021-03-16T18:37:14Z

> about the whitelist issue.
>
> #2760 (comment)

The better reference would be here – the reason for the current behaviour of white/blacklisting – which is indeed of little practical use – is the narrowness of the default beam in the LSTM decoder. The lstm_choice_mode option (going deeper by creating different beams again and again) unfortunately does not help that. (It is only used for GetChoiceIterator, not to prevent null hypotheses when the user dict does not allow certain choices. Plus it only works for certain LSTM models.)

amitdo added the allowlist / denylist label Oct 1, 2021

dpward mentioned this issue Mar 11, 2023

Build failure with leptonica 1.83 zdenop/qt-box-editor#87
