param `char_whitelist` for `Text::OCRTesseract::create()` should be an empty string instead of null which fallbacks to `[0-9a-zA-Z]` #3457

n0099 · 2023-03-09T12:11:30Z

System information (version)

OpenCV => 4.7.0
Operating System / Platform => Windows 8.1 and Ubuntu 22.04
Compiler => ❔

Detailed description

opencv_contrib/modules/text/include/opencv2/text/ocr.hpp

Lines 156 to 157 in ed1873b

    @param char_whitelist specifies the list of characters used for recognition. NULL defaults to
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".

This behavior cost me hours figuring out why recognizing CJK characters with Tesseract works in Emgu.CV but not in OpenCvSharp:
shimat/opencvsharp#1542
shimat/opencvsharp#873
shimat/opencvsharp#1364

Steps to reproduce

Issue submission checklist

I report the issue, it's not a question
I checked the problem with documentation, FAQ, open issues,
forum.opencv.org, Stack Overflow, etc and have not found any solution
I updated to the latest OpenCV version and the issue is still there
There is reproducer code and related data files: videos, images, onnx, etc


Kumataro · 2023-03-18T03:56:44Z

Hi, with OpenCV 4.x/3.4, the following two statements produce different outputs. I feel this makes the usage a little hard to explain.

https://docs.opencv.org/4.7.0/d7/ddc/classcv_1_1text_1_1OCRTesseract.html#a391b1e753f0b779b72204ec15200a99a

char_whitelist
specifies the list of characters used for recognition.
NULL defaults to "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".

I ran a test program.

ocr = cv::text::OCRTesseract::create(NULL, "jpn") ;
-> whitelist is [0-9a-zA-Z] (as documented).
ocr = cv::text::OCRTesseract::create(NULL, "jpn", "") ;
-> whitelist is disabled (this is not described in the documentation).
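The two calls above differ only in whether `char_whitelist` is a null pointer or an empty string. As a minimal sketch of that distinction, modelled without Tesseract itself: `effectiveWhitelist` below is a hypothetical helper written for illustration, not a function in the text module, and it only mirrors the documented fallback and the behavior observed in the test.

```cpp
#include <string>

// Hypothetical helper (not part of OpenCV). It models the behavior
// described above:
//   NULL        -> falls back to the documented alphanumeric default
//   "" (empty)  -> no whitelist restriction
//   other value -> used as given
std::string effectiveWhitelist(const char* char_whitelist)
{
    static const char* const kDefault =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    return char_whitelist == nullptr ? std::string(kDefault)
                                     : std::string(char_whitelist);
}
```

Under this model, `effectiveWhitelist(nullptr)` yields the 62-character default while `effectiveWhitelist("")` yields an empty string, which is why the two `create()` calls behave differently.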

There are two possible countermeasures.

Option 1) Document that the whitelist can be disabled by passing "" instead of NULL or a user-specified whitelist.
Option 2) Fix the program to disable the whitelist if lang is neither "eng" nor NULL and the whitelist is NULL. (And update the documentation accordingly.)

Which option is better?
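To make option 2 concrete, here is a rough sketch of the selection rule it describes. `resolveWhitelistOption2` is a hypothetical name, this is not an actual patch, and treating a combined language such as "eng+jpn" as non-English is an assumption made only for this illustration.

```cpp
#include <cstring>
#include <string>

// Hypothetical sketch of option 2 (not an actual OpenCV patch).
// A user-specified whitelist always wins, including "" to disable it.
// With a NULL whitelist, only lang == NULL or "eng" keeps the
// alphanumeric default; any other language disables the whitelist.
// Assumption: "eng+jpn" counts as non-English here.
std::string resolveWhitelistOption2(const char* lang, const char* char_whitelist)
{
    if (char_whitelist != nullptr)
        return char_whitelist;

    const bool isEnglish = (lang == nullptr) || std::strcmp(lang, "eng") == 0;
    if (isEnglish)
        return "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    return "";  // whitelist disabled for non-English languages
}
```

The trade-off is that existing callers using "eng" or NULL see no change, while callers using other languages get different results without changing their code.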

Environment

kmtr@ubuntu:~/work/study-c3457$ uname -a
Linux ubuntu 4.15.0-142-generic #146~16.04.1-Ubuntu SMP Tue Apr 13 09:26:57 UTC 2021 i686 i686 i686 GNU/Linux
kmtr@ubuntu:~/work/study-c3457$ tesseract -v
tesseract 3.04.01
 leptonica-1.73
  libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.2

OpenCV/OpenCV_contrib is at the 3.4 branch (2023/3/18).

Sample code

// g++ main.cpp -o a.out  -l opencv_core  -l opencv_imgcodecs  -l opencv_text

#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/text.hpp>
#include <iostream>

// Runs OCR once with the given language and whitelist, then prints the
// inputs and the recognized text. NULL is a valid value for either argument.
void trial(cv::Mat &img, const char *lang, const char *whitelist)
{
  cv::Ptr<cv::text::OCRTesseract> ocr =
    cv::text::OCRTesseract::create(NULL, lang, whitelist);

  std::string text;
  ocr->run(img, text);

  if ( lang == NULL ) {
      std::cout << "[INPUT ] lang = NULL";
  }else{
      std::cout << "[INPUT ] lang = \"" << lang << "\"";
  }
  if ( whitelist == NULL ) {
      std::cout << " whitelist = NULL" << std::endl;
  }else{
      std::cout << " whitelist = \"" << whitelist << "\"" << std::endl;
  }
  std::cout << "[OUTPUT] result is " << text << std::endl;
}

int main(void)
{
  cv::Mat img = cv::imread("MPLUS1.JPG", cv::IMREAD_COLOR);
  if ( img.empty() ) {
      std::cerr << "Failed to read MPLUS1.JPG" << std::endl;
      return 1;
  }

  // Lowercase letters and digits only (no uppercase).
  const char *alnum = "0123456789abcdefghijklmnopqrstuvwxyz";

  // Case 1: whitelist is NULL (falls back to the documented default).
  trial(img, NULL,      NULL);
  trial(img, "eng",     NULL);
  trial(img, "jpn",     NULL);
  trial(img, "eng+jpn", NULL);
  // Case 2: user-specified whitelist.
  trial(img, NULL,      alnum);
  trial(img, "eng",     alnum);
  trial(img, "jpn",     alnum);
  trial(img, "eng+jpn", alnum);
  // Case 3: empty string (whitelist disabled).
  trial(img, NULL,      "");
  trial(img, "eng",     "");
  trial(img, "jpn",     "");
  trial(img, "eng+jpn", "");
  return 0;
}

Test image

Result

$ ./a.out
[INPUT ] lang = NULL whitelist = NULL
[OUTPUT] result is 2023l03l18 CIVKZBKL 113510 Hello World


[INPUT ] lang = "eng" whitelist = NULL
[OUTPUT] result is 2023l03l18 CIVKZBKL 113510 Hello World


[INPUT ] lang = "jpn" whitelist = NULL
[OUTPUT] result is 2023 03 ー8   C    。 HeH0 W。r d


[INPUT ] lang = "eng+jpn" whitelist = NULL
[OUTPUT] result is 2023l03l18 CIVKZBKL 113510 Hello World


[INPUT ] lang = NULL whitelist = "0123456789abcdefghijklmnopqrstuvwxyz"
[OUTPUT] result is 2023l03l18 llvli510x 113510 klello world


[INPUT ] lang = "eng" whitelist = "0123456789abcdefghijklmnopqrstuvwxyz"
[OUTPUT] result is 2023l03l18 llvli510x 113510 klello world


[INPUT ] lang = "jpn" whitelist = "0123456789abcdefghijklmnopqrstuvwxyz"
[OUTPUT] result is 2023 03 ー8   g    。  e 0 w0r d


[INPUT ] lang = "eng+jpn" whitelist = "0123456789abcdefghijklmnopqrstuvwxyz"
[OUTPUT] result is 2023l03l18 llvli510x 113510 klello world


[INPUT ] lang = NULL whitelist = ""
[OUTPUT] result is 2023/03/18 CNKZBKL 1135?.) Hello,World.


[INPUT ] lang = "eng" whitelist = ""
[OUTPUT] result is 2023/03/18 CNKZBKL 1135?.) Hello,World.


[INPUT ] lang = "jpn" whitelist = ""
[OUTPUT] result is 2023/03/ー8 こんにちわヽ 世界。 He=。,w。rーd.


[INPUT ] lang = "eng+jpn" whitelist = ""
[OUTPUT] result is 2023/03/18 こんにちわヽ 世界。 Hello,World.

n0099 · 2023-03-18T05:24:44Z

Thanks for the comprehensive sample demonstrating this issue.

Personally I would prefer option 2. If we only warn about this in the OpenCV documentation, users might not notice the fallback behavior until they encounter images without any Latin characters being recognized as a mysterious alphanumeric soup. Also, users of wrapper libraries in languages other than C/C++ and Python might not check the latest official OpenCV documentation very often (and Google is still indexing the documentation for the 3.x branch).

But this option is basically a BC-breaking change. I don't know the details of OpenCV's policy on introducing BC breaks and version numbering, so I'm curious: if this option is approved, will it only affect the next major version, which is 5.0?

Edit: I've found some historical info about versioning:
breaking changes in 3.0: https://docs.opencv.org/4.x/db/dfa/tutorial_transition_guide.html
4.0 has some minor changes and deprecates some of the legacy C API (which will be totally removed in 5.0): https://stackoverflow.com/questions/53906178/how-opencv-4-x-api-is-different-from-previous-version

Kumataro · 2023-03-21T00:42:36Z

About breaking change

Many libraries (including OpenCV) basically don't want breaking changes in minor version upgrades.
If we want to introduce breaking changes, it is better to do so at a major version upgrade.

Applications should avoid worrying about library versions as much as possible.

From time to time, the library needs to be updated for some reason.
When that happens, it is a big inconvenience for the library user (the application) if compilation fails or calculation results change.

So if we change the interface or change the calculation results, we need a compelling reason.

Is default [0-9a-zA-Z] good for any language?

I verified the contents of the dictionary data in tesseract.

For example: https://github.com/tesseract-ocr/langdata/blob/main/eng/eng.wordlist

Currently the text module implementation doesn't accept those wordlists as the default char_whitelist.

Ligature versions are also registered. For example, a variation using "ﬆ", which is a combination of "s" and "t", is registered.

Master
Maﬆer

A shortened version is also registered.

people's / People's

Versions with prefixes and suffixes are also registered.

people, / people.
people? / people! / people)
"People / People: / People.

The wordlists currently contain non-[0-9a-zA-Z] characters, so I think it's difficult to explain technically why char_whitelist defaults to [0-9a-zA-Z].
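The observation above can be checked mechanically. The following is a small sketch that flags wordlist entries which cannot be spelled using only [0-9a-zA-Z]; both function names are hypothetical, and the check works on raw bytes, so any non-ASCII (multi-byte UTF-8) character such as the "ﬆ" ligature counts as outside the whitelist.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Returns true if every byte of `word` is an ASCII letter or digit,
// i.e. the word fits entirely inside the [0-9a-zA-Z] whitelist.
bool fitsAlnumWhitelist(const std::string& word)
{
    for (unsigned char c : word) {
        if (c >= 0x80 || !std::isalnum(c))
            return false;
    }
    return true;
}

// Collects the entries that contain at least one byte outside [0-9a-zA-Z].
std::vector<std::string> wordsOutsideWhitelist(const std::vector<std::string>& words)
{
    std::vector<std::string> rejected;
    for (const std::string& w : words) {
        if (!fitsAlnumWhitelist(w))
            rejected.push_back(w);
    }
    return rejected;
}
```

Applied to the entries quoted above, "Master" passes, while "Maﬆer", "people's", "people," and "People:" are all flagged.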

I agree with this issue's proposal. I propose setting the default char_whitelist to "" (an empty string) from OpenCV 4.8.0/3.20.0 ~~3.19.0~~ (even if it breaks backwards compatibility between minor versions).

I believe this will improve recognition accuracy not only for languages containing non-ASCII characters such as CJK, but also for English (especially for sentences).

n0099 · 2023-03-21T04:09:26Z

LGTM

n0099 · 2023-03-21T04:21:21Z

> Currently text module implementation doesn't accept those wordlists as default char_whitelist.

In fact, passing an empty string as the value of the char_whitelist param just means using all available chars in the wordlist from langdata, since no chars outside the wordlist will be recognized.

> set default char_whitelist to ""(null strings) from OpenCV 4.8.0/3.19.0 (even if it breaks backwards compatibility between minor versions).

Will there be any new minor version released for the 3.x branch?

Kumataro · 2023-03-21T05:55:19Z

Hi, I tried to make a PR.

The text module seems to have no tests for character recognition.

https://github.com/opencv/opencv_contrib/tree/3.4/modules/text/test

Knowing which language data is installed is a prerequisite for testing character recognition.
But the current text module does not seem to have such a function.

Adding test support to the text module is likely to be more difficult than writing this patch.

> Will there be any new minor version released for 3.x branch?

The current milestones are here: https://github.com/opencv/opencv/milestones

It seems that these release milestones are planned.

5.0
4.8.0 = master
3.4.20

The 3.4 branch is used only for bug fixes, not for implementing new features.

https://github.com/opencv/opencv/wiki/ChangeLog#version3419

ucool-wu · 2023-03-30T08:37:42Z

This problem also cost me several hours!

Kumataro · 2023-03-30T21:03:46Z

#3462

n0099 · 2023-05-03T18:52:34Z

I've found that this fallback was introduced when OCRTesseract was first added nine years ago: 5c89c78#diff-141682c94db0e250c47fdd7743c4d35f6dc0734e5d333e7e5b0f2a819548bd3bR82, affecting OpenCV 3.0.0-beta through 4.7.0.
What a longstanding feature it is!

Kumataro mentioned this issue Mar 21, 2023

text: change default char_whitelist parameter. #3462

Merged

6 tasks

asmorkalov closed this as completed Apr 17, 2023
