param char_whitelist for Text::OCRTesseract::create() should be an empty string instead of null which fallbacks to [0-9a-zA-Z] #3457

Closed
3 of 4 tasks
n0099 opened this issue Mar 9, 2023 · 9 comments

Comments

@n0099

n0099 commented Mar 9, 2023

System information (version)
  • OpenCV => 4.7.0
  • Operating System / Platform => Windows 8.1 and Ubuntu 22.04
  • Compiler => ❔
Detailed description

@param char_whitelist specifies the list of characters used for recognition. NULL defaults to
"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".

This behavior cost me hours figuring out why recognizing CJK characters with Tesseract works in Emgu.CV but not in OpenCvSharp:
shimat/opencvsharp#1542
shimat/opencvsharp#873
shimat/opencvsharp#1364

Steps to reproduce
Issue submission checklist
  • I report the issue, it's not a question
  • I checked the problem with documentation, FAQ, open issues,
    forum.opencv.org, Stack Overflow, etc and have not found any solution
  • I updated to the latest OpenCV version and the issue is still there
  • There is reproducer code and related data files: videos, images, onnx, etc
@Kumataro
Contributor

Hi, with OpenCV 4.x/3.4, the following two statements produce different outputs. I feel this makes the usage a little hard to explain.

https://docs.opencv.org/4.7.0/d7/ddc/classcv_1_1text_1_1OCRTesseract.html#a391b1e753f0b779b72204ec15200a99a

char_whitelist
specifies the list of characters used for recognition.
NULL defaults to "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".

I ran a test program.

  1. ocr = cv::text::OCRTesseract::create(NULL, "jpn") ;
    -> whitelist is [0-9a-zA-Z]. (As documented.)
  2. ocr = cv::text::OCRTesseract::create(NULL, "jpn", "") ;
    -> whitelist is disabled. (This is not described in the documentation.)

There are two possible countermeasures.

  • Option 1) Document that the whitelist can be disabled by specifying "" instead of NULL or a user-specified whitelist.
  • Option 2) Fix the program to disable the whitelist if lang != "eng" (or NULL) and whitelist is NULL. (And document it...)

Which option is better?
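
To make the difference between the current behavior and Option 2 concrete, here is a minimal sketch in plain C++. The helper names `currentWhitelist` and `option2Whitelist` are hypothetical and are not part of OpenCV. They only model which whitelist ends up being used, based on the documented fallback and the observation above, and they assume that a NULL lang means English. An empty result means the whitelist is disabled.

```cpp
#include <string>

// Documented fallback used when char_whitelist is NULL.
static const char* const kDefaultWhitelist =
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

// Hypothetical model of the current behavior: NULL falls back to
// [0-9a-zA-Z]; any other value (including "") is used as given.
std::string currentWhitelist(const char* lang, const char* whitelist)
{
    (void)lang;  // the language does not influence the fallback today
    return whitelist == nullptr ? kDefaultWhitelist : whitelist;
}

// Hypothetical model of Option 2: keep the fallback only for English
// (lang == NULL or "eng"); otherwise a NULL whitelist disables filtering.
std::string option2Whitelist(const char* lang, const char* whitelist)
{
    if (whitelist != nullptr)
        return whitelist;
    const bool english = (lang == nullptr) || (std::string(lang) == "eng");
    return english ? kDefaultWhitelist : "";
}
```

Under this model, `currentWhitelist("jpn", nullptr)` still returns the alphanumeric default, while `option2Whitelist("jpn", nullptr)` returns an empty string. Note that a combined language such as "eng+jpn" is treated as non-English here, which is one possible reading of Option 2.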

Environment

kmtr@ubuntu:~/work/study-c3457$ uname -a
Linux ubuntu 4.15.0-142-generic #146~16.04.1-Ubuntu SMP Tue Apr 13 09:26:57 UTC 2021 i686 i686 i686 GNU/Linux
kmtr@ubuntu:~/work/study-c3457$ tesseract -v
tesseract 3.04.01
 leptonica-1.73
  libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.2

OpenCV/OpenCV_contrib are on the 3.4 branch (2023/3/18).

Sample code

// g++ main.cpp -o a.out  -l opencv_core  -l opencv_imgcodecs  -l opencv_text

#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/text.hpp>
#include <iostream>
#include <string>

// Run OCR once with the given language and whitelist, then print the
// inputs (NULL is printed literally) followed by the recognized text.
void trial(cv::Mat &img, const char *lang, const char *whitelist)
{
  cv::Ptr<cv::text::OCRTesseract> ocr =
    cv::text::OCRTesseract::create(NULL, lang, whitelist) ;

  std::string text;
  ocr->run(img, text);

  if ( lang == NULL ) {
      std::cout << "[INPUT ] lang = NULL";
  }else{
      std::cout << "[INPUT ] lang = \"" << lang << "\"";
  }
  if ( whitelist == NULL ) {
      std::cout << " whitelist = NULL" << std::endl;
  }else{
      std::cout << " whitelist = \"" << whitelist << "\"" << std::endl;
  }
  std::cout << "[OUTPUT] result is " << text << std::endl;
}

int main(void)
{
  cv::Mat img = cv::imread("MPLUS1.JPG",1);
  if ( img.empty() ) {
      std::cerr << "Failed to read MPLUS1.JPG" << std::endl;
      return 1;
  }

  // User-specified whitelist: digits and lowercase letters only.
  const char *alnum = "0123456789abcdefghijklmnopqrstuvwxyz";

  // Each block tries the same four languages with a different whitelist.
  trial(img, NULL,      NULL);
  trial(img, "eng",     NULL);
  trial(img, "jpn",     NULL);
  trial(img, "eng+jpn", NULL);
  trial(img, NULL,      alnum);
  trial(img, "eng",     alnum);
  trial(img, "jpn",     alnum);
  trial(img, "eng+jpn", alnum);
  trial(img, NULL,      "");
  trial(img, "eng",     "");
  trial(img, "jpn",     "");
  trial(img, "eng+jpn", "");
  return 0;
}

Test image

image

Result

$ ./a.out
[INPUT ] lang = NULL whitelist = NULL
[OUTPUT] result is 2023l03l18 CIVKZBKL 113510 Hello World


[INPUT ] lang = "eng" whitelist = NULL
[OUTPUT] result is 2023l03l18 CIVKZBKL 113510 Hello World


[INPUT ] lang = "jpn" whitelist = NULL
[OUTPUT] result is 2023 03 ー8   C    。 HeH0 W。r d


[INPUT ] lang = "eng+jpn" whitelist = NULL
[OUTPUT] result is 2023l03l18 CIVKZBKL 113510 Hello World


[INPUT ] lang = NULL whitelist = "0123456789abcdefghijklmnopqrstuvwxyz"
[OUTPUT] result is 2023l03l18 llvli510x 113510 klello world


[INPUT ] lang = "eng" whitelist = "0123456789abcdefghijklmnopqrstuvwxyz"
[OUTPUT] result is 2023l03l18 llvli510x 113510 klello world


[INPUT ] lang = "jpn" whitelist = "0123456789abcdefghijklmnopqrstuvwxyz"
[OUTPUT] result is 2023 03 ー8   g    。  e 0 w0r d


[INPUT ] lang = "eng+jpn" whitelist = "0123456789abcdefghijklmnopqrstuvwxyz"
[OUTPUT] result is 2023l03l18 llvli510x 113510 klello world


[INPUT ] lang = NULL whitelist = ""
[OUTPUT] result is 2023/03/18 CNKZBKL 1135?.) Hello,World.


[INPUT ] lang = "eng" whitelist = ""
[OUTPUT] result is 2023/03/18 CNKZBKL 1135?.) Hello,World.


[INPUT ] lang = "jpn" whitelist = ""
[OUTPUT] result is 2023/03/ー8 こんにちわヽ 世界。 He=。,w。rーd.


[INPUT ] lang = "eng+jpn" whitelist = ""
[OUTPUT] result is 2023/03/18 こんにちわヽ 世界。 Hello,World.

@n0099
Author

n0099 commented Mar 18, 2023

Thanks for your comprehensive sample to represent this issue.

Personally I would prefer option 2. If we only warn about this in the OpenCV documentation, users might not notice the fallback behavior until they encounter images without any Latin characters being recognized as a mysterious alphanumeric soup. Also, users of wrapper libraries in languages other than C/C++ and Python might not check the latest official OpenCV documentation very often (and Google is still indexing the documentation of the 3.x branch).
image
However, this option is basically a backward-compatibility (BC) breaking change. I don't know the details of OpenCV's policy on introducing BC breaks and version numbering, so I'm curious: if this option is approved, will it only affect the next major version, which is 5.0?

Edit: I've found some historical info about versioning:
Breaking changes in 3.0: https://docs.opencv.org/4.x/db/dfa/tutorial_transition_guide.html
4.0 has some minor changes and deprecates some of the legacy C API (which will be removed entirely in 5.0): https://stackoverflow.com/questions/53906178/how-opencv-4-x-api-is-different-from-previous-version

@Kumataro
Contributor

Kumataro commented Mar 21, 2023

About breaking change

Many libraries (including OpenCV) generally avoid breaking changes in minor version upgrades.
If we want to introduce breaking changes, it is better to do so when the major version changes.

Applications should avoid worrying about library versions as much as possible.

From time to time, the library needs to be updated for some reason.
When that happens, it is a big inconvenience for the library user (the application) if compilation fails or the calculation results change.

So if we change the interface or change the calculation results, we need a compelling reason.

Is default [0-9a-zA-Z] good for any language?

I verified the contents of the dictionary data in tesseract.

For example: https://github.com/tesseract-ocr/langdata/blob/main/eng/eng.wordlist

Currently, the text module implementation doesn't accept those wordlists with the default char_whitelist.

  1. A ligature version is also registered. For example, variations of "st", which is a combination of "s" and "t", are also registered.
    • Master
    • Master
  2. A shortened version is also registered.
    • people's / People's
  3. Versions with prefixes and suffixes are also registered.
    • people, / people.
    • people? / people! / people)
    • "People / People: / People.

The wordlists currently contain non-[0-9a-zA-Z] characters, so I think it is difficult to justify technically why char_whitelist defaults to [0-9a-zA-Z].
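
As a rough illustration of this point, the following sketch lists the characters of a word that fall outside the documented default whitelist. The function name `outsideDefaultWhitelist` is hypothetical, and the check is a plain byte-wise test in C++ that does not call Tesseract or OpenCV, so it only shows which characters the default range cannot express.

```cpp
#include <string>

// Returns the characters of `word` that are outside ASCII [0-9a-zA-Z],
// i.e. the bytes the documented default char_whitelist does not contain.
// Multi-byte UTF-8 characters (ligatures, CJK) are returned byte by byte.
std::string outsideDefaultWhitelist(const std::string& word)
{
    std::string rejected;
    for (unsigned char c : word) {
        const bool allowed = (c >= '0' && c <= '9') ||
                             (c >= 'a' && c <= 'z') ||
                             (c >= 'A' && c <= 'Z');
        if (!allowed)
            rejected += static_cast<char>(c);
    }
    return rejected;
}
```

For the wordlist entries quoted above, `outsideDefaultWhitelist("people's")` yields the apostrophe and `outsideDefaultWhitelist("people,")` yields the comma, while a plain "Master" yields an empty string.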

I agree with this issue's proposal. I propose setting the default char_whitelist to "" (an empty string) from OpenCV 4.8.0/3.20.0 3.19.0 (even if it breaks backward compatibility between minor versions).

I believe this will improve recognition accuracy not only for languages containing non-ASCII characters such as CJK, but also for English (especially for sentences).

@n0099
Author

n0099 commented Mar 21, 2023

LGTM

@n0099
Author

n0099 commented Mar 21, 2023

Currently, the text module implementation doesn't accept those wordlists with the default char_whitelist.

In fact, passing an empty string as the value of the char_whitelist parameter just means using all available characters in the wordlist from langdata, since no characters outside the wordlist will be recognized.

set default char_whitelist to ""(null strings) from OpenCV 4.8.0/3.19.0 (even if it breaks backwards compatibility between minor versions).

Will there be any new minor version released for 3.x branch?

@Kumataro
Contributor

Hi, I tried to make a PR.

The text module does not seem to have any tests for character recognition.

https://github.com/opencv/opencv_contrib/tree/3.4/modules/text/test

Knowing which language data is installed for character recognition is a prerequisite for running tests.
However, the current text module does not seem to have a function for this.

Adding test support to the text module is likely to be more difficult than writing this patch.


Will there be any new minor version released for 3.x branch?

The current milestones are here: https://github.com/opencv/opencv/milestones

It seems that these release milestones are planned:

  • 5.0
  • 4.8.0 = master
  • 3.4.20

The 3.4 branch is used only for bug fixes, not for implementing new features.

https://github.com/opencv/opencv/wiki/ChangeLog#version3419

@ucool-wu

This problem also made me waste several hours!

@Kumataro
Contributor

#3462

@n0099
Author

n0099 commented May 3, 2023

I've found that this fallback has been there since OCRTesseract was first introduced nine years ago: 5c89c78#diff-141682c94db0e250c47fdd7743c4d35f6dc0734e5d333e7e5b0f2a819548bd3bR82, affecting OpenCV 3.0.0-beta through 4.7.0.
What a long-standing feature it is!
