
RFC: Situation with tests in Tesseract #1627

Closed
zamazan4ik opened this issue Jun 3, 2018 · 54 comments

Comments

@zamazan4ik
Contributor

zamazan4ik commented Jun 3, 2018

Hello.
I have some questions about the situation with tests in the Tesseract repo.

  1. I think we lack unit tests in the Tesseract repo. Here I see some tests, but too few in my opinion. Should we add more unit tests? Should we also write unit tests for the old engine, or only for the newer LSTM engine? We could also move the unit tests to CMake and integrate running them into Travis CI/AppVeyor.
  2. I am a little confused about how recognition quality is tested. How do we do it? I found this, but it seems to be outdated. Could we collect different images (only images with a suitable license, of course), prepare ground truth, and check Tesseract against that set for regressions? I think this is very important for an OCR engine. We could also integrate the regression tests into Travis CI/AppVeyor.
  3. We should test Tesseract with the Google sanitizers. For that we need some tests, a Tesseract build compiled with the sanitizers, and a way to run the tests against it. I think we would find some errors, and this would also help prevent mistakes in the future.
  4. (Hint) I suggest collecting the images from issues and adding them to our test set. I am trying to understand whether I can work on this for Tesseract. Would this work be welcome?
  5. As mentioned in "Use OSS-Fuzz for improved code quality" #1351, we should try to use OSS-Fuzz with Tesseract. At my last job we found a lot of problems with different Tesseract options (some of them led to crashes).

I suggest we discuss testing here.
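As a sketch of point 3, a sanitizer build could look like the following (the build directory name and the exact flags are assumptions, not an agreed project setting):

```shell
# Configure an AddressSanitizer + UBSan build of Tesseract with CMake,
# then run the unit tests against the instrumented binaries.
cmake -B build-asan \
      -DCMAKE_BUILD_TYPE=Debug \
      -DCMAKE_CXX_FLAGS="-fsanitize=address,undefined -fno-omit-frame-pointer"
cmake --build build-asan -j
ctest --test-dir build-asan --output-on-failure
```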

@jbreiden
Contributor

jbreiden commented Jun 4, 2018

Google has a bunch of tests that we should add to the repo. They will need some effort to get them to work there. Here's an example of one of them; note how we'll have to change things like ABSL_ARRAYSIZE:

#include "tesseract/ccstruct/statistc.h"
#include "tesseract/ccutil/genericvector.h"
#include "tesseract/ccutil/kdpair.h"

namespace {

const int kTestData[] = { 2, 0, 12, 1, 1, 2, 10, 1, 0, 0, 0, 2, 0, 4, 1, 1 };

class STATSTest : public testing::Test {
 public:
  void SetUp() {
    stats_.set_range(0, 16);
    for (int i = 0; i < ABSL_ARRAYSIZE(kTestData); ++i)
      stats_.add(i, kTestData[i]);
  }

  void TearDown() {
  }

  STATS stats_;
};

// Tests some basic numbers from the stats_.
TEST_F(STATSTest, BasicStats) {
  EXPECT_EQ(37, stats_.get_total());
  EXPECT_EQ(2, stats_.mode());
  EXPECT_EQ(12, stats_.pile_count(2));
}

// Tests the top_n_modes function.
TEST_F(STATSTest, TopNModes) {
  GenericVector<tesseract::KDPairInc<float, int> > modes;
  int num_modes = stats_.top_n_modes(3, &modes);
  EXPECT_EQ(3, num_modes);
  // Mode0 is 12 1 1 = 14 total count with a mean of 2 3/14.
  EXPECT_FLOAT_EQ(2.0f + 3.0f / 14, modes[0].key);
  EXPECT_EQ(14, modes[0].data);
  // Mode 1 is 2 10 1 = 13 total count with a mean of 5 12/13.
  EXPECT_FLOAT_EQ(5.0f + 12.0f / 13, modes[1].key);
  EXPECT_EQ(13, modes[1].data);
  // Mode 2 is 4 1 1 = 6 total count with a mean of 13.5.
  EXPECT_FLOAT_EQ(13.5f, modes[2].key);
  EXPECT_EQ(6, modes[2].data);
}

}  // namespace

@Shreeshrii
Collaborator

Shreeshrii commented Jun 4, 2018 via email

@jbreiden
Contributor

jbreiden commented Jun 4, 2018

Google has 56 files of tests for Tesseract. None of them will work as-is with the GitHub repo, but at least some could be adapted without too much effort. It might be a good starting point, especially for someone like @zamazan4ik who sounds excited about writing or improving tests.

@Shreeshrii
Collaborator

@jbreiden It will be great if you can add them to GitHub for @zamazan4ik to update. Thanks!

@stweil
Member

stweil commented Jun 4, 2018

@zamazan4ik, could you please add "RFC:" to the title of this issue ("RFC: Situation with tests in Tesseract")? That makes it clear that it is not a bug report.

@jbreiden, I also think that the available test code should be added to git, even if it is currently not integrated in the build process. Please add only text files (test code) to tesseract git. If there are also binaries (images, tessdata), they can be added to https://github.com/tesseract-ocr/test.

@zamazan4ik zamazan4ik changed the title Situation with tests in Tesseract RFC: Situation with tests in Tesseract Jun 4, 2018
@zamazan4ik
Contributor Author

@jbreiden Thank you for the information. It seems Google has unit tests; we can wait for them. But what about regression tests? Does Google have anything for this, or should we prepare images and ground truth ourselves?

@zamazan4ik
Contributor Author

@stweil If I have some images with/without ground truth for them, should I add them to https://github.com/tesseract-ocr/test? Are there any special requirements for test data?

@zamazan4ik
Contributor Author

I also want to clarify the situation with unit tests. Do we want to wait for the unit tests from Google, or start implementing our own?

@Shreeshrii
Collaborator

If you have any specific unit test in mind, please go ahead and implement it.

We can add the ones from Google as and when they are added to the repo and modified to work with the code in GitHub.

There was discussion in another thread about putting all binaries related to testing in a separate repo (test), which can be included as a submodule so that the tesseract repo does not become very large.

@zdenop has already created a new repo and the images used by current unittests should also be moved there.

Additionally, it might be possible to reduce some test file sizes for image files.

@zamazan4ik
Contributor Author

@Shreeshrii Has there been any discussion about measuring recognition quality between different Tesseract runs?

@stweil
Member

stweil commented Jun 4, 2018

I don't remember such discussions, but I think that measuring the quality (not only for text recognition, but also for layout recognition) should be part of the regression tests.

@Shreeshrii
Collaborator

No, that was not covered. Google may have these tests internally, since Ray puts statistics in his presentations, but nothing was mentioned in the context of the open-source code.

However, I think it is important to check for regressions, at least with some sample images to begin with.

The UNLV datasets cover only a limited set of languages. I would like us to be able to test each language and script, even if it is with a single one-page image.

That dataset might take some time to build, but if a framework for it can be set up, new language tests can be added as and when images and matching ground truth become available.

@Shreeshrii
Collaborator

For example the tests should catch cases like:

#682
LSTM: khmer is not working with --oem 1

@stweil
Member

stweil commented Jun 4, 2018

If I have some images with/without ground truth for them, should I add them to https://github.com/tesseract-ocr/test? Are there any special requirements for test data?

I would not use the tesseract-ocr repositories to collect all kinds of ground truth, but of course some examples are needed for the regression tests. We need them to measure the recognition error rate, and we need them if we have tests for training, too.

Maybe we could also start a Wiki page https://github.com/tesseract-ocr/tesseract/wiki/Ground-Truth to collect good sources of ground truth, like we collect information on fonts at https://github.com/tesseract-ocr/tesseract/wiki/Fonts.

@Shreeshrii
Collaborator

Shreeshrii commented Jun 4, 2018

That dataset might take some time to build,

Synthetic test data using a single font such as Noto Sans can be built from a single page of training text for each language. Since not all languages will have high accuracy, we can have an accuracy cutoff, or a parameter whose value can be set per language.

@stweil had also suggested at one point just loading all languages to make sure the traineddata files are valid and don't crash.

@Shreeshrii
Collaborator

we could also start a Wiki page https://github.com/tesseract-ocr/tesseract/wiki/Ground-Truth

That is a very good idea.

@zamazan4ik
Contributor Author

I am not sure we can easily find a good source of images and ground truth for Tesseract. But I can prepare some manually and publish them under an appropriate license.

There is also a large image collection here: https://github.com/renard314/textfairy/tree/master/test-images
But I am not sure whether we can use it for Tesseract.

@Shreeshrii
Collaborator

Shreeshrii commented Jun 4, 2018

Also see tesseract-ocr/tessdata_best#27 (comment),
which has accuracy reports for Khmer comparing the 4.0.0alpha file with a fine-tuned version.

@zamazan4ik
Contributor Author

Ok. I think I can start working on images and ground truth for them. I would also prefer to store images and other information needed for tests in the tesseract repos, because links can break.

Once we have some data, we should decide how to:

  1. Integrate it into the test workflow.
  2. Measure results between tesseract runs.

Do you have any ideas on how we can measure results? Should we use special tools, or can we use something simpler like Levenshtein distance per character/per word?

@zamazan4ik
Contributor Author

zamazan4ik commented Jun 4, 2018

Also, TextFairy's author has allowed the use of images from TextFairy for Tesseract testing: https://github.com/renard314/textfairy/tree/master/test-images

I think I can prepare ground truth for some of them.

@stweil
Member

stweil commented Jun 4, 2018

Storing binaries in Git results in very large repositories, so there are good reasons to keep the source repository small for builds without tests. Repositories with test data can be included as Git submodules, then there won't be problems with broken links. The tests only have to check whether the needed submodules are available and only run if they do.

@zamazan4ik
Contributor Author

@stweil Of course. There is no reason to store test scripts and test data in one repo. You are right.

@Shreeshrii
Collaborator

Shreeshrii commented Jun 5, 2018 via email

@zamazan4ik
Contributor Author

As a good starting point, we can upload these files to our test data repository: https://code.google.com/archive/p/isri-ocr-evaluation-tools/downloads

Inside there are a lot of already-binarized images with ground truth for every image. We can then run Tesseract on every file (parallelized with GNU Parallel for speed).
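A parallel run over such a dataset could be sketched like this (the file layout and output paths are hypothetical):

```shell
# Hypothetical layout: images/*.tif with matching ground-truth *.txt files.
# Run Tesseract on every image in parallel; {/.} is the input basename
# without extension, so each result lands in out/<page>.txt.
mkdir -p out
find images -name '*.tif' | parallel 'tesseract {} out/{/.} --oem 1'
```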

@Shreeshrii
Collaborator

Shreeshrii commented Jun 5, 2018 via email

@Shreeshrii
Collaborator

Also see https://github.com/tesseract-ocr/tesseract/wiki/UNLV-Testing-of-Tesseract#example-results

for comparison of results up to 3.04.01

@Shreeshrii
Collaborator

FYI, so that there is no duplication of efforts.

I am changing the current unit tests to use the test submodule, and updating the instructions and scripts for the UNLV tests.

@Shreeshrii
Collaborator

@stweil @zdenop Have you run the UNLV testsuite recently?

@zdenop
Contributor

zdenop commented Jun 6, 2018

I did not.

@Shreeshrii
Collaborator

Shreeshrii commented Jun 6, 2018

@stweil I am running into an error with the unlvtests. Need your help to fix.

The shell script at the end calls these two C programs, passing a list of space-separated filenames in $accfiles and $wafiles. I tried the two variations below; both fail with an error. Maybe the string needs to be split?

  unlvtests/ocreval/bin/accsum "$accfiles >unlvtests/reports/$setname.characc"
  unlvtests/ocreval/bin/wordaccsum "$wafiles" >"unlvtests/reports/$setname.wordacc"

The programs check for the following:

    initialize(&argc, argv, usage, NULL);
    if (argc < 2)
        error("not enough input files", Exit);

I am getting the error 'not enough input files' even though a long list is given. Here is the output from a test sample.

+ unlvtests/ocreval/bin/ocrevalutf8 unlvtests/ocreval/bin/accuracy /home/ubuntu/ISRI-OCRtk/bus.3B/0/8520_001.3B.txt unlvtests/results/bus.3B/8520_001.3B.unlv
+ accfiles=' unlvtests/results/bus.3B/8500_001.3B.acc unlvtests/results/bus.3B/8510_001.3B.acc unlvtests/results/bus.3B/8520_001.3B.acc'
+ unlvtests/ocreval/bin/ocrevalutf8 unlvtests/ocreval/bin/wordacc /home/ubuntu/ISRI-OCRtk/bus.3B/0/8520_001.3B.txt unlvtests/results/bus.3B/8520_001.3B.unlv
+ wafiles=' unlvtests/results/bus.3B/8500_001.3B.wa unlvtests/results/bus.3B/8510_001.3B.wa unlvtests/results/bus.3B/8520_001.3B.wa'
+ read page dir
+ unlvtests/ocreval/bin/accsum ' unlvtests/results/bus.3B/8500_001.3B.acc unlvtests/results/bus.3B/8510_001.3B.acc unlvtests/results/bus.3B/8520_001.3B.acc >unlvtests/reports/bus.3B.characc'
accsum: not enough input files
+ unlvtests/ocreval/bin/wordaccsum ' unlvtests/results/bus.3B/8500_001.3B.wa unlvtests/results/bus.3B/8510_001.3B.wa unlvtests/results/bus.3B/8520_001.3B.wa'
wordaccsum: not enough input files

@Shreeshrii
Collaborator

I removed the quote marks from the following and it seems to be working. (Quoting "$accfiles" passes the whole space-separated list as a single argument, so accsum sees argc < 2; in the first line the redirection was even inside the quotes and became part of that argument.)

unlvtests/ocreval/bin/accsum "$accfiles >unlvtests/reports/$setname.characc"
unlvtests/ocreval/bin/wordaccsum "$wafiles" >"unlvtests/reports/$setname.wordacc"
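For reference, the working form keeps the filename list unquoted (so the shell splits it into separate arguments) and the redirection outside any quotes:

```shell
unlvtests/ocreval/bin/accsum $accfiles >"unlvtests/reports/$setname.characc"
unlvtests/ocreval/bin/wordaccsum $wafiles >"unlvtests/reports/$setname.wordacc"
```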

@stweil
Member

stweil commented Jun 6, 2018

Does pull request #1640 fix the problem with the UNLV tests?

@Shreeshrii
Collaborator

Shreeshrii commented Jun 7, 2018

Thanks @stweil. I had removed all quotes from those two lines locally and the script was working. I will test with PR #1640 later today.

I am wondering, though, whether the problem is related to using a different version of bash, since the quotes were there in the original script too; I was mostly just changing the file paths.

Here are the results:

Testid      Testset  Character                Word                     Non-stopword             Time
                     Errors   Acc     Change  Errors  Acc     Change   Errors  Acc     Change

1995        bus.3B     5959   98.14%   0.00%   1631   96.83%    0.00%   1293   95.73%    0.00%
1995        doe3.3B   36349   97.52%   0.00%   7826   96.34%    0.00%   7042   94.87%    0.00%
1995        mag.3B    15043   97.74%   0.00%   4566   96.01%    0.00%   3379   94.99%    0.00%
1995        news.3B    6432   98.69%   0.00%   1946   97.68%    0.00%   1502   96.94%    0.00%

4.0.0-beta  bus.3B     6158   98.10%    3.34%   1136  97.88%  -30.35%    961   97.06%  -25.68%   4500.68s
4.0.0-beta  doe3.3B   29914   97.97%  -17.70%  13716  94.48%   75.26%  13113   92.42%   86.21%  19882.96s
4.0.0-beta  mag.3B    10946   98.37%  -27.24%   3337  97.16%  -26.92%   2807   96.07%  -16.93%   7322.79s
4.0.0-beta  news.3B    5678   98.85%  -11.72%   1308  98.46%  -32.79%   1033   97.96%  -31.23%   5651.85s
4.0.0-beta  Total     52696   -       -17.38%  19497  -        22.09%  17914   -        35.55%

3.02.02     bus.3B     6039   98.11%    1.34%   1541  97.01%   -5.52%   1240   95.90%   -4.10%
3.02.02     doe3.3B   35988   97.54%   -0.99%   8482  96.03%    8.38%   7640   94.43%    8.49%
3.02.02     mag.3B    14367   97.84%   -4.49%   3891  96.60%  -14.78%   3024   95.52%  -10.51%
3.02.02     news.3B    7148   98.55%   11.13%   1484  98.23%  -23.74%   1152   97.65%  -23.30%
3.02.02     Total     63542   -        -0.38%  15398  -        -3.58%  13056   -        -1.21%

I also included 3.02.02 results from https://github.com/tesseract-ocr/tesseract/wiki/UNLV-Testing-of-Tesseract - these were originally reported in the tesseract-ocr forum.

@Shreeshrii
Collaborator

Shreeshrii commented Jun 7, 2018

The original post by Tom Morris was in tesseract-dev group (not tesseract-ocr)

Here is the link to the discussion thread which also has info regarding other ground truth datasets.

@amitdo
Collaborator

amitdo commented Jun 7, 2018

Which traineddata did you use?

@Shreeshrii
Collaborator

Shreeshrii commented Jun 7, 2018

I used tessdata_fast.

The scripts will need further modifications to take the traineddata directory as a parameter.

The timing results will certainly be different on different machines. I don't know whether accuracy will also change.

@Shreeshrii
Collaborator

I re-ran the tests for English with tessdata_fast, and the numbers are slightly different. The only difference in the scripts is conversion of the files to UTF-8 to allow for accented letters; for English, that is é.


Testid      Testset  Errors   Acc     Change  Errors  Acc     Change   Errors  Acc     Change   Time

1995        bus.3B     5959   98.14%   0.00%   1631   96.83%    0.00%   1293   95.73%    0.00%
1995        doe3.3B   36349   97.52%   0.00%   7826   96.34%    0.00%   7042   94.87%    0.00%
1995        mag.3B    15043   97.74%   0.00%   4566   96.01%    0.00%   3379   94.99%    0.00%
1995        news.3B    6432   98.69%   0.00%   1946   97.68%    0.00%   1502   96.94%    0.00%
4_fast_eng  bus.3B     6124   98.11%    2.77%   1138   97.88%  -30.23%   963   97.05%  -25.52%   3935.26s
4_fast_eng  doe3.3B   30029   97.96%  -17.39%  13781   94.45%   76.09% 13178   92.38%   87.13%  18847.36s
4_fast_eng  mag.3B    10934   98.37%  -27.32%   3343   97.15%  -26.78%  2813   96.06%  -16.75%   6867.14s
4_fast_eng  news.3B    5734   98.84%  -10.85%   1322   98.45%  -32.07%  1040   97.94%  -30.76%   5527.38s
4_fast_eng  Total     52821   -       -17.19%  19584   -        22.64% 17994   -        36.15%

@zdenop
Contributor

zdenop commented Sep 30, 2018

Anything else should be done here?
Test image data are at https://github.com/tesseract-ocr/test, and the unit tests are in https://github.com/tesseract-ocr/tesseract/tree/master/unittest.

@stweil
Member

stweil commented Oct 10, 2018

Anything else should be done here?

There remain some tests to be fixed:

baseapi_test
baseapi_thread_test
dawg_test
equationdetect_test
fileio_test
imagedata_test
lang_model_test
layout_test
ligature_table_test
lstm_recode_test
lstm_squashed_test
lstm_test
lstmtrainer_test
mastertrainer_test
networkio_test
normstrngs_test
pagesegmode_test
pango_font_info_test
paragraphs_test
params_model_test
qrsequence_test
recodebeam_test
resultiterator_test
scanutils_test
shapetable_test
stridemap_test
stringrenderer_test
tatweel_test
textlineprojection_test
unicharcompress_test
unicharset_test
unichar_test
validate_grapheme_test
validate_indic_test
validate_khmer_test
validate_myanmar_test

@Shreeshrii
Collaborator

See #1863 (comment)

Most of the above unit tests can now be built and pass. Thanks @stweil.

@jbreiden We are still missing some 'testdata', especially for the LSTM-related tests. Is it possible to get it, as well as the logs from the test run? I can make a list of needed files.

@stweil
Member

stweil commented Mar 8, 2019

OSS Fuzz is now supported.

@bertsky
Contributor

bertsky commented Mar 10, 2019

After going through the experience of undertaking the unit tests for a single contribution (#2294), I have a few suggestions here.

Foremost, I believe that the implicit dependency on the data repos should be made explicit (while still being optional): currently make check relies on the assumption that data files were put in directories above the source tree, which is surprising and cannot be changed. Since submodules are already used here, why not just make tessdata, tessdata_best, tessdata_fast, and langdata_lstm submodules of the test submodule?

Also, why not simply skip all tests that cannot be satisfied due to missing data repos (instead of failing with crash reports)?

Moreover, I suggest including make check and all it entails (deps and data repos) into the Travis CI configuration.

Lastly, IMHO the tesstutorial on training should be fully scripted, with explicit dependencies, incremental stages and automated result verification. The individual commands would have to be split up into recipes of makefile targets I guess.

@stweil
Member

stweil commented Jul 6, 2019

In the meantime we have more than 50 working unit tests. These tests are still missing:

pagesegmode_test – needs missing TIFF files
tatweel_test – needs files ara.* (from ara.traineddata?)

I have fixed the code for both, but they fail because of the missing files.

@Shreeshrii
Collaborator

Thanks @stweil.

Do you think that tatweel_test will work if we extract the files from tessdata_fast/ara.traineddata?

@stweil
Member

stweil commented Jul 6, 2019

The last subtest requires an old ara.unicharset to test backwards compatibility. I did not find one for that test, not even in the Git history of tessdata.

@stweil
Member

stweil commented Jul 7, 2019

tatweel_test is now in Git master and automatically skips the subtests with missing files. Maybe @jbreiden can provide ara.wordlist and ara.unicharset which are missing here.

@jbreiden
Contributor

jbreiden commented Jul 9, 2019 via email

@stweil
Member

stweil commented Jul 9, 2019

@jbreiden provided the ara.* files, and tatweel_test is fine now. With a little help from Jeff, I also managed to make pagesegmode_test work.

So there are now 60 working unit tests with more than 300 working subtests for Tesseract.

@Shreeshrii
Collaborator

Shreeshrii commented Jul 9, 2019

@stweil and @jbreiden Thank you for getting the unit tests working for tesseract.

@jbreiden Are there any regression tests that can be open sourced?

Also, the LSTM langdata for Arabic is the same as the 3.0x langdata - only 80 lines of training_text. Is the training data from Ray's last training run available?

@stweil It will be good to split the tests into legacy/LSTM. Then --disable-legacy could be used for testing only the LSTM tests.

@stweil
Member

stweil commented Dec 22, 2020

pango_font_info_test now no longer depends on TensorFlow, see PR #3189.

@Shreeshrii
Collaborator

@stweil Do you have a test suite for comparing performance + accuracy? It would be good to check the effect of recent changes.

@stweil
Member

stweil commented Jan 6, 2021

No, I don't, but I would like to have one too. I recently started comparing the time for lstm_test, which should be a good indicator of training performance, but also of recognition performance with tessdata_best models.

The recent changes should not affect accuracy, but of course that has to be tested.

@zdenop
Contributor

zdenop commented Jan 6, 2021

Maybe a good starting point regarding performance measurement is #263.


7 participants