
RFC: Situation with tests in Tesseract #1627

Closed
zamazan4ik opened this issue Jun 3, 2018 · 54 comments

Comments

@zamazan4ik
Contributor

zamazan4ik commented Jun 3, 2018

Hello.
I have some questions about the situation with tests in the Tesseract repo.

  1. I think we lack unit tests in the Tesseract repo. Here I see some tests, but too few in my opinion. Should we add more unit tests? Should we also write unit tests for the old engine, or only for the newer LSTM engine? We could also move the unit tests to CMake and integrate running them into Travis CI/AppVeyor.
  2. I am a little confused about how recognition quality is tested. How do we do it? I found this, but it seems to be outdated. Could we collect different images (only images with a suitable license, of course), prepare ground truth, and check Tesseract against that set for regressions? I think this is very important for an OCR engine. We could also integrate the regression tests into Travis CI/AppVeyor.
  3. We should test Tesseract with the Google sanitizers. For that we need some tests, a Tesseract build compiled with the sanitizers, and a way to run the tests against it. I think we would find some errors, and this would also help prevent mistakes in the future.
  4. (Hint) I suggest collecting the images from issues and adding them to our test set. I am trying to understand whether I can work on this for Tesseract. Would this work be welcome?
  5. As mentioned in "Use OSS-Fuzz for improved code quality" #1351, we should try to use OSS-Fuzz with Tesseract. At my last job we found a lot of problems with different Tesseract options (some of them led to crashes).

I suggest we discuss testing here.
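As a sketch of point 3, a sanitizer build could look like the following (the build directory name and the exact flags are assumptions, not an agreed project setting):

```shell
# Configure an AddressSanitizer + UBSan build of Tesseract with CMake,
# then run the unit tests against the instrumented binaries.
cmake -B build-asan \
      -DCMAKE_BUILD_TYPE=Debug \
      -DCMAKE_CXX_FLAGS="-fsanitize=address,undefined -fno-omit-frame-pointer"
cmake --build build-asan -j
ctest --test-dir build-asan --output-on-failure
```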

@jbreiden
Contributor

jbreiden commented Jun 4, 2018

Google has a bunch of tests that we should add to the repo. They will need some effort to get them to work there. Here's an example of one of them; note how we'll have to change things like ABSL_ARRAYSIZE:

#include "tesseract/ccstruct/statistc.h"
#include "tesseract/ccutil/genericvector.h"
#include "tesseract/ccutil/kdpair.h"

namespace {

const int kTestData[] = { 2, 0, 12, 1, 1, 2, 10, 1, 0, 0, 0, 2, 0, 4, 1, 1 };

class STATSTest : public testing::Test {
 public:
  void SetUp() {
    stats_.set_range(0, 16);
    for (int i = 0; i < ABSL_ARRAYSIZE(kTestData); ++i)
      stats_.add(i, kTestData[i]);
  }

  void TearDown() {
  }

  STATS stats_;
};

// Tests some basic numbers from the stats_.
TEST_F(STATSTest, BasicStats) {
  EXPECT_EQ(37, stats_.get_total());
  EXPECT_EQ(2, stats_.mode());
  EXPECT_EQ(12, stats_.pile_count(2));
}

// Tests the top_n_modes function.
TEST_F(STATSTest, TopNModes) {
  GenericVector<tesseract::KDPairInc<float, int> > modes;
  int num_modes = stats_.top_n_modes(3, &modes);
  EXPECT_EQ(3, num_modes);
  // Mode0 is 12 1 1 = 14 total count with a mean of 2 3/14.
  EXPECT_FLOAT_EQ(2.0f + 3.0f / 14, modes[0].key);
  EXPECT_EQ(14, modes[0].data);
  // Mode 1 is 2 10 1 = 13 total count with a mean of 5 12/13.
  EXPECT_FLOAT_EQ(5.0f + 12.0f / 13, modes[1].key);
  EXPECT_EQ(13, modes[1].data);
  // Mode 2 is 4 1 1 = 6 total count with a mean of 13.5.
  EXPECT_FLOAT_EQ(13.5f, modes[2].key);
  EXPECT_EQ(6, modes[2].data);
}

}  // namespace

@Shreeshrii
Collaborator

Shreeshrii commented Jun 4, 2018 via email

@jbreiden
Contributor

jbreiden commented Jun 4, 2018

Google has 56 files of tests for Tesseract. None of them will work as-is with the GitHub repo, but at least some could be adapted without too much effort. It might be a good starting point, especially for someone like @zamazan4ik who sounds excited about writing or improving tests.

@Shreeshrii
Collaborator

@jbreiden It will be great if you can add them to GitHub for @zamazan4ik to update. Thanks!

@stweil
Member

stweil commented Jun 4, 2018

@zamazan4ik, could you please add "RFC:" to the title of this issue ("RFC: Situation with tests in Tesseract")? That makes it clear that it is not a bug report.

@jbreiden, I also think that the available test code should be added to git, even if it is currently not integrated in the build process. Please add only text files (test code) to tesseract git. If there are also binaries (images, tessdata), they can be added to https://github.com/tesseract-ocr/test.

@zamazan4ik zamazan4ik changed the title Situation with tests in Tesseract RFC: Situation with tests in Tesseract Jun 4, 2018
@zamazan4ik
Contributor Author

@jbreiden Thank you for the information. It seems Google has unit tests; we can wait for them. But what about regression tests? Does Google have anything for this, or should we prepare images and ground truth ourselves?

@zamazan4ik
Contributor Author

@stweil If I have some images with/without ground truth for them, should I add them to https://github.com/tesseract-ocr/test? Are there any special requirements for test data?

@zamazan4ik
Contributor Author

I also want to clarify the situation with unit tests. Do we want to wait for the unit tests from Google, or start implementing our own?

@Shreeshrii
Collaborator

If you have any specific unit test in mind, please go ahead and implement it.

We can add the ones from Google as and when they are added to the repo and modified to work with the code in GitHub.

There was discussion in another thread about putting all binaries related to testing in a separate repo (test), which can be included as a submodule so that the tesseract repo does not become very large.

@zdenop has already created a new repo and the images used by current unittests should also be moved there.

Additionally, it might be possible to reduce some test file sizes for image files.

@zamazan4ik
Contributor Author

@Shreeshrii Has there been any discussion about measuring recognition quality between different Tesseract runs?

@stweil
Member

stweil commented Jun 4, 2018

I don't remember such discussions, but I think that measuring the quality (not only for text recognition, but also for layout recognition) should be part of the regression tests.

@Shreeshrii
Collaborator

No, that was not covered. Google may have these tests internally, since Ray puts statistics in his presentations, but nothing was mentioned in the context of the open-source code.

However, I think it is important to check for regressions, at least with some sample images to begin with.

The UNLV datasets cover only a limited set of languages. I would like us to be able to test each language and script, even if it is with a single one-page image.

That dataset might take some time to build, but if a framework for it can be set up, new language tests can be added as and when images and matching ground truth become available.

@Shreeshrii
Collaborator

For example the tests should catch cases like:

#682
LSTM: khmer is not working with --oem 1

@stweil
Member

stweil commented Jun 4, 2018

If I have some images with/without ground truth for them, should I add them to https://github.com/tesseract-ocr/test? Are there any special requirements for test data?

I would not use the tesseract-ocr repositories to collect all kinds of ground truth, but of course some examples are needed for the regression tests. We need them to measure the recognition error rate, and we need them if we have tests for training, too.

Maybe we could also start a Wiki page https://github.com/tesseract-ocr/tesseract/wiki/Ground-Truth to collect good sources of ground truth, like we collect information on fonts at https://github.com/tesseract-ocr/tesseract/wiki/Fonts.

@Shreeshrii
Collaborator

Shreeshrii commented Jun 4, 2018

That dataset might take some time to build,

Synthetic test data using a single font such as Noto Sans can be built from a single page of training text for each language. Since not all languages will have high accuracy, we can have an accuracy cutoff, or a parameter whose value can be set per language.

@stweil had also suggested at one point just loading all languages to make sure the traineddata files are valid and don't crash.

@Shreeshrii
Collaborator

we could also start a Wiki page https://github.com/tesseract-ocr/tesseract/wiki/Ground-Truth

That is a very good idea.

@zamazan4ik
Contributor Author

I am not sure we can easily find a good source of images and ground truth for Tesseract. But I can prepare some manually and publish them under an appropriate license.

There is also a large image collection here: https://github.com/renard314/textfairy/tree/master/test-images
But I am not sure whether we can use it for Tesseract.

@Shreeshrii
Collaborator

Shreeshrii commented Jun 4, 2018

Also see tesseract-ocr/tessdata_best#27 (comment),
which has accuracy reports for Khmer comparing the 4.0.0alpha file with a fine-tuned version.

@zamazan4ik
Contributor Author

Ok. I think I can start working on images and ground truth for them. I would also prefer to store images and other information needed for tests in the tesseract repos, because links can break.

Once we have some data, we should decide how to:

  1. Integrate it into the test workflow.
  2. Measure results between tesseract runs.

Do you have any ideas on how we can measure results? Should we use special tools, or can we use something simpler like Levenshtein distance per character/per word?

@zamazan4ik
Contributor Author

zamazan4ik commented Jun 4, 2018

Also, TextFairy's author has allowed the use of images from TextFairy for Tesseract testing: https://github.com/renard314/textfairy/tree/master/test-images

I think I can prepare ground truth for some of them.

@stweil
Member

stweil commented Jun 4, 2018

Storing binaries in Git results in very large repositories, so there are good reasons to keep the source repository small for builds without tests. Repositories with test data can be included as Git submodules, then there won't be problems with broken links. The tests only have to check whether the needed submodules are available and only run if they do.

@zamazan4ik
Contributor Author

@stweil Of course. There is no reason to store test scripts and test data in one repo. You are right.

@Shreeshrii
Collaborator

Shreeshrii commented Jun 5, 2018 via email

@zamazan4ik
Contributor Author

As a good starting point, we can upload these files to our test data repository: https://code.google.com/archive/p/isri-ocr-evaluation-tools/downloads

Inside there are a lot of already-binarized images with ground truth for every image. We can then run Tesseract on every file (parallelized with GNU Parallel for speed).
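A parallel run over such a dataset could be sketched like this (the file layout and output paths are hypothetical):

```shell
# Hypothetical layout: images/*.tif with matching ground-truth *.txt files.
# Run Tesseract on every image in parallel; {/.} is the input basename
# without extension, so each result lands in out/<page>.txt.
mkdir -p out
find images -name '*.tif' | parallel 'tesseract {} out/{/.} --oem 1'
```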

@Shreeshrii
Collaborator

Shreeshrii commented Jun 5, 2018 via email

@Shreeshrii
Collaborator

Also see https://github.com/tesseract-ocr/tesseract/wiki/UNLV-Testing-of-Tesseract#example-results

for comparison of results up to 3.04.01

@Shreeshrii
Collaborator

FYI, so that there is no duplication of efforts.

I am changing the current unit tests to use the test submodule, and updating the instructions and scripts for the UNLV tests.

@Shreeshrii
Collaborator

@stweil @zdenop Have you run the UNLV testsuite recently?

@zdenop
Contributor

zdenop commented Jun 6, 2018

I did not.

@Shreeshrii
Collaborator

Shreeshrii commented Jun 6, 2018

@stweil I am running into an error with the unlvtests. Need your help to fix.

The shell script at the end calls these two C programs, passing a list of space-separated filenames in $accfiles and $wafiles. I tried the two variations below; both fail with an error. Maybe the string needs to be split?

  unlvtests/ocreval/bin/accsum "$accfiles >unlvtests/reports/$setname.characc"
  unlvtests/ocreval/bin/wordaccsum "$wafiles" >"unlvtests/reports/$setname.wordacc"

The programs check for the following:

    initialize(&argc, argv, usage, NULL);
    if (argc < 2)
        error("not enough input files", Exit);

I am getting the error 'not enough input files' even though a long list is given. Here is the output from a test sample.

+ unlvtests/ocreval/bin/ocrevalutf8 unlvtests/ocreval/bin/accuracy /home/ubuntu/ISRI-OCRtk/bus.3B/0/8520_001.3B.txt unlvtests/results/bus.3B/8520_001.3B.unlv
+ accfiles=' unlvtests/results/bus.3B/8500_001.3B.acc unlvtests/results/bus.3B/8510_001.3B.acc unlvtests/results/bus.3B/8520_001.3B.acc'
+ unlvtests/ocreval/bin/ocrevalutf8 unlvtests/ocreval/bin/wordacc /home/ubuntu/ISRI-OCRtk/bus.3B/0/8520_001.3B.txt unlvtests/results/bus.3B/8520_001.3B.unlv
+ wafiles=' unlvtests/results/bus.3B/8500_001.3B.wa unlvtests/results/bus.3B/8510_001.3B.wa unlvtests/results/bus.3B/8520_001.3B.wa'
+ read page dir
+ unlvtests/ocreval/bin/accsum ' unlvtests/results/bus.3B/8500_001.3B.acc unlvtests/results/bus.3B/8510_001.3B.acc unlvtests/results/bus.3B/8520_001.3B.acc >unlvtests/reports/bus.3B.characc'
accsum: not enough input files
+ unlvtests/ocreval/bin/wordaccsum ' unlvtests/results/bus.3B/8500_001.3B.wa unlvtests/results/bus.3B/8510_001.3B.wa unlvtests/results/bus.3B/8520_001.3B.wa'
wordaccsum: not enough input files

@Shreeshrii
Collaborator

I removed the quote marks from the following and it seems to be working. (Quoting "$accfiles" passes the whole space-separated list as a single argument, so accsum sees argc < 2; in the first line the redirection was even inside the quotes and became part of that argument.)

unlvtests/ocreval/bin/accsum "$accfiles >unlvtests/reports/$setname.characc"
unlvtests/ocreval/bin/wordaccsum "$wafiles" >"unlvtests/reports/$setname.wordacc"
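For reference, the working form keeps the filename list unquoted (so the shell splits it into separate arguments) and the redirection outside any quotes:

```shell
unlvtests/ocreval/bin/accsum $accfiles >"unlvtests/reports/$setname.characc"
unlvtests/ocreval/bin/wordaccsum $wafiles >"unlvtests/reports/$setname.wordacc"
```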

@stweil
Member

stweil commented Jun 6, 2018

Does pull request #1640 fix the problem with the UNLV tests?

@Shreeshrii
Collaborator

Shreeshrii commented Jun 7, 2018

Thanks @stweil. I had removed all quotes from those two lines locally and the script was working. I will test with PR #1640 later today.

I am wondering, though, whether the problem is related to using a different version of bash, since the quotes were there in the original script too; I was mostly just changing the file paths.

Here are the results:

Testid      Testset  Character                Word                     Non-stopword             Time
                     Errors   Acc     Change  Errors  Acc     Change   Errors  Acc     Change

1995        bus.3B     5959   98.14%   0.00%   1631   96.83%    0.00%   1293   95.73%    0.00%
1995        doe3.3B   36349   97.52%   0.00%   7826   96.34%    0.00%   7042   94.87%    0.00%
1995        mag.3B    15043   97.74%   0.00%   4566   96.01%    0.00%   3379   94.99%    0.00%
1995        news.3B    6432   98.69%   0.00%   1946   97.68%    0.00%   1502   96.94%    0.00%

4.0.0-beta  bus.3B     6158   98.10%    3.34%   1136  97.88%  -30.35%    961   97.06%  -25.68%   4500.68s
4.0.0-beta  doe3.3B   29914   97.97%  -17.70%  13716  94.48%   75.26%  13113   92.42%   86.21%  19882.96s
4.0.0-beta  mag.3B    10946   98.37%  -27.24%   3337  97.16%  -26.92%   2807   96.07%  -16.93%   7322.79s
4.0.0-beta  news.3B    5678   98.85%  -11.72%   1308  98.46%  -32.79%   1033   97.96%  -31.23%   5651.85s
4.0.0-beta  Total     52696   -       -17.38%  19497  -        22.09%  17914   -        35.55%

3.02.02     bus.3B     6039   98.11%    1.34%   1541  97.01%   -5.52%   1240   95.90%   -4.10%
3.02.02     doe3.3B   35988   97.54%   -0.99%   8482  96.03%    8.38%   7640   94.43%    8.49%
3.02.02     mag.3B    14367   97.84%   -4.49%   3891  96.60%  -14.78%   3024   95.52%  -10.51%
3.02.02     news.3B    7148   98.55%   11.13%   1484  98.23%  -23.74%   1152   97.65%  -23.30%
3.02.02     Total     63542   -        -0.38%  15398  -        -3.58%  13056   -        -1.21%

I also included 3.02.02 results from https://github.com/tesseract-ocr/tesseract/wiki/UNLV-Testing-of-Tesseract - these were originally reported in the tesseract-ocr forum.

@Shreeshrii
Collaborator

Shreeshrii commented Jun 7, 2018

The original post by Tom Morris was in tesseract-dev group (not tesseract-ocr)

Here is the link to the discussion thread which also has info regarding other ground truth datasets.

@amitdo
Collaborator

amitdo commented Jun 7, 2018

Which traineddata did you use?

@Shreeshrii
Collaborator

Shreeshrii commented Jun 7, 2018

I used tessdata_fast.

The scripts will need further modifications to take the traineddata directory as a parameter.

The timing results will certainly be different on different machines. I don't know whether accuracy will also change.

@Shreeshrii
Collaborator

I re-ran the tests for English with tessdata_fast, and the numbers are slightly different. The only difference in the scripts is conversion of the files to UTF-8 to allow for accented letters; for English, that is é.


Testid      Testset  Errors   Acc     Change  Errors  Acc     Change   Errors  Acc     Change   Time

1995        bus.3B     5959   98.14%   0.00%   1631   96.83%    0.00%   1293   95.73%    0.00%
1995        doe3.3B   36349   97.52%   0.00%   7826   96.34%    0.00%   7042   94.87%    0.00%
1995        mag.3B    15043   97.74%   0.00%   4566   96.01%    0.00%   3379   94.99%    0.00%
1995        news.3B    6432   98.69%   0.00%   1946   97.68%    0.00%   1502   96.94%    0.00%
4_fast_eng  bus.3B     6124   98.11%    2.77%   1138   97.88%  -30.23%   963   97.05%  -25.52%   3935.26s
4_fast_eng  doe3.3B   30029   97.96%  -17.39%  13781   94.45%   76.09% 13178   92.38%   87.13%  18847.36s
4_fast_eng  mag.3B    10934   98.37%  -27.32%   3343   97.15%  -26.78%  2813   96.06%  -16.75%   6867.14s
4_fast_eng  news.3B    5734   98.84%  -10.85%   1322   98.45%  -32.07%  1040   97.94%  -30.76%   5527.38s
4_fast_eng  Total     52821   -       -17.19%  19584   -        22.64% 17994   -        36.15%

@zdenop
Contributor

zdenop commented Sep 30, 2018

Anything else should be done here?
Test image data are at https://github.com/tesseract-ocr/test, and the unit tests are in https://github.com/tesseract-ocr/tesseract/tree/master/unittest.

@stweil
Member

stweil commented Oct 10, 2018

Anything else should be done here?

There remain some tests to be fixed:

baseapi_test
baseapi_thread_test
dawg_test
equationdetect_test
fileio_test
imagedata_test
lang_model_test
layout_test
ligature_table_test
lstm_recode_test
lstm_squashed_test
lstm_test
lstmtrainer_test
mastertrainer_test
networkio_test
normstrngs_test
pagesegmode_test
pango_font_info_test
paragraphs_test
params_model_test
qrsequence_test
recodebeam_test
resultiterator_test
scanutils_test
shapetable_test
stridemap_test
stringrenderer_test
tatweel_test
textlineprojection_test
unicharcompress_test
unicharset_test
unichar_test
validate_grapheme_test
validate_indic_test
validate_khmer_test
validate_myanmar_test

@Shreeshrii
Collaborator

See #1863 (comment)

Most of the above unit tests can now be built and pass. Thanks @stweil.

@jbreiden We are still missing some 'testdata', especially for the LSTM-related tests. Is it possible to get it, as well as the logs from the test run? I can make a list of needed files.

@stweil
Member

stweil commented Mar 8, 2019

OSS Fuzz is now supported.

@bertsky
Contributor

bertsky commented Mar 10, 2019

After going through the experience of undertaking the unit tests for a single contribution (#2294), I have a few suggestions here.

Foremost, I believe that the implicit dependency on the data repos should be made explicit (while still being optional): currently make check relies on the assumption that data files were put in directories above the source tree, which is surprising and cannot be changed. Since submodules are already used here, why not just make tessdata, tessdata_best, tessdata_fast, and langdata_lstm submodules of the test submodule?

Also, why not simply skip all tests that cannot be satisfied due to missing data repos (instead of failing with crash reports)?

Moreover, I suggest including make check and all it entails (deps and data repos) into the Travis CI configuration.

Lastly, IMHO the tesstutorial on training should be fully scripted, with explicit dependencies, incremental stages and automated result verification. The individual commands would have to be split up into recipes of makefile targets I guess.

@stweil
Member

stweil commented Jul 6, 2019

In the meantime we have more than 50 working unit tests. These tests are still missing:

pagesegmode_test – needs missing TIFF files
tatweel_test – needs files ara.* (from ara.traineddata?)

I have fixed the code for both, but they fail because of the missing files.

@Shreeshrii
Collaborator

Thanks @stweil.

Do you think that tatweel_test will work if we extract the files from tessdata_fast/ara.traineddata?

@stweil
Member

stweil commented Jul 6, 2019

The last subtest requires an old ara.unicharset to test backwards compatibility. I did not find one for that test, not even in the Git history of tessdata.

@stweil
Member

stweil commented Jul 7, 2019

tatweel_test is now in Git master and automatically skips the subtests with missing files. Maybe @jbreiden can provide ara.wordlist and ara.unicharset which are missing here.

@jbreiden
Contributor

jbreiden commented Jul 9, 2019 via email

@stweil
Member

stweil commented Jul 9, 2019

@jbreiden provided the ara.* files, and tatweel_test is fine now. With a little help from Jeff, I also managed to make pagesegmode_test work.

So there are now 60 working unit tests with more than 300 working subtests for Tesseract.

@Shreeshrii
Collaborator

Shreeshrii commented Jul 9, 2019

@stweil and @jbreiden Thank you for getting the unit tests working for tesseract.

@jbreiden Are there any regression tests that can be open sourced?

Also, the LSTM langdata for Arabic is the same as the 3.0x langdata - only 80 lines of training_text. Is the training data from Ray's last training run available?

@stweil It will be good to split the tests into legacy/LSTM. Then --disable-legacy could be used for testing only the LSTM tests.

@stweil
Member

stweil commented Dec 22, 2020

pango_font_info_test now no longer depends on TensorFlow, see PR #3189.

@Shreeshrii
Collaborator

@stweil Do you have a test suite for comparing performance + accuracy? It would be good to check the effect of recent changes.

@stweil
Member

stweil commented Jan 6, 2021

No, I don't, but I would like to have one too. I recently started comparing the time for lstm_test, which should be a good indicator of training performance, but also of recognition performance with tessdata_best models.

The recent changes should not affect accuracy, but of course that has to be tested.

@zdenop
Contributor

zdenop commented Jan 6, 2021

Maybe a good starting point regarding performance measurement is #263.


7 participants