RFC: Situation with tests in Tesseract #1627
Comments
Google has a bunch of tests that we should add to the repo. They will need some effort to get them to work there. Here's an example of one of them; note how we'll have to change stuff like:

    #include "tesseract/ccstruct/statistc.h"
    #include "tesseract/ccutil/genericvector.h"
    #include "tesseract/ccutil/kdpair.h"

    namespace {

    const int kTestData[] = { 2, 0, 12, 1, 1, 2, 10, 1, 0, 0, 0, 2, 0, 4, 1, 1 };

    class STATSTest : public testing::Test {
     public:
      void SetUp() {
        stats_.set_range(0, 16);
        for (int i = 0; i < ABSL_ARRAYSIZE(kTestData); ++i)
          stats_.add(i, kTestData[i]);
      }

      void TearDown() {
      }

      STATS stats_;
    };

    // Tests some basic numbers from the stats_.
    TEST_F(STATSTest, BasicStats) {
      EXPECT_EQ(37, stats_.get_total());
      EXPECT_EQ(2, stats_.mode());
      EXPECT_EQ(12, stats_.pile_count(2));
    }

    // Tests the top_n_modes function.
    TEST_F(STATSTest, TopNModes) {
      GenericVector<tesseract::KDPairInc<float, int> > modes;
      int num_modes = stats_.top_n_modes(3, &modes);
      EXPECT_EQ(3, num_modes);
      // Mode 0 is 12 1 1 = 14 total count with a mean of 2 3/14.
      EXPECT_FLOAT_EQ(2.0f + 3.0f / 14, modes[0].key);
      EXPECT_EQ(14, modes[0].data);
      // Mode 1 is 2 10 1 = 13 total count with a mean of 5 12/13.
      EXPECT_FLOAT_EQ(5.0f + 12.0f / 13, modes[1].key);
      EXPECT_EQ(13, modes[1].data);
      // Mode 2 is 4 1 1 = 6 total count with a mean of 13.5.
      EXPECT_FLOAT_EQ(13.5f, modes[2].key);
      EXPECT_EQ(6, modes[2].data);
    }

    }  // namespace
|
Jeff,
Ray had started transferring some tests but hit a roadblock with one that included some file I/O. We couldn't get it to build, as the tesseract repo is missing some libraries used at Google. The source is there in the unittests folder, but it is not included in the makefile.
|
Google has 56 files of tests for Tesseract. None of them will work as-is with the GitHub repo, but at least some could be adapted without too much effort. Might be a good starting point, especially for someone like @zamazan4ik who sounds excited about writing or improving tests. |
@jbreiden It will be great if you can add them to GitHub for @zamazan4ik to update. Thanks! |
@zamazan4ik, could you please add "RFC:" to the title of this issue ("RFC: Situation with tests in Tesseract")? That makes it clear that it is not a bug report. @jbreiden, I also think that the available test code should be added to git, even if it is currently not integrated in the build process. Please add only text files (test code) to tesseract git. If there are also binaries (images, tessdata), they can be added to https://github.com/tesseract-ocr/test. |
@jbreiden Thank you for the information. It seems Google has unit tests. OK, we can wait for them. But what about regression tests? Does Google have anything for this, or should we prepare images and ground truth ourselves? |
@stweil If I have some images with/without ground truth for them, should I add them to https://github.com/tesseract-ocr/test? Are there any special requirements for test data? |
I also want to clarify the situation with unit tests. Do we want to wait for the unit tests from Google or start implementing our own? |
If you have any specific unit test in mind, please go ahead and implement it. We can add the ones from Google as and when they are added to the repo and modified to work with the code in GitHub. There was a discussion in another thread about putting all binaries related to testing in a separate repo (test), which can be included as a submodule so that the tesseract repo does not become very large. @zdenop has already created a new repo, and the images used by the current unittests should also be moved there. Additionally, it might be possible to reduce the file sizes of some test images. |
@Shreeshrii Was there any discussion about measuring recognition quality between different Tesseract runs? |
I don't remember such discussions, but I think that measuring the quality (not only for text recognition, but also for layout recognition) should be part of the regression tests. |
No, that was not covered. Google may have these internal tests, since Ray puts statistics in his presentations, but nothing was mentioned in the context of the open source code. However, I think it is important to check for regressions, at least with some sample images to begin with. The UNLV datasets cover only a limited set of languages. I would like us to be able to test each language and script, even if it is with a single one-page image. That dataset might take some time to build, but if a framework for it can be set up, new language tests can be added as and when the image and matching ground truth become available. |
For example the tests should catch cases like: #682 |
I would not use the tesseract-ocr repositories to collect all kinds of ground truth, but of course some examples are needed for the regression tests. We need them to measure the recognition error rate, and we need them if we have tests for training, too. Maybe we could also start a Wiki page https://github.com/tesseract-ocr/tesseract/wiki/Ground-Truth to collect good sources of ground truth, like we collect information on fonts at https://github.com/tesseract-ocr/tesseract/wiki/Fonts. |
Synthetic test data using a single font such as Noto Sans can be built using a single page of training text for all languages. Since all languages will not have high accuracy, we can have a cutoff for accuracy or have a parameter whose value can be set based on language. @stweil had also suggested at one point, to just load all languages to make sure traineddata files are valid and don't crash. |
That is a very good idea. |
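The per-language accuracy cutoff suggested above could look roughly like the following. This is only a sketch: `AccuracyPolicy`, its members, and any threshold values are hypothetical names chosen for illustration and are not part of Tesseract.

```cpp
#include <map>
#include <string>

// Hypothetical policy: a default accuracy cutoff plus per-language overrides.
// Accuracy is expressed as a fraction in the range [0, 1].
struct AccuracyPolicy {
  double default_min_accuracy;
  std::map<std::string, double> per_language_min;

  // Returns the override for `lang` if one exists, else the default cutoff.
  double MinAccuracyFor(const std::string& lang) const {
    const auto it = per_language_min.find(lang);
    return it != per_language_min.end() ? it->second : default_min_accuracy;
  }

  // A regression test for `lang` passes if the measured accuracy
  // reaches the cutoff that applies to that language.
  bool Passes(const std::string& lang, double measured_accuracy) const {
    return measured_accuracy >= MinAccuracyFor(lang);
  }
};
```

A test runner could construct one policy, for example `AccuracyPolicy policy{0.90, {{"ara", 0.70}}};`, and call `policy.Passes(lang, accuracy)` for each language, so that languages with known lower accuracy do not fail the whole suite.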
I am not sure that we can easily find a good source of images and ground truth for Tesseract. But I can prepare some of them manually and publish them under an appropriate license. There is also a large image collection here: https://github.com/renard314/textfairy/tree/master/test-images |
Also see tesseract-ocr/tessdata_best#27 (comment) |
OK. I think I can start work on images and ground truth for them. I also prefer to store images and other information needed for tests in the tesseract repos, because links can break. After getting some information we should decide how to:
Do you have any ideas on how we can measure results? Should we use some special tools, or can we use something simpler like Levenshtein distance per character or per word? |
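As a minimal sketch of the simpler option, a character error rate can be computed as the Levenshtein distance between the OCR output and the ground truth, divided by the ground-truth length. The function names below are illustrative, not an existing Tesseract API, and the code compares bytes, so UTF-8 text would first need to be decoded to code points for the rate to be per character.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein (edit) distance between two byte strings: the minimum number
// of single-byte insertions, deletions, and substitutions turning a into b.
// Uses two rolling rows instead of the full dynamic-programming table.
std::size_t LevenshteinDistance(const std::string& a, const std::string& b) {
  std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
  for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
  for (std::size_t i = 1; i <= a.size(); ++i) {
    curr[0] = i;
    for (std::size_t j = 1; j <= b.size(); ++j) {
      const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
      curr[j] = std::min({prev[j] + 1,          // deletion
                          curr[j - 1] + 1,      // insertion
                          prev[j - 1] + cost}); // substitution or match
    }
    std::swap(prev, curr);
  }
  return prev[b.size()];
}

// Error rate: edit distance divided by the ground-truth length.
// An empty ground truth yields 0 for empty output and 1 otherwise.
double CharacterErrorRate(const std::string& ground_truth,
                          const std::string& ocr_output) {
  if (ground_truth.empty()) return ocr_output.empty() ? 0.0 : 1.0;
  return static_cast<double>(LevenshteinDistance(ground_truth, ocr_output)) /
         static_cast<double>(ground_truth.size());
}
```

The same distance function works per word if both texts are first split into word tokens and compared as sequences; the rate can exceed 1 when the OCR output is much longer than the ground truth.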
Also, TextFairy's author has allowed the use of images from TextFairy for Tesseract testing: https://github.com/renard314/textfairy/tree/master/test-images I think I can prepare ground truth for some of them. |
Storing binaries in Git results in very large repositories, so there are good reasons to keep the source repository small for builds without tests. Repositories with test data can be included as Git submodules, then there won't be problems with broken links. The tests only have to check whether the needed submodules are available and only run if they do. |
@stweil Of course. There is no reason to store test scripts and test data in one repo. You are right |
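The idea of running tests only when the test data submodule is present could be sketched as below. The helper names and the directory layout are assumptions made for illustration; they are not existing Tesseract code.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Returns true if the file at `path` can be opened for reading.
bool FileIsReadable(const std::string& path) {
  std::ifstream file(path.c_str(), std::ios::binary);
  return file.good();
}

// Returns the entries of `required_files` that are not readable under
// `data_dir` (for example, the checkout directory of a test submodule).
// An empty result means all required test data is available.
std::vector<std::string> MissingTestData(
    const std::string& data_dir,
    const std::vector<std::string>& required_files) {
  std::vector<std::string> missing;
  for (const std::string& name : required_files) {
    if (!FileIsReadable(data_dir + "/" + name)) missing.push_back(name);
  }
  return missing;
}
```

A test could call `MissingTestData` at the start and return early, or skip itself if the test framework in use offers a skip mechanism, when the result is not empty. That way a checkout without the submodule reports skipped tests rather than failures.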
I have uploaded the test binaries to tesseract-ocr/test and have created a PR #1632 to add it as a submodule in tesseract.
@stweil what would be the best method for updating the submodule in tesseract as more test images are added to the `test` repo?
|
As a good starting point, we can upload these files to our test data repository: https://code.google.com/archive/p/isri-ocr-evaluation-tools/downloads They contain many already binarized images, with ground truth for every image. After that we can run Tesseract on every file (we can parallelize it with GNU Parallel for speedup). |
Please see https://github.com/tesseract-ocr/tesseract/tree/master/testing The scripts there need to be updated to use this newer location.
@zdenop also has a copy of the files on SourceForge: https://sourceforge.net/projects/isri-ocr-evaluation-tools-alt/
|
Also see https://github.com/tesseract-ocr/tesseract/wiki/UNLV-Testing-of-Tesseract#example-results for comparison of results up to 3.04.01 |
FYI, so that there is no duplication of efforts. I am changing the current unittests to use the test submodule. Also updating the instructions and scripts for the UNLV tests. |
I did not. |
@stweil I am running into an error with the unlvtests and need your help to fix it. The batch file calls these two C programs at the end. The batch is passing a list of space-separated filenames in $accfiles and $wafiles. I tried the two variations below; both produce an error. Maybe the string needs to be parsed?
The programs check for the following:
I am getting the error 'not enough input files' even though a long list is given. Here is the output from a test sample.
|
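I removed the quote marks from the following and it seems to be working. With the quotes, the shell passes the whole string, including the redirection, to accsum as a single argument. unlvtests/ocreval/bin/accsum "$accfiles >unlvtests/reports/$setname.characc" |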
Does pull request #1640 fix the problem with the UNLV tests? |
Thanks @stweil. I had removed all quotes from those two lines locally and the script was working. I will test with PR #1640 later today. I am wondering though whether the problem is related to using a different version of bash, since the quotes were there in original script too - I was mostly making changes for new path for the files. Here are the results:
I also included 3.02.02 results from https://github.com/tesseract-ocr/tesseract/wiki/UNLV-Testing-of-Tesseract - these were originally reported in the tesseract-ocr forum. |
The original post by Tom Morris was in the tesseract-dev group (not tesseract-ocr). Here is the link to the discussion thread, which also has info regarding other ground truth datasets. |
Which traineddata did you use? |
I used tessdata_fast. The scripts will need further modifications to take training directory as parameter. The timing results will certainly be different on different machines. I don't know whether accuracy will also change. |
I re-ran the tests for English with tessdata_fast and the numbers are slightly different. The only difference in scripts is conversion of the files to UTF-8 to allow for accented letters. For English, it is é.
|
Is there anything else that should be done here? |
There remain some tests to be fixed:
|
See #1863 (comment) Most of the above unittests can be built and pass. Thanks @stweil. @jbreiden We are still missing some 'testdata', specially for the lstm related tests. Is it possible to get it, as well as the logs from the test run? I can make a list of needed files. |
OSS Fuzz is now supported. |
After going through the experience of undertaking the unit tests for a single contribution (#2294), I have a few suggestions here. Foremost, I believe that the implicit dependency on the data repos should be made explicit (while still being optional): Currently Also, why not simply skip all tests that cannot be satisfied due to missing data repos (instead of failing with crash reports)? Moreover, I suggest including Lastly, IMHO the tesstutorial on training should be fully scripted, with explicit dependencies, incremental stages and automated result verification. The individual commands would have to be split up into recipes of makefile targets I guess. |
In the meantime we have more than 50 working unit tests. These tests are still missing:
I have fixed the code for both, but they fail because of the missing files. |
Thanks @stweil. Do you think that tatweel_test will work if we extract the files from tessdata_fast/ara.traineddata? |
The last subtest requires an old ara.unicharset to test backwards compatibility. I did not find one for that test, not even in the Git history of tessdata. |
tatweel_test is now in Git master and automatically skips the subtests with missing files. Maybe @jbreiden can provide |
@jbreiden provided the ara.* files, and So there are now 60 working unit tests with more than 300 working subtests for Tesseract. |
@stweil and @jbreiden Thank you for getting the unit tests working for Tesseract. @jbreiden Are there any regression tests that can be open sourced? Also, the LSTM langdata for Arabic is the same as the 3.0x langdata: only 80 lines of training_text. Is the training data from Ray's last training run available? @stweil It would be good to split the tests into legacy/LSTM. Then --disable-legacy could be used for testing only the LSTM tests. |
|
@stweil Do you have a test suite for comparing performance + accuracy? It would be good to check the effect of recent changes. |
No, I don't, but would like to have one, too. I recently started comparing the time for The recent changes should not affect accuracy. That has to be tested of course. |
Maybe a good starting point regarding performance measurement is #263
Hello.
I have some questions about the situation with tests in the Tesseract repo.
I am trying to understand whether I can work on this for Tesseract. Would this work be welcome?
I suggest we discuss testing here.