Support comparing line GT directories with line OCR directories #64

Closed
mikegerber opened this issue Dec 10, 2021 · 22 comments

@mikegerber
Member

mikegerber commented Dec 10, 2021

In #62, @stweil's original problem was, as I understand it, to compare a directory of line GT text files with a directory of line OCR text files. For now I've created fake test data to implement this: fake-line-gt.zip. It looks like this:

% ls *
gt:
line001.gt.txt  line003.gt.txt  line005.gt.txt  line007.gt.txt  line009.gt.txt  line011.gt.txt
line002.gt.txt  line004.gt.txt  line006.gt.txt  line008.gt.txt  line010.gt.txt

some-ocr:
line001.some-ocr.txt  line003.some-ocr.txt  line005.some-ocr.txt  line007.some-ocr.txt  line009.some-ocr.txt  line011.some-ocr.txt
line002.some-ocr.txt  line004.some-ocr.txt  line006.some-ocr.txt  line008.some-ocr.txt  line010.some-ocr.txt

A first implementation should compare the text of pairs of files (matched by filename) and produce a report of metrics & differences over all of the lines. First idea for the CLI interface:

dinglehopper-lines gt/ --gt-suffix .gt.txt some-ocr/ --ocr-suffix .some-ocr.txt

I'm not sure if this will be the final CLI interface, but it's what seems necessary at first glance.
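The matching-by-filename idea could be sketched roughly as follows. This is only an illustration, not the dinglehopper implementation; the function name `pair_line_files` and its return shape are assumptions made for this example.

```python
from pathlib import Path


def pair_line_files(gt_dir, ocr_dir, gt_suffix, ocr_suffix):
    """Pair GT and OCR files whose names match once the suffix is removed.

    Hypothetical helper: returns a sorted list of (line_id, gt_path, ocr_path).
    """

    def by_line_id(directory, suffix):
        # Map e.g. "line001" -> Path("gt/line001.gt.txt") for matching files.
        return {
            p.name[: len(p.name) - len(suffix)]: p
            for p in Path(directory).iterdir()
            if p.is_file() and p.name.endswith(suffix)
        }

    gt_files = by_line_id(gt_dir, gt_suffix)
    ocr_files = by_line_id(ocr_dir, ocr_suffix)
    # Only line ids present in both directories form a pair.
    common = sorted(gt_files.keys() & ocr_files.keys())
    return [(line_id, gt_files[line_id], ocr_files[line_id]) for line_id in common]
```

With the fake data above, `pair_line_files("gt", "some-ocr", ".gt.txt", ".some-ocr.txt")` would pair `line001.gt.txt` with `line001.some-ocr.txt`, and so on. Files without a partner are silently skipped in this sketch; a real tool would probably want to report them.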

@mikegerber mikegerber self-assigned this Dec 10, 2021
@mikegerber mikegerber added the enhancement New feature or request label Dec 10, 2021
@stweil
Contributor

stweil commented Dec 10, 2021

What about an even simpler interface:

dinglehopper [OPTIONS] GTDIR OCRDIR [REPORT_PREFIX]

The existing dinglehopper could be extended to accept directory names for its GT and OCR arguments and then either strip all extensions when matching ground truth and OCR lines by default or use new optional --gt-suffix and --ocr-suffix options.
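For illustration, the proposed argument layout could look like this as a parser. This is a sketch only: dinglehopper itself uses click (as the traceback further down shows), and the standard-library `argparse` is swapped in here just to keep the example self-contained. The default report prefix `"report"` and the `None` defaults for the suffixes are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser for: dinglehopper [OPTIONS] GTDIR OCRDIR [REPORT_PREFIX]."""
    parser = argparse.ArgumentParser(prog="dinglehopper")
    parser.add_argument("gt", metavar="GTDIR", help="directory with line GT text files")
    parser.add_argument("ocr", metavar="OCRDIR", help="directory with line OCR text files")
    # Optional third positional; the default value is an assumption.
    parser.add_argument("report_prefix", metavar="REPORT_PREFIX", nargs="?", default="report")
    # None stands for "not given", leaving room for a default matching strategy.
    parser.add_argument("--gt-suffix", default=None)
    parser.add_argument("--ocr-suffix", default=None)
    return parser
```

Calling `build_parser().parse_args(["gt", "some-ocr", "--gt-suffix", ".gt.txt"])` yields a namespace with `gt`, `ocr`, `report_prefix`, `gt_suffix` and `ocr_suffix` attributes.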

@mikegerber
Member Author

What about an even simpler interface:

dinglehopper [OPTIONS] GTDIR OCRDIR [REPORT_PREFIX]

The existing dinglehopper could be extended to accept directory names for its GT and OCR arguments

For now, and until the interface is finalized, I'd like to keep the CLI interface separate; it will share the code anyway.

and then either strip all extensions when matching ground truth and OCR lines by default or use new optional --gt-suffix and --ocr-suffix options.

For the stripping of all extensions to work we would need to assume that the common prefix for a pair does not contain a dot, and the explicit suffix options seemed saner.

But I think I'll start implementing this, CLI details can still be refined later.
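The concern about dots in the common prefix can be illustrated with a small sketch. The file names and both helper functions are made up for this example; they only show why "strip everything after the first dot" is ambiguous while an explicit suffix is not.

```python
def strip_all_extensions(name):
    # Everything from the first dot onward is treated as "extension".
    return name.split(".", 1)[0]


def strip_suffix(name, suffix):
    # Remove one known, explicit suffix; fail loudly if it is absent.
    if not name.endswith(suffix):
        raise ValueError(f"{name!r} does not end with {suffix!r}")
    return name[: len(name) - len(suffix)]


# Hypothetical names whose line id itself contains a dot.
names = ["page1.line001.gt.txt", "page1.line002.gt.txt"]
print([strip_all_extensions(n) for n in names])     # both collapse to 'page1'
print([strip_suffix(n, ".gt.txt") for n in names])  # ids stay distinct
```

With extension stripping, two different lines end up with the same key, so they can no longer be paired reliably; with the explicit suffix the keys remain `page1.line001` and `page1.line002`.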

@mikegerber
Member Author

For the stripping of all extensions to work we would need to assume that the common prefix for a pair does not contain a dot, and the explicit suffix options seemed saner.

They will default to something useful: the longest common suffix, i.e.

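import itertools


def all_equal(iterable):
    # True if all elements are equal (an empty iterable also counts as equal).
    g = itertools.groupby(iterable)
    return next(g, True) and not next(g, False)


def common_prefix(its):
    # Walk the sequences in lockstep and keep elements while all of them agree.
    return [p[0] for p in itertools.takewhile(all_equal, zip(*its))]


def common_suffix(its):
    # The common suffix is the reversed common prefix of the reversed sequences.
    return reversed(common_prefix(reversed(it) for it in its))


#print("".join(common_prefix(["line001.gt.txt", "line02.gt.txt", "line3.gt.txt"])))
print("".join(common_suffix(["line001.gt.txt", "line02.gt.txt", "line3.gt.txt"])))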

(gives .gt.txt)
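For plain strings, the same result can also be obtained with the standard library's `os.path.commonprefix`, which compares character by character rather than by path component. The following is a sketch of that alternative, not the code used in dinglehopper; `common_suffix_str` and `split_ids` are names made up here.

```python
import os.path


def common_suffix_str(names):
    """Longest common suffix of a list of strings ('' for an empty list)."""
    # Reverse each name, take the character-wise common prefix, reverse back.
    reversed_names = [name[::-1] for name in names]
    return os.path.commonprefix(reversed_names)[::-1]


def split_ids(names):
    """Return the detected suffix and the names with that suffix removed."""
    suffix = common_suffix_str(names)
    return suffix, [name[: len(name) - len(suffix)] for name in names]
```

One caveat of any auto-detection of this kind: if all line ids happen to end in the same character (for example only `line010.gt.txt` and `line020.gt.txt` are present), the detected suffix becomes `0.gt.txt` and swallows part of the id. The ids within one directory stay distinct, but they may no longer match the ids derived from the other directory.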

@mikegerber
Member Author

mikegerber commented Dec 13, 2021

dinglehopper-line-dirs gt some-ocr from the feat/compare-line-texts branch now compares the line texts from the gt and some-ocr directories. It auto-detects the file suffixes. It's WIP, but only WER and word differences are missing.

@stweil Could you test if this works for you?

image

@mikegerber
Member Author

The lines also line up perfectly, because each pair is put into its own <div class="row">!
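Why one row per pair keeps the lines aligned can be seen in a minimal rendering sketch. This is not the dinglehopper report template; the `col` class and the function name `render_rows` are assumptions for illustration.

```python
from html import escape


def render_rows(pairs):
    """Render (gt_text, ocr_text) pairs as one <div class="row"> per pair."""
    rows = []
    for gt_text, ocr_text in pairs:
        # Both texts of a pair live in the same row, so they share one
        # vertical position no matter how long either line is.
        rows.append(
            '<div class="row">'
            f'<div class="col">{escape(gt_text)}</div>'
            f'<div class="col">{escape(ocr_text)}</div>'
            "</div>"
        )
    return "\n".join(rows)
```

Putting all GT lines into one column and all OCR lines into another would instead let a single wrapped line shift everything below it.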

@stweil
Contributor

stweil commented Dec 13, 2021

My first test fails:

dinglehopper-line-dirs gt frak2021_1.069 frak2021_1.069
free(): invalid next size (fast)
Aborted

@stweil
Contributor

stweil commented Dec 13, 2021

The crash happens in rapidfuzz-1.9.0-py3.9-linux-x86_64.egg/rapidfuzz/cpp_string_metric.cpython-39-x86_64-linux-gnu.so.

@stweil
Contributor

stweil commented Dec 13, 2021

@maxbachmann, I now tried to debug the RapidFuzz code, but pip install . fails:

 src/cpp_common.hpp:4:10: fatal error: rapidfuzz/fuzz.hpp: No such file or directory

@mikegerber
Member Author

I can't reproduce with Python 3.9 and rapidfuzz-1.9.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl. Hmm. Could we have a look at the data (or a portion of it) that triggers this?

@maxbachmann
Contributor

@stweil did you clone the repository including submodules?

git clone --recursive git@github.com:maxbachmann/RapidFuzz.git

As @mikegerber mentioned it would help if you could provide me with some data to reproduce this.

@stweil
Contributor

stweil commented Dec 13, 2021

Minimal single line test case (found by bisecting the original large test set):

mkdir a b
echo "Vorjahres.“ (24 % gegenüber 42 %. Daneben auch Anſtiege um 11 %, 22 %, 34 %," >a/demo.txt
echo "PVorſahres.“ (24 0% gegenüber 42 95, Daneben auch Anſtiege um 11 % 22 % 34" >b/demo.txt
dinglehopper-line-dirs a b c
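When a native edit distance implementation is suspected of misbehaving, a slow pure-Python reference can serve as a cross-check on the same input. The sketch below is a textbook Levenshtein distance over raw code points; it is not RapidFuzz's algorithm, and it skips the Unicode normalization and grapheme handling a real OCR evaluation may apply, so its numbers need not match dinglehopper's.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete item_a
                    current[j - 1] + 1,      # insert item_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


# The two lines from the minimal test case above.
gt = "Vorjahres.“ (24 % gegenüber 42 %. Daneben auch Anſtiege um 11 %, 22 %, 34 %,"
ocr = "PVorſahres.“ (24 0% gegenüber 42 95, Daneben auch Anſtiege um 11 % 22 % 34"
print(levenshtein(gt, ocr))
```

Because the function works on any sequences, passing lists of words instead of strings gives a word-level distance.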

@stweil
Contributor

stweil commented Dec 13, 2021

did you clone the repository including submodules?

No, I did not. The installation works after git submodule update --init. I suggest adding that information to the instructions in the README.

@maxbachmann
Contributor

Minimal single line test case (found by bisecting the original large test set):

Thanks, I could reproduce the crash. I will look into it.

@maxbachmann
Contributor

maxbachmann commented Dec 13, 2021

Ouch, I had a typo in the edit distance calculation: rapidfuzz/rapidfuzz-cpp@103674d
I am honestly surprised that this never crashed on the input of a fuzz testing tool ...

I released a new version of RapidFuzz with the fix: https://github.com/maxbachmann/RapidFuzz/releases/tag/v1.9.1

@mikegerber
Member Author

Great that this bug is fixed. I've bumped the rapidfuzz dependency to >=1.9.1!

@stweil Could you try https://github.com/qurator-spk/dinglehopper/tree/feat/compare-line-texts again, after updating?

@mikegerber
Member Author

The feat/compare-line-texts branch now also computes WER and word differences. So, if it's tested, it's ready.
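A report over all lines needs to combine per-line results into one figure. A per-line call that returns both a rate and a word count (as `word_error_rate_n` appears to do in the traceback later in this thread) makes a weighted combination straightforward. The following sketch shows one plausible way to do that; `combine_rates` is a made-up name, and whether dinglehopper aggregates exactly like this is an assumption.

```python
def combine_rates(per_line):
    """Combine (rate, n) pairs into one overall rate, weighting each by n.

    rate is errors / n for one line, n is the number of reference words.
    """
    total_errors = 0.0
    total_n = 0
    for rate, n in per_line:
        if n == 0:
            # Simplification: lines without reference words are skipped.
            continue
        total_errors += rate * n
        total_n += n
    if total_n == 0:
        return 0.0, 0
    return total_errors / total_n, total_n
```

Weighting by `n` matters: a plain average of per-line rates would let a two-word line count as much as a twenty-word line.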

@stweil
Contributor

stweil commented Dec 15, 2021

A new test with the latest code shows that the memory issue is fixed, but with the full test set I get a new error (an endless recursion in word_error_rate.py line 25, test data is available online):

$ dinglehopper-line-dirs a b c
Traceback (most recent call last):
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/bin/dinglehopper-line-dirs", line 11, in <module>
    load_entry_point('dinglehopper==0.0.0', 'console_scripts', 'dinglehopper-line-dirs')()
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/click-8.0.3-py3.9.egg/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/click-8.0.3-py3.9.egg/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/click-8.0.3-py3.9.egg/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/click-8.0.3-py3.9.egg/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/cli_line_dirs.py", line 138, in main
    process(gt, ocr, report_prefix, metrics=metrics)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/cli_line_dirs.py", line 67, in process
    l_wer, l_n_words = word_error_rate_n(gt_text, ocr_text)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/multimethod-1.3-py3.9.egg/multimethod.py", line 171, in __call__
    return self[tuple(map(self.get_type, args))](*args, **kwargs)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 76, in word_error_rate_n
    return word_error_rate_n(reference.text, compared.text)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/multimethod-1.3-py3.9.egg/multimethod.py", line 171, in __call__
    return self[tuple(map(self.get_type, args))](*args, **kwargs)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 68, in word_error_rate_n
    compared_seq = list(words_normalized(compared))
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 43, in words
    for word in uniseg.wordbreak.words(s):
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/breaking.py", line 59, in break_units
    for j, bk in enumerate(breakables):
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/wordbreak.py", line 185, in word_breakables
    primitive_boundaries = list(_preprocess_boundaries(s))
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/wordbreak.py", line 153, in _preprocess_boundaries
    prop = word_break(c)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 25, in new_word_break
    return old_word_break(c, index)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 25, in new_word_break
    return old_word_break(c, index)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 25, in new_word_break
    return old_word_break(c, index)
  [Previous line repeated 975 more times]
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/wordbreak.py", line 129, in word_break
    return _word_break(code_point(c, index))
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/db.py", line 75, in word_break
    (ord(u),))
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/codepoint.py", line 127, in ord
    return ord_impl(c, index)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/codepoint.py", line 75, in ord_impl
    return _ord(c if index is None else c[index])
RecursionError: maximum recursion depth exceeded while calling a Python object
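A traceback of this shape, where a wrapper named `new_word_break` calls `old_word_break` hundreds of times before reaching the real function, is what one would see if a function is monkeypatched again on every call, each time wrapping the previous wrapper. The sketch below reproduces that pattern with a stand-in object; it is a guess at the mechanism, not a statement about what the actual fix changed, and `make_lib`, `patch_nested` and `patch_idempotent` are hypothetical names.

```python
import sys
import types


def make_lib():
    # Hypothetical stand-in for a library module exposing word_break().
    return types.SimpleNamespace(word_break=lambda c, index=0: "Other")


def patch_nested(lib):
    """Wrap whatever lib.word_break currently is; each call adds a layer."""
    old_word_break = lib.word_break

    def new_word_break(c, index=0):
        return old_word_break(c, index)

    lib.word_break = new_word_break


def patch_idempotent(lib):
    """Remember the original once and always wrap that, never a wrapper."""
    if not hasattr(lib, "_original_word_break"):
        lib._original_word_break = lib.word_break
    original = lib._original_word_break

    def new_word_break(c, index=0):
        return original(c, index)

    lib.word_break = new_word_break


def overflows(patch, times):
    """Apply patch repeatedly, then report whether one call hits RecursionError."""
    lib = make_lib()
    for _ in range(times):
        patch(lib)
    try:
        lib.word_break("a")
    except RecursionError:
        return True
    return False


layers = sys.getrecursionlimit() + 100
print(overflows(patch_nested, layers))      # wrapper chain deeper than the limit
print(overflows(patch_idempotent, layers))  # always exactly one wrapper
```

This would also be consistent with the observation that a single-line test works while the full data set fails: the chain only grows long enough once the patching code has run about as many times as the recursion limit allows.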

@stweil
Contributor

stweil commented Dec 15, 2021

With commits cb2be96 and 5b39464 reverted (= no WER), my full data set is processed in 5 seconds (no crash).

@mikegerber
Member Author

Great that half of it is working now! Unfortunately I'm on vacation now, so triaging the WER problem will have to wait until January. Thanks for the test data, this will help greatly!

@mikegerber
Member Author

mikegerber commented Jan 24, 2022

I've found the problem and fixed it in 8a3f5e4! The feature is now merged.

% /usr/bin/time -f'%e %M' dinglehopper-line-dirs a b
2.19 54028

~ 2 seconds and max. 55MB memory for your example data! 🍾
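For reference, GNU time's `%e` is the elapsed wall-clock time in seconds and `%M` the maximum resident set size in kilobytes, so 54028 is roughly 53 to 54 MB depending on whether 1024 or 1000 is used as the divisor. A comparable measurement can be taken from Python; the following is a Unix-only sketch with a made-up function name, not part of dinglehopper.

```python
import resource
import subprocess
import sys
import time


def time_and_max_rss(argv):
    """Run argv; return (elapsed seconds, max RSS of waited-for children)."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    elapsed = time.perf_counter() - start
    # ru_maxrss is in kilobytes on Linux (bytes on macOS). It is the maximum
    # over all child processes waited for so far, not only the last one.
    max_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, max_rss


# Example with a trivial child process; for the measurement above one would
# pass ["dinglehopper-line-dirs", "a", "b"] instead.
elapsed, max_rss = time_and_max_rss([sys.executable, "-c", "pass"])
print(f"{elapsed:.2f} {max_rss}")
```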

@mikegerber
Member Author

@stweil Let me know if that's working for you! I'll close this issue, feel free to re-open or open another issue if something's still wrong.

@mikegerber
Member Author

@stweil Did you run the latest version on your full data? Did it work?

3 participants