Support comparing line GT directories with line OCR directories #64
Comments
What about an even simpler interface:
The existing |
For now, until the interface is finalized, I'd like to keep the CLI interface separate; it will share the code anyway.
For the stripping of all extensions to work, we would need to assume that the common prefix for a pair does not contain a dot, and the explicit suffix options seemed saner. But I think I'll start implementing this; CLI details can still be refined later.
They will default to something useful: the longest common suffix, i.e.
(gives |
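The "longest common suffix" default mentioned above could be computed roughly as follows. This is only a sketch, not the code from the branch: the function name `longest_common_suffix` and the example filenames are made up for illustration.

```python
import os


def longest_common_suffix(names):
    """Return the longest string that every name in `names` ends with."""
    if not names:
        return ""
    # os.path.commonprefix compares character by character, so reverse
    # each name, take the common prefix, and reverse the result back.
    reversed_names = [name[::-1] for name in names]
    return os.path.commonprefix(reversed_names)[::-1]


# Hypothetical filenames, for illustration only.
gt_names = ["line001.gt.txt", "line002.gt.txt", "line010.gt.txt"]
print(longest_common_suffix(gt_names))  # prints ".gt.txt"
```

One caveat of such a default: if all filenames happen to end in the same character before the extension (e.g. only `line010` and `line020`), that character becomes part of the detected suffix, which is one reason explicit suffix options remain useful.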
@stweil Could you test if this works for you?
The lines also line up perfectly, because each pair is put into its own |
My first test fails:
The crash happens in |
@maxbachmann, I now tried to debug the RapidFuzz code, but
I can't reproduce with Python 3.9 and rapidfuzz-1.9.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl. Hmm. Could we have a look at the data (or a portion of it) that triggers this?
@stweil did you clone the repository including submodules?
As @mikegerber mentioned it would help if you could provide me with some data to reproduce this.
Minimal single line test case (found by bisecting the original large test set):
No, I did not. The installation works after |
Thanks, I could reproduce the crash. I will look into it.
Ouch, I had a typo in the edit distance calculation: rapidfuzz/rapidfuzz-cpp@103674d. I released a new version of RapidFuzz with the fix: https://github.com/maxbachmann/RapidFuzz/releases/tag/v1.9.1
Great that this bug is fixed. I've bumped the rapidfuzz dependency to >=1.9.1! @stweil Could you try https://github.com/qurator-spk/dinglehopper/tree/feat/compare-line-texts again, after updating?
The feat/compare-line-texts branch now also computes WER and word differences. So, once it's tested, it's ready.
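For readers unfamiliar with the metric, WER is the word-level edit distance divided by the number of reference words. The sketch below illustrates the idea only; it is not the code in word_error_rate.py. It assumes plain whitespace tokenization (the real tool may segment words differently), and the function names are made up for illustration.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word lists, computed iteratively row by row."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = word edit distance / number of reference words.

    Assumption: words are split on whitespace.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        # Convention chosen for this sketch: an empty reference gives 0
        # for an empty hypothesis and infinity otherwise.
        return 0.0 if not hyp_words else float("inf")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)


print(word_error_rate("the quick brown fox", "the quick brown fix"))  # prints 0.25
```

The loop-based formulation keeps only two rows of the distance table, so it needs no recursion and memory grows with the length of one line rather than the product of both.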
A new test with the latest code shows that the memory issue is fixed, but with the full test set I get a new error (an endless recursion in word_error_rate.py line 25, test data is available online):
Great that half of it is working now! Unfortunately I'm on vacation now, so triaging the WER problem will have to wait until January. Thanks for the test data, this will help greatly!
I've found the problem and fixed it in 8a3f5e4! The feature is now merged.
About 2 seconds and at most 55 MB of memory for your example data! 🍾
@stweil Let me know if that's working for you! I'll close this issue, feel free to re-open or open another issue if something's still wrong.
@stweil Did you run the latest version on your full data? Did it work?
In #62, @stweil's original problem was, as I understand it, to compare a directory of line GT text files with a directory of line OCR text files. For now I've created fake test data to implement this: fake-line-gt.zip. It looks like this:
A first implementation should compare the text of pairs of files (matched by filename) and produce a report of metrics and differences over all of the lines. First idea of the CLI interface:
I'm not sure if this will be the final CLI interface, but it's what seems necessary at first glance.
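The filename matching described above could look roughly like this. It is a sketch under assumptions, not the eventual implementation: the function name `pair_line_files` and the default suffixes `.gt.txt` and `.ocr.txt` are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def pair_line_files(gt_dir, ocr_dir, gt_suffix=".gt.txt", ocr_suffix=".ocr.txt"):
    """Pair GT and OCR line files that share a name once the suffix is stripped.

    Returns a sorted list of (gt_path, ocr_path) tuples. GT files without
    a matching OCR file are skipped in this sketch.
    """
    gt_dir, ocr_dir = Path(gt_dir), Path(ocr_dir)
    pairs = []
    for gt_path in sorted(gt_dir.rglob("*" + gt_suffix)):
        # Keep the relative path so that subdirectories are matched as well.
        relative = str(gt_path.relative_to(gt_dir))
        stem = relative[: len(relative) - len(gt_suffix)]
        ocr_path = ocr_dir / (stem + ocr_suffix)
        if ocr_path.is_file():
            pairs.append((gt_path, ocr_path))
    return pairs


def read_pairs(pairs):
    """Yield (gt_text, ocr_text) for each pair, with trailing newlines removed."""
    for gt_path, ocr_path in pairs:
        gt_text = gt_path.read_text(encoding="utf-8").rstrip("\n")
        ocr_text = ocr_path.read_text(encoding="utf-8").rstrip("\n")
        yield gt_text, ocr_text
```

Each text pair can then be fed to the character and word metrics, and the per-line results aggregated into one report.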