Python 3.11: re.error: global flags not at the start of the expression at position 3 #91

Closed
4 tasks done
mikegerber opened this issue Oct 12, 2023 · 9 comments

@mikegerber
Collaborator

mikegerber commented Oct 12, 2023

Possibly a problem only with Python 3.11:

FAILED test/test_recognize.py::test_recognize - re.error: global flags not at the start of the expression at position 3
FAILED test/test_recognize.py::test_recognize_should_warn_if_given_rgb_image_and_single_channel_model - re.error: global flags not at the start of the expression at position 3
FAILED test/test_recognize.py::test_word_segmentation - re.error: global flags not at the start of the expression at position 3
FAILED test/test_recognize.py::test_glyphs - re.error: global flags not at the start of the expression at position 3

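The Python 3.11 changelog seems to support this assumption: global inline flags such as (?u) are now only allowed at the start of a regular expression (using them elsewhere had been deprecated since Python 3.6).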

(It would have been nice to catch this early with a linter. → Opening another issue.)
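
For reference, the failure can be reproduced without Calamari. The following is a minimal sketch; `compiles_cleanly` is a hypothetical helper name, not part of any library. It treats both the `re.error` raised by Python 3.11+ and the `DeprecationWarning` issued by Python 3.6 to 3.10 as "not clean", so it reports the misplaced flag on older interpreters too.

```python
import re
import warnings


def compiles_cleanly(pattern: str) -> bool:
    """Return True if `pattern` compiles without an error or a deprecation warning."""
    # Clear the compile cache so an earlier compile cannot hide the warning.
    re.purge()
    with warnings.catch_warnings():
        # Turn the pre-3.11 DeprecationWarning into an exception.
        warnings.simplefilter("error", DeprecationWarning)
        try:
            re.compile(pattern)
        except (re.error, DeprecationWarning):
            return False
    return True


print(compiles_cleanly(r"\s+(?u)"))  # global flag at position 3
print(compiles_cleanly(r"(?u)\s+"))  # global flag at the start
```

Position 3 in the error message matches the first pattern: `\s+` occupies positions 0 to 2, and the `(?u)` group starts right after it.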


@mikegerber mikegerber added the bug Something isn't working label Oct 12, 2023
@mikegerber mikegerber self-assigned this Oct 12, 2023
@mikegerber mikegerber mentioned this issue Oct 12, 2023
3 tasks
@mikegerber
Collaborator Author

mikegerber commented Oct 12, 2023

Indeed, this does not happen with Python 3.10.12.

@mikegerber
Collaborator Author

It's a problem in calamari-ocr, not ocrd_calamari:

/home/b-mg106/devel/ocrd_calamari/ocrd_calamari/recognize.py:129: in process
    for line, line_coords, raw_results in zip(textlines, line_coordss, raw_results_all):
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/predictor.py:250: in predict_raw
    for result in zip(*prediction):
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/predictor.py:167: in predict_raw
    yield PredictionResult(p.decoded, codec=self.codec, text_postproc=self.text_postproc,
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/predictor.py:37: in __init__
    self.sentence = self.text_postproc.apply("".join(self.chars))
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/text_processing/text_processor.py:12: in apply
    return self._apply_single(txts)
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/text_processing/text_processor.py:44: in _apply_single
    txt = proc._apply_single(txt)
/home/b-mg106/.pyenv/versions/tmp.ocrd_calamari.issue-91/lib/python3.11/site-packages/calamari_ocr/ocr/text_processing/text_regularizer.py:350: in _apply_single
    txt = re.sub(replacement.old, replacement.new, txt)
/home/b-mg106/.pyenv/versions/3.11.3/lib/python3.11/re/__init__.py:185: in sub
    return _compile(pattern, flags).sub(repl, string, count)
/home/b-mg106/.pyenv/versions/3.11.3/lib/python3.11/re/__init__.py:294: in _compile
    p = _compiler.compile(pattern, flags)
/home/b-mg106/.pyenv/versions/3.11.3/lib/python3.11/re/_compiler.py:743: in compile
    p = _parser.parse(p, flags)
/home/b-mg106/.pyenv/versions/3.11.3/lib/python3.11/re/_parser.py:980: in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
/home/b-mg106/.pyenv/versions/3.11.3/lib/python3.11/re/_parser.py:455: in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,

calamari-ocr 1.0.6

@mikegerber
Collaborator Author

This took me a while to find. The problem lies in the regexen defined in the model!

E.g. 0.ckpt.json:

            {
              "old": "\\s+(?u)",
              "new": " ",
              "regex": true
            },
            {
              "old": "\\n(?u)",
              "regex": true
            },
            {
              "old": "^\\s+(?u)",
              "regex": true
            },
            {
              "old": "\\s+$(?u)",
              "regex": true
            }
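
To check whether a given model is affected, something like the following sketch could be used. The function names are hypothetical, and the walk is deliberately layout-agnostic because the nesting of the checkpoint JSON beyond the excerpt above is an assumption here. The flag check is only a heuristic: it looks for an inline flag group after position 0 and does not handle escaped parentheses or character classes.

```python
import re
from typing import Any, Iterator


def iter_regex_patterns(node: Any) -> Iterator[str]:
    """Yield the "old" value of every dict that has "regex": true, at any depth."""
    if isinstance(node, dict):
        if node.get("regex") is True and isinstance(node.get("old"), str):
            yield node["old"]
        for value in node.values():
            yield from iter_regex_patterns(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_regex_patterns(item)


def has_misplaced_global_flag(pattern: str) -> bool:
    """Heuristic: an inline flag group such as (?u) starts after position 0."""
    return any(m.start() > 0 for m in re.finditer(r"\(\?[aiLmsux]+\)", pattern))


# In-memory stand-in for a decoded checkpoint; a real file would be
# loaded with json.load() instead.
sample = {
    "replacements": [
        {"old": "\\s+(?u)", "new": " ", "regex": True},
        {"old": "(?u)\\s+", "new": " ", "regex": True},
    ]
}
offending = [p for p in iter_regex_patterns(sample) if has_misplaced_global_flag(p)]
print(offending)
```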

@mikegerber
Collaborator Author

Fixing the regexen in *.ckpt.json indeed fixes running on Python 3.11. I only tested make test for now, but this is promising.

Quick-and-dirty script to fix the model:

import json
import re
from glob import glob

# Rewrite every JSON file in the current directory in place.
for fn in glob("*.json"):
    with open(fn, "r") as fp:
        j = json.load(fp)

    # Skip non-dict entries of "model"; the replacements live in the
    # children of the dict-valued entries.
    for v in j["model"].values():
        if not isinstance(v, dict):
            continue
        for child in v.get("children", []):
            for replacement in child.get("replacements", []):
                # Move a trailing global flag "(?u)" to the front of the pattern
                replacement["old"] = re.sub(
                    r"^(.*)\(\?u\)$", r"(?u)\1", replacement["old"]
                )

    with open(fn, "w") as fp:
        json.dump(j, fp, indent=2)
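
To illustrate what the substitution does, here is a small sketch that applies it to the four patterns quoted from 0.ckpt.json (as they look after JSON decoding). `move_trailing_u_flag` is a hypothetical wrapper name used only for this example.

```python
import re


def move_trailing_u_flag(pattern: str) -> str:
    """Move a trailing "(?u)" to the front, using the same substitution as the script."""
    return re.sub(r"^(.*)\(\?u\)$", r"(?u)\1", pattern)


# The four patterns from the model excerpt, after JSON decoding
for old in [r"\s+(?u)", r"\n(?u)", r"^\s+(?u)", r"\s+$(?u)"]:
    new = move_trailing_u_flag(old)
    re.compile(new)  # would raise re.error if the flag were still misplaced
    print(f"{old!r} -> {new!r}")
```

A pattern that already starts with `(?u)` and does not end with it is left unchanged, so running the fix twice on the same model should be harmless.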

@mikegerber
Collaborator Author

mikegerber commented Oct 16, 2023

master now includes the above script as fix-calamari1-model:

❯ fix-calamari1-model ~/.local/share/ocrd-resources/ocrd-calamari-recognize/qurator-gt4histocr-1.0
 [ ... unrelated numpy warning ... ]
0.ckpt.json fixed.
1.ckpt.json fixed.
2.ckpt.json fixed.
3.ckpt.json fixed.
4.ckpt.json fixed.

This (or something equivalent) should probably go into Calamari's 1.0 branch.

@mikegerber
Collaborator Author

> Fixing the regexen in *.ckpt.json indeed fixes running on Python 3.11. I only tested make test for now, but this is promising.

ocrd-calamari-recognize also works with the fixed model.

@mikegerber
Collaborator Author

I've opened an issue upstream: Calamari-OCR/calamari#348

@mikegerber
Collaborator Author

> I've opened an issue upstream: Calamari-OCR/calamari#348

And a PR against calamari/1.0 branch that fixes the issue: Calamari-OCR/calamari#349

@mikegerber
Collaborator Author

* [ ]  Document the workaround here 
* [ ]  Get a model update/fix procedure upstream

This issue should be enough documentation, especially since nobody else uses Python 3.11 yet (fingers crossed). The fix is merged upstream, and the new release is tracked in #94.
