Outdated Unicode data in the re module #91575

Closed
2 of 4 tasks
serhiy-storchaka opened this issue Apr 15, 2022 · 2 comments · Fixed by #91580 or #91660
Labels
3.9 only security fixes 3.10 only security fixes 3.11 only security fixes topic-regex type-bug An unexpected behavior, bug, or error type-feature A feature request or enhancement

Comments

@serhiy-storchaka
Member

serhiy-storchaka commented Apr 15, 2022

  1. The re module contains a table of pairs of distinct lowercase characters that share the same uppercase form, that is, characters for which c1.upper() == c2.upper() and c1 != c2 and c1.lower() == c1 and c2.lower() == c2. For example, 'ς' and 'σ': 'ς'.upper() == 'σ'.upper() == 'Σ'.

    The table was generated for Python 3.5. Newer Python versions support newer Unicode standards, which added more such characters. For example, 'в' and 'ᲀ': 'в'.upper() == 'ᲀ'.upper() == 'В'.

    Python re lib fails case insensitive matches on Unicode data #56937

  2. The code depends on an assumption about characters outside the BMP. The comment says that there are only two ranges of cased non-BMP characters, and that RANGE_UNI_IGNORE works with them.

    There are now more ranges of cased non-BMP characters. The assumption still seems to hold and RANGE_UNI_IGNORE still works, but the comment is outdated.

    IGNORECASE breaks unicode literal range matching #61583
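The condition from item 1 can be checked directly with str methods. The following is only a minimal sketch: the helper name is_extra_case_pair is hypothetical and is not part of the re module, and the second example assumes a Python build whose Unicode database already knows U+1C80.

```python
def is_extra_case_pair(c1, c2):
    """Return True if c1 and c2 are distinct lowercase characters
    that share the same uppercase form (the condition from item 1)."""
    return (
        c1 != c2
        and c1.upper() == c2.upper()
        and c1.lower() == c1
        and c2.lower() == c2
    )


# Final sigma and regular sigma both uppercase to capital sigma.
print(is_extra_case_pair('ς', 'σ'))
# U+1C80 CYRILLIC SMALL LETTER ROUNDED VE and 'в' both uppercase to 'В'
# (assumes a Unicode database that includes U+1C80).
print(is_extra_case_pair('в', '\u1c80'))
```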

The plan is:

  • Regenerate the table with the current Unicode version for all maintained Python versions.
  • Test the assumption and update the comment.
  • Add a script and a make target for generating that table with the current Unicode version (the development version only).
  • For the above assumption, either test it in the script, or make the code work in case it does not hold.
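As a rough illustration of what such a generation script could look like, the sketch below groups lowercase characters that share an uppercase form and collects any group that contains a non-BMP character. It is not the script that was added to CPython: the function name extra_case_groups is hypothetical, it uses str.upper() and str.lower() as in the description above rather than whatever case mapping re uses internally, and as a simplification it only considers characters that change under upper().

```python
import sys
import unicodedata
from collections import defaultdict


def extra_case_groups():
    """Group lowercase characters by their shared uppercase form.

    Simplification (an assumption of this sketch): only characters that
    change under upper() are considered.
    """
    groups = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        c = chr(cp)
        u = c.upper()
        if c.lower() == c and u != c:
            groups[u].append(c)
    # Keep only uppercase forms shared by two or more lowercase characters.
    return [members for members in groups.values() if len(members) > 1]


groups = extra_case_groups()
# Groups containing a character outside the BMP would break the assumption.
non_bmp = [g for g in groups if any(ord(c) > 0xFFFF for c in g)]

print("Unicode version:", unicodedata.unidata_version)
print("groups found:", len(groups))
print("groups with non-BMP characters:", non_bmp)
```

Running this against each maintained Python version would show how the set of groups grows with the Unicode version, and a non-empty non_bmp list would indicate that the BMP assumption needs another look.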
@serhiy-storchaka serhiy-storchaka added type-bug An unexpected behavior, bug, or error type-feature A feature request or enhancement topic-regex 3.11 only security fixes 3.10 only security fixes 3.9 only security fixes labels Apr 15, 2022
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Apr 15, 2022
@ghost

ghost commented Apr 16, 2022

I will do these.

@serhiy-storchaka
Member Author

I am already working on this.

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Apr 18, 2022
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Apr 18, 2022
…latest Unicode version (pythonGH-91580).

(cherry picked from commit 1c2fceb)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this issue Apr 22, 2022
…ing in re (GH-91660)

Also test that all extra cases are in BMP.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Apr 22, 2022
…latest Unicode version (pythonGH-91580). (pythonGH-91661)

(cherry picked from commit 1c2fceb)
(cherry picked from commit 1748816)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this issue Apr 22, 2022
serhiy-storchaka added a commit that referenced this issue Apr 22, 2022
…Unicode version (GH-91580). (GH-91661) (GH-91837)

(cherry picked from commit 1c2fceb)
(cherry picked from commit 1748816)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
hello-adam pushed a commit to hello-adam/cpython that referenced this issue Jun 2, 2022
…atest Unicode version (pythonGH-91580). (pythonGH-91661) (pythonGH-91837)

(cherry picked from commit 1c2fceb)
(cherry picked from commit 1748816)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>