Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[3.8] closes bpo-37966: Fully implement the UAX GH-15 quick-check algorithm. (GH-15558) #15671

Merged
merged 1 commit into from
Sep 4, 2019

Conversation

miss-islington
Copy link
Contributor

@miss-islington miss-islington commented Sep 4, 2019

The purpose of the unicodedata.is_normalized function is to answer
the question str == unicodedata.normalized(form, str) more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX GH-15.

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm. This greatly speeds up
unicodedata.is_normalized in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.

At a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000'
-- 'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000'
-- 'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop

This restores a small optimization that the original version of this
code had for the unicodedata.normalize use case.

With this, that case is actually faster than in master!

$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000'
-- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop

$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000'
-- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
(cherry picked from commit 2f09413)

Co-authored-by: Greg Price gnprice@gmail.com

https://bugs.python.org/issue37966

…orithm. (pythonGH-15558)

The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalized(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX pythonGH-15.

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm.  This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.

At a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

  $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

  $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  5000000 loops, best of 5: 58.2 nsec per loop

This restores a small optimization that the original version of this
code had for the `unicodedata.normalize` use case.

With this, that case is actually faster than in master!

$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop

$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
(cherry picked from commit 2f09413)

Co-authored-by: Greg Price <gnprice@gmail.com>
@miss-islington
Copy link
Contributor Author

@gnprice and @benjaminp: Status check is done, and it's a success ✅ .

@miss-islington miss-islington merged commit 4dd1c9d into python:3.8 Sep 4, 2019
@miss-islington miss-islington deleted the backport-2f09413-3.8 branch September 4, 2019 03:03
@miss-islington
Copy link
Contributor Author

@gnprice and @benjaminp: Status check is done, and it's a success ✅ .

1 similar comment
@miss-islington
Copy link
Contributor Author

@gnprice and @benjaminp: Status check is done, and it's a success ✅ .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants