[3.8] bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558) #15671
The purpose of the unicodedata.is_normalized function is to answer the question str == unicodedata.normalize(form, str) more efficiently than writing just that, by using the "quick check" optimization described in the Unicode standard in UAX #15.
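
For example, both spellings answer the same question (the values here are plain facts about NFC, not part of the patch):

    import unicodedata

    s = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
    # The naive spelling: build the normalized copy, then compare.
    print(s == unicodedata.normalize("NFC", s))   # False; NFC composes this to "\u00e9"
    # The quick-check-backed spelling gives the same answer, ideally
    # without materializing the normalized string at all.
    print(unicodedata.is_normalized("NFC", s))    # False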
However, it turns out the code doesn't implement the full algorithm from the standard, and as a result we often miss the optimization and end up having to compute the whole normalized string after all.

Implement the standard's algorithm. This greatly speeds up unicodedata.is_normalized in many cases where our partial variant of quick-check had been returning MAYBE and the standard algorithm returns NO.
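
For reference, here is a rough Python transcription of the standard's quick-check loop (UAX #15, section 9). Python's unicodedata module does not expose the Quick_Check character property, so quick_check_property() below is a hypothetical stand-in for that table lookup; unicodedata.combining() is the real canonical-combining-class accessor. The detail the old code missed is that NO and MAYBE are distinct answers, and that seeing MAYBE must not stop the scan, since a later character can still force a definite NO:

    from unicodedata import combining

    def quick_check(s, form):
        """Return "YES", "NO", or "MAYBE" for whether s is normalized to form."""
        last_ccc = 0
        result = "YES"
        for ch in s:
            ccc = combining(ch)
            # Combining marks must appear in non-decreasing canonical
            # combining class order; a violation is a definite NO.
            if ccc != 0 and last_ccc > ccc:
                return "NO"
            qc = quick_check_property(ch, form)  # hypothetical table lookup
            if qc == "NO":
                return "NO"       # definite answer, distinct from MAYBE
            if qc == "MAYBE":
                result = "MAYBE"  # remember it, but keep scanning for a NO
            last_ccc = ccc
        return result

is_normalized can then return True for "YES" and False for "NO", falling back to the slow normalize-and-compare only on "MAYBE".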
In a quick test on my desktop, the existing code takes about 4.4 ms/MB (i.e. 4.4 ns per byte) when the partial quick-check returns MAYBE and it has to do the slow normalize-and-compare:
    $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
        -- 'unicodedata.is_normalized("NFD", s)'
    50 loops, best of 5: 4.39 msec per loop
With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:
    $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
        -- 'unicodedata.is_normalized("NFD", s)'
    5000000 loops, best of 5: 58.2 nsec per loop
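
The instant answer makes sense: "\uf900" (CJK COMPATIBILITY IDEOGRAPH-F900) has a canonical decomposition, so its NFD quick-check property is "No", and the very first character already settles the question as a definite NO under the full algorithm. The stock unicodedata module confirms the decomposition:

    import unicodedata

    # U+F900 canonically decomposes to U+8C48, so it can never occur
    # in an NFD-normalized string.
    print(unicodedata.decomposition("\uf900"))             # -> '8C48'
    print(ascii(unicodedata.normalize("NFD", "\uf900")))   # -> '\u8c48'
    print(unicodedata.is_normalized("NFD", "\uf900"))      # -> False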
This restores a small optimization that the original version of this code had for the unicodedata.normalize use case. With this, that case is actually faster than in master!
    $ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
        -- 'unicodedata.normalize("NFD", s)'
    500 loops, best of 5: 561 usec per loop

    $ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
        -- 'unicodedata.normalize("NFD", s)'
    500 loops, best of 5: 512 usec per loop
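
Conceptually, the restored fast path looks like the sketch below (the real implementation is C in Modules/unicodedata.c; quick_check() is the loop sketched earlier, and do_full_normalization() is a hypothetical placeholder for the slow path). For the "\u0338" benchmark every character has quick-check YES and an identical combining class, so the scan answers YES and the input string is handed back unchanged:

    def normalize(form, s):
        # Fast path: if the quick check already proves s is normalized,
        # return the very same string object instead of rebuilding it.
        if quick_check(s, form) == "YES":
            return s
        return do_full_normalization(form, s)  # hypothetical slow path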
(cherry picked from commit 2f09413)
Co-authored-by: Greg Price <gnprice@gmail.com>
https://bugs.python.org/issue37966