bpo-37966: Fully implement the UAX #15 quick-check algorithm. #15558
Conversation
This link doesn't work. Going back through that UAX's history to find the version that was current when this code was added in commit 7a0fedf in 2009-04, we find that the anchor still works in that version: https://www.unicode.org/reports/tr15/tr15-29.html#Annex8 . It's a section heading, "14. Detecting Normalization Forms". Happily, the anchor that the corresponding section heading now offers looks much more reasonable -- it's the title of the section -- and so likely to be long-term stable. ("Annex 8" seems like some kind of editing error.) Switch to that.
The purpose of the `unicodedata.is_normalized` function is to answer the question `str == unicodedata.normalized(form, str)` more efficiently than writing just that, by using the "quick check" optimization described in the Unicode standard in UAX #15.

However, it turns out the code doesn't implement the full algorithm from the standard, and as a result we often miss the optimization and end up having to compute the whole normalized string after all. Implement the standard's algorithm.

This greatly speeds up `unicodedata.is_normalized` in many cases where our partial variant of quick-check had been returning MAYBE and the standard algorithm returns NO.

At a quick test on my desktop, the existing code takes about 4.4 ms/MB (so 4.4 ns per byte) when the partial quick-check returns MAYBE and it has to do the slow normalize-and-compare:

    $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
        -- 'unicodedata.is_normalized("NFD", s)'
    50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB string:

    $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
        -- 'unicodedata.is_normalized("NFD", s)'
    5000000 loops, best of 5: 58.2 nsec per loop
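For reference, the standard's quick-check loop is short. Here is a minimal C sketch of it, following the sample code in UAX #15; `get_canonical_class()` and `get_quick_check()` are hypothetical stand-ins for the UCD lookups (canonical combining class, and the per-form Quick_Check property), not functions in this module:

    #include <stdint.h>
    #include <stddef.h>

    typedef enum {YES = 0, MAYBE = 1, NO = 2} QuickcheckResult;

    /* Hypothetical UCD lookups, assumed for this sketch: */
    uint8_t get_canonical_class(uint32_t ch);
    QuickcheckResult get_quick_check(uint32_t ch);

    QuickcheckResult
    quickcheck_sketch(const uint32_t *s, size_t len)
    {
        uint8_t last_ccc = 0;
        QuickcheckResult result = YES;
        for (size_t i = 0; i < len; i++) {
            uint8_t ccc = get_canonical_class(s[i]);
            /* A combining mark out of canonical order: definitely
               not normalized. */
            if (last_ccc > ccc && ccc != 0)
                return NO;
            QuickcheckResult check = get_quick_check(s[i]);
            if (check == NO)
                return NO;      /* never survives normalization */
            if (check == MAYBE)
                result = MAYBE; /* keep scanning: a definite NO may follow */
            last_ccc = ccc;
        }
        return result;
    }

The existing code stopped at the first MAYBE; the standard's version keeps scanning, because a later character can still force a definite NO.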
This restores a small optimization that the original version of this code had for the `unicodedata.normalize` use case. With this, that case is actually faster than in master!

    $ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
        -- 'unicodedata.normalize("NFD", s)'
    500 loops, best of 5: 561 usec per loop

    $ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
        -- 'unicodedata.normalize("NFD", s)'
    500 loops, best of 5: 512 usec per loop
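That restored optimization amounts to a fast path at the top of the C `normalize()` implementation, roughly like this sketch (the exact spelling in the module may differ):

    /* Sketch: if the impatient quick-check already answers YES, the
       string is its own normalization, so return it unchanged instead
       of building a copy. */
    if (is_normalized_quickcheck(self, input, nfc, k,
                                 /* yes_only= */ true) == YES) {
        Py_INCREF(input);
        return input;
    }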
Thanks. Looks good. Here are a few nitty comments.
    is_normalized(PyObject *self, PyObject *input, int nfc, int k)

    // This needs to match the logic in makeunicodedata.py
    // which constructs the quickcheck data.
    typedef enum {YES = 0, MAYBE = 1, NO = 2} QuickcheckResult;
Maybe `makeunicodedata.py` should output this enum (with better name namespacing) then?
Ah, good idea!
Do you think it'd be cleaner to add that to this PR, or make a small separate one?
Could be done later, I think.
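If the generator did emit it, the output might look something like this (purely hypothetical names and namespacing; `makeunicodedata.py` does not write this today):

    /* Hypothetical block makeunicodedata.py could emit next to the
       quickcheck tables, so the enum and the data can never drift
       apart.  The values must keep matching the generator's encoding. */
    typedef enum {
        _PyUnicode_QUICKCHECK_YES = 0,
        _PyUnicode_QUICKCHECK_MAYBE = 1,
        _PyUnicode_QUICKCHECK_NO = 2,
    } _PyUnicode_QuickcheckResult;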
    @@ -208,6 +208,8 @@ def test_issue29456(self):
         self.assertEqual(self.db.normalize('NFC', u11a7_str_a), u11a7_str_b)
         self.assertEqual(self.db.normalize('NFC', u11c3_str_a), u11c3_str_b)

    +    # For tests of unicodedata.is_normalized / self.db.is_normalized ,
    +    # see test_normalization.py .
ugh, I would support merging test_normalization into this file for clarity.
Yeah; as it is, it actually caused me to think this function had evaded being tested at all -- I'd gone as far as to write up some tests myself before I spotted that other file.
A bit of context on someone's thinking in having a separate file can be found above at line 178:
# The rest can be found in test_normalization.py
# which requires an external file.
I don't see why that means the test code should be in a separate file, though. There's already a try/skip mechanism to deal with the external file being unavailable. (I'm guessing that if I looked in the history I'd find that that mechanism wasn't there when the separate file was first added.)
Happy to send a PR to fold that file in.
Modules/unicodedata.c:

     */
    static QuickcheckResult
    is_normalized_quickcheck(PyObject *self, PyObject *input,
                             int nfc, int k, int yes_only)
These "boolean int" parameters could be actual bool
s.
Oh wow it's the future! 🙂 I guess that's a C99 feature, and one of those we now accept using. Good to know.
I'm not sure I want to convert the existing two parameters within this PR; it seems like that could muddy the diff a bit with changes that are orthogonal to what it's really doing. But I guess I can see how it looks.
Pushed a change to `bool` for the new parameter.

I was a bit surprised to find that I needed to add an `#include <stdbool.h>` line to the file! Perhaps that should be in a central include file, in the same way as `limits.h` and `inttypes.h` are. That's certainly orthogonal to this PR, though.
I made a quick further commit converting `nfc` and `k` to `bool` too (including in the callers, and using `true`/`false` rather than 1/0 for literal values).

I'm ambivalent about adding it here. It expands the diff a bit -- was:

    1 file changed, 51 insertions(+), 24 deletions(-)

and becomes:

    1 file changed, 57 insertions(+), 30 deletions(-)

Slightly inclined to leave it out, but if you'd rather include it I'll do that.
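For concreteness, with all three flags converted the declaration looks roughly like this (a sketch; `QuickcheckResult` is the enum shown earlier):

    #include <stdbool.h>

    static QuickcheckResult
    is_normalized_quickcheck(PyObject *self, PyObject *input,
                             bool nfc, bool k, bool yes_only);

    /* Callers now pass true/false rather than 1/0, e.g.
       is_normalized_quickcheck(self, input, true, false, false). */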
What makes the diffstat larger? Aren't you already touching all callers, since you added a parameter?
I think we want to avoid polluting public headers with a `stdbool.h` include; `bool` is a bit of a name grab in arbitrary mature C codebases. Using `stdbool.h` where it's needed is also consistent with IWYU.
> I think we want to avoid polluting public headers with a `stdbool.h` include; `bool` is a bit of a name grab in arbitrary mature C codebases.

Ah, fair enough. I think that's consistent with putting it in a generic internal-only header, though. Within the tree, the very few existing uses are either stdbool, or (in `Modules/expat/xmltok.c`) a sort of manual stdbool.

> Using `stdbool.h` where it's needed is also consistent with IWYU.

Hmm, true. I guess IWYU doesn't feel like how this codebase is mostly written... but maybe it's best to make it more so.
> What makes the diffstat larger? Aren't you already touching all callers, since you added a parameter?

Here's the extra commit: gnprice@896c970f5

The lines not otherwise touched are where we declare and then fiddle with the `nfc` and `k` locals in `unicodedata_UCD_is_normalized_impl`.
Okay, we can fix up the rest in a sequel.
     * normalized or certainly not, and MAYBE if quickcheck is unable to
     * tell.
     *
     * If `yes_only` is true, then return MAYBE as soon as we determine
`yes_only` is a bizarre name for a parameter that causes the function to return `MAYBE`.
Yeah, I'm not really a fan of this name; I'd be glad for a better one.
The idea is that this flag says the caller doesn't care about NO vs. MAYBE; it only cares whether the answer is YES.
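Concretely, inside the per-character loop the flag acts roughly like this (a sketch of the relevant branch, not the exact diff):

    /* `check` is the current character's Quick_Check value. */
    if (check == NO)
        return NO;
    if (check == MAYBE) {
        if (yes_only)
            return MAYBE;  /* caller treats anything but YES alike: stop early */
        result = MAYBE;    /* standard algorithm: keep scanning for a NO */
    }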
Maybe invert the sense and call it `want_definite`, meaning "process the string until we're sure it's not normalized or we reach the end"?
The algorithm in the standard is the `yes_only = false` (or `want_definite = true`) case, so I think things are simplest to describe if that's the baseline and the other case is described by contrast with it. Could certainly handle that with some added words, though.

Another aspect is that there is an asymmetry between a definite YES and a definite NO. In the `yes_only` / `!want_definite` case, what the caller really most wants is a definite YES... it's just a definite NO that is no more helpful than a MAYBE. Or another angle on that: if the caller really didn't care about getting a definite answer, it could just assume MAYBE and not bother calling the helper 😉 It's in the hope of getting a YES that it makes the call.

More ideas just now: perhaps `maybe_is_no`, to mean "a MAYBE is no worse than a NO"? Or `yes_or_maybe`, for "there are only two distinct answers, YES and MAYBE". That's basically what I was going for with `yes_only`, but probably a less paradoxical way of putting it 🙂
The API also surprised me. I suggest to rename is_normalized_quickcheck() to is_normalized_impl() and add 2 functions: is_normalized() (yes_only=false) and is_normalized_quickcheck() (yes_only=true).
> I suggest to rename is_normalized_quickcheck() to is_normalized_impl() and add 2 functions: is_normalized() (yes_only=false) and is_normalized_quickcheck() (yes_only=true).

Hmm, I like the idea of covering over some of the multiplication of flags with wrapper functions!

I don't like quite these names, though. First, I don't think `is_normalized` is a good name, for either value of the flag, because the function doesn't actually take full responsibility for finding out if the string is normalized -- it can return MAYBE, and then it's up to the caller to go and normalize the string and compare.

What the function does do is implement a quick but partial check to try to see if the string is normalized. Specifically it implements one that's defined in the standard, which the standard calls `quickCheck` in its sample code, and which uses a property in the UCD called `Quick_Check`.

More precisely, the yes_only=false case implements that standard quick-check algorithm; the yes_only=true (or yes_or_maybe=true, etc.) case (which is the version in master) is a variation on it. I think the name "quickcheck" or `is_normalized_quickcheck`, with no further qualifications, sounds like it will mean the standard's algorithm and it's confusing if it means a different algorithm instead.

Here's another possible set of names to use with the same idea:

- rename this function to `is_normalized_quickcheck_impl`
- `is_normalized_quickcheck` is the standard's quickcheck (yes_only/yes_or_maybe=false)
- `is_normalized_quickercheck` is the variant that's more impatient to return and hand responsibility back to the caller (yes_or_maybe=true). Or `is_normalized_quickcheck_quicker`.

Or if those names feel too long: `is_normalized_qc_impl`, `is_normalized_qc`, `is_normalized_qc_quicker`.

How does that sound?
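Either way, the wrappers themselves would be tiny. A hypothetical sketch with the short names (not merged code; `is_normalized_qc_impl` is the proposed rename of the current function):

    /* Hypothetical wrappers over the yes_only flag: */
    static QuickcheckResult
    is_normalized_qc(PyObject *self, PyObject *input, bool nfc, bool k)
    {
        /* The standard's full quick-check: YES, MAYBE, or a definite NO. */
        return is_normalized_qc_impl(self, input, nfc, k,
                                     /* yes_only= */ false);
    }

    static QuickcheckResult
    is_normalized_qc_quicker(PyObject *self, PyObject *input, bool nfc, bool k)
    {
        /* The impatient variant: returns MAYBE as soon as YES is ruled out. */
        return is_normalized_qc_impl(self, input, nfc, k,
                                     /* yes_only= */ true);
    }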
> Or `yes_or_maybe`, for "there are only two distinct answers, YES and MAYBE".

Another idea for the flag name (though its name becomes less important if we use Victor's idea of a pair of wrapper functions): `yes_vs_maybe`.
It doesn't seem like anything is standing out as dramatically better. At least the semantics are clearly documented and easily read from the function.
So, let's go with the current state.
Thanks @gnprice for the PR, and @benjaminp for merging it 🌮🎉.. I'm working now to backport this PR to: 3.8.
GH-15671 is a backport of this pull request to the 3.8 branch.
Thanks for all the reviews and the merge! @benjaminp , I'll go on and send next some of those followups you suggested.

Filed bpo-38043 for that, and sent PRs for two of them.
https://bugs.python.org/issue37966