bpo-37966: Fully implement the UAX #15 quick-check algorithm. #15558
Conversation
This link doesn't work. Going back through that UAX's history to find the version that was current when this code was added in commit 7a0fedf in 2009-04, we find that the anchor still works in that version: https://www.unicode.org/reports/tr15/tr15-29.html#Annex8 . It's a section heading, "14. Detecting Normalization Forms". Happily, the anchor that the corresponding section heading now offers looks much more reasonable -- it's the title of the section -- and so likely to be long-term stable. ("Annex 8" seems like some kind of editing error.) Switch to that.
The purpose of the `unicodedata.is_normalized` function is to answer the question `str == unicodedata.normalized(form, str)` more efficiently than writing just that, by using the "quick check" optimization described in the Unicode standard in UAX #15.

However, it turns out the code doesn't implement the full algorithm from the standard, and as a result we often miss the optimization and end up having to compute the whole normalized string after all. Implement the standard's algorithm.

This greatly speeds up `unicodedata.is_normalized` in many cases where our partial variant of quick-check had been returning MAYBE and the standard algorithm returns NO.

At a quick test on my desktop, the existing code takes about 4.4 ms/MB (so 4.4 ns per byte) when the partial quick-check returns MAYBE and it has to do the slow normalize-and-compare:

    $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
        -- 'unicodedata.is_normalized("NFD", s)'
    50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB string:

    $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
        -- 'unicodedata.is_normalized("NFD", s)'
    5000000 loops, best of 5: 58.2 nsec per loop
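For reference, the standard's quick-check loop is short. Here is a minimal C sketch of it, following the sample code in UAX #15; `get_canonical_class()` and `get_quick_check()` are hypothetical stand-ins for the UCD lookups (canonical combining class, and the per-form Quick_Check property), not functions in this module:

    #include <stdint.h>
    #include <stddef.h>

    typedef enum {YES = 0, MAYBE = 1, NO = 2} QuickcheckResult;

    /* Hypothetical UCD lookups, assumed for this sketch: */
    uint8_t get_canonical_class(uint32_t ch);
    QuickcheckResult get_quick_check(uint32_t ch);

    QuickcheckResult
    quickcheck_sketch(const uint32_t *s, size_t len)
    {
        uint8_t last_ccc = 0;
        QuickcheckResult result = YES;
        for (size_t i = 0; i < len; i++) {
            uint8_t ccc = get_canonical_class(s[i]);
            /* A combining mark out of canonical order: definitely
               not normalized. */
            if (last_ccc > ccc && ccc != 0)
                return NO;
            QuickcheckResult check = get_quick_check(s[i]);
            if (check == NO)
                return NO;      /* never survives normalization */
            if (check == MAYBE)
                result = MAYBE; /* keep scanning: a definite NO may follow */
            last_ccc = ccc;
        }
        return result;
    }

The existing code stopped at the first MAYBE; the standard's version keeps scanning, because a later character can still force a definite NO.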
This restores a small optimization that the original version of this code had for the `unicodedata.normalize` use case. With this, that case is actually faster than in master!

    $ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
        -- 'unicodedata.normalize("NFD", s)'
    500 loops, best of 5: 561 usec per loop

    $ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
        -- 'unicodedata.normalize("NFD", s)'
    500 loops, best of 5: 512 usec per loop
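That restored optimization amounts to a fast path at the top of the C `normalize()` implementation, roughly like this sketch (the exact spelling in the module may differ):

    /* Sketch: if the impatient quick-check already answers YES, the
       string is its own normalization, so return it unchanged instead
       of building a copy. */
    if (is_normalized_quickcheck(self, input, nfc, k,
                                 /* yes_only= */ true) == YES) {
        Py_INCREF(input);
        return input;
    }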
Thanks. Looks good. Here are a few nitty comments.
    is_normalized(PyObject *self, PyObject *input, int nfc, int k)

    // This needs to match the logic in makeunicodedata.py
    // which constructs the quickcheck data.
    typedef enum {YES = 0, MAYBE = 1, NO = 2} QuickcheckResult;
Maybe `makeunicodedata.py` should output this enum (with better name namespacing) then?
Ah, good idea!
Do you think it'd be cleaner to add that to this PR, or make a small separate one?
Could be done later, I think.
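If the generator did emit it, the output might look something like this (purely hypothetical names and namespacing; `makeunicodedata.py` does not write this today):

    /* Hypothetical block makeunicodedata.py could emit next to the
       quickcheck tables, so the enum and the data can never drift
       apart.  The values must keep matching the generator's encoding. */
    typedef enum {
        _PyUnicode_QUICKCHECK_YES = 0,
        _PyUnicode_QUICKCHECK_MAYBE = 1,
        _PyUnicode_QUICKCHECK_NO = 2,
    } _PyUnicode_QuickcheckResult;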
    @@ -208,6 +208,8 @@ def test_issue29456(self):
         self.assertEqual(self.db.normalize('NFC', u11a7_str_a), u11a7_str_b)
         self.assertEqual(self.db.normalize('NFC', u11c3_str_a), u11c3_str_b)

    +    # For tests of unicodedata.is_normalized / self.db.is_normalized ,
    +    # see test_normalization.py .
ugh, I would support merging test_normalization into this file for clarity.
Yeah; as it is, it actually caused me to think this function had evaded being tested at all -- I'd gone as far as to write up some tests myself before I spotted that other file.
A bit of context on someone's thinking in having a separate file can be found above at line 178:
# The rest can be found in test_normalization.py
# which requires an external file.
I don't see why that means the test code should be in a separate file, though. There's already a try/skip mechanism to deal with the external file being unavailable. (I'm guessing that if I looked in the history I'd find that that mechanism wasn't there when the separate file was first added.)
Happy to send a PR to fold that file in.
Modules/unicodedata.c:

     */
    static QuickcheckResult
    is_normalized_quickcheck(PyObject *self, PyObject *input,
                             int nfc, int k, int yes_only)
These "boolean int" parameters could be actual bool
s.
Oh wow it's the future! 🙂 I guess that's a C99 feature, and one of those we now accept using. Good to know.
I'm not sure I want to convert the existing two parameters within this PR; it seems like that could muddy the diff a bit with changes that are orthogonal to what it's really doing. But I guess I can see how it looks.
Pushed a change to `bool` for the new parameter.

I was a bit surprised to find that I needed to add an `#include <stdbool.h>` line to the file! Perhaps that should be in a central include file, in the same way as `limits.h` and `inttypes.h` are. That's certainly orthogonal to this PR, though.
I made a quick further commit converting `nfc` and `k` to `bool` too (including in the callers, and using `true`/`false` rather than 1/0 for literal values).

I'm ambivalent about adding it here. It expands the diff a bit -- was:

    1 file changed, 51 insertions(+), 24 deletions(-)

and becomes:

    1 file changed, 57 insertions(+), 30 deletions(-)

Slightly inclined to leave it out, but if you'd rather include it I'll do that.
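For concreteness, with all three flags converted the declaration looks roughly like this (a sketch; `QuickcheckResult` is the enum shown earlier):

    #include <stdbool.h>

    static QuickcheckResult
    is_normalized_quickcheck(PyObject *self, PyObject *input,
                             bool nfc, bool k, bool yes_only);

    /* Callers now pass true/false rather than 1/0, e.g.
       is_normalized_quickcheck(self, input, true, false, false). */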
What makes the diffstat larger? Aren't you already touching all callers, since you added a parameter?
I think we want to avoid polluting public headers with a `stdbool.h` include; `bool` is a bit of a name grab in arbitrary mature C codebases. Using `stdbool.h` where it's needed is also consistent with IWYU.
> I think we want to avoid polluting public headers with a `stdbool.h` include; `bool` is a bit of a name grab in arbitrary mature C codebases.

Ah, fair enough. I think that's consistent with putting it in a generic internal-only header, though. Within the tree, the very few existing uses are either stdbool, or (in `Modules/expat/xmltok.c`) a sort of manual stdbool.

> Using `stdbool.h` where it's needed is also consistent with IWYU.

Hmm, true. I guess IWYU doesn't feel like how this codebase is mostly written... but maybe it's best to make it more so.
> What makes the diffstat larger? Aren't you already touching all callers, since you added a parameter?

Here's the extra commit: gnprice@896c970f5

The lines not otherwise touched are where we declare and then fiddle with the `nfc` and `k` locals in `unicodedata_UCD_is_normalized_impl`.
Okay, we can fix up the rest in a sequel.
     * normalized or certainly not, and MAYBE if quickcheck is unable to
     * tell.
     *
     * If `yes_only` is true, then return MAYBE as soon as we determine
`yes_only` is a bizarre name for a parameter that causes the function to return `MAYBE`.
Yeah, I'm not really a fan of this name; I'd be glad for a better one.
The idea is that this flag says the caller doesn't care about NO vs. MAYBE; it only cares whether the answer is YES.
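Concretely, inside the per-character loop the flag acts roughly like this (a sketch of the relevant branch, not the exact diff):

    /* `check` is the current character's Quick_Check value. */
    if (check == NO)
        return NO;
    if (check == MAYBE) {
        if (yes_only)
            return MAYBE;  /* caller treats anything but YES alike: stop early */
        result = MAYBE;    /* standard algorithm: keep scanning for a NO */
    }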
Maybe invert the sense and call it `want_definite`, meaning "process the string until we're sure it's not normalized or we reach the end"?
The algorithm in the standard is the `yes_only = false` (or `want_definite = true`) case, so I think things are simplest to describe if that's the baseline and the other case is described by contrast with it. Could certainly handle that with some added words, though.

Another aspect is that there is an asymmetry between a definite YES and a definite NO. In the `yes_only` / `!want_definite` case, what the caller really most wants is a definite YES... it's just a definite NO that is no more helpful than a MAYBE. Or another angle on that: if the caller really didn't care about getting a definite answer, it could just assume MAYBE and not bother calling the helper 😉 It's in the hope of getting a YES that it makes the call.

More ideas just now: perhaps `maybe_is_no`, to mean "a MAYBE is no worse than a NO"? Or `yes_or_maybe`, for "there are only two distinct answers, YES and MAYBE". That's basically what I was going for with `yes_only`, but probably a less paradoxical way of putting it 🙂
The API also surprised me. I suggest to rename is_normalized_quickcheck() to is_normalized_impl() and add 2 functions: is_normalized() (yes_only=false) and is_normalized_quickcheck() (yes_only=true).
> I suggest to rename is_normalized_quickcheck() to is_normalized_impl() and add 2 functions: is_normalized() (yes_only=false) and is_normalized_quickcheck() (yes_only=true).

Hmm, I like the idea of covering over some of the multiplication of flags with wrapper functions!

I don't like quite these names, though. First, I don't think `is_normalized` is a good name, for either value of the flag, because the function doesn't actually take full responsibility for finding out if the string is normalized -- it can return MAYBE, and then it's up to the caller to go and normalize the string and compare.

What the function does do is implement a quick but partial check to try to see if the string is normalized. Specifically it implements one that's defined in the standard, which the standard calls `quickCheck` in its sample code, and which uses a property in the UCD called `Quick_Check`.

More precisely, the yes_only=false case implements that standard quick-check algorithm; the yes_only=true (or yes_or_maybe=true, etc.) case (which is the version in master) is a variation on it. I think the name "quickcheck" or `is_normalized_quickcheck`, with no further qualifications, sounds like it will mean the standard's algorithm and it's confusing if it means a different algorithm instead.

Here's another possible set of names to use with the same idea:

- rename this function to `is_normalized_quickcheck_impl`
- `is_normalized_quickcheck` is the standard's quickcheck (yes_only/yes_or_maybe=false)
- `is_normalized_quickercheck` is the variant that's more impatient to return and hand responsibility back to the caller (yes_or_maybe=true). Or `is_normalized_quickcheck_quicker`.

Or if those names feel too long: `is_normalized_qc_impl`, `is_normalized_qc`, `is_normalized_qc_quicker`.

How does that sound?
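Either way, the wrappers themselves would be tiny. A hypothetical sketch with the short names (not merged code; `is_normalized_qc_impl` is the proposed rename of the current function):

    /* Hypothetical wrappers over the yes_only flag: */
    static QuickcheckResult
    is_normalized_qc(PyObject *self, PyObject *input, bool nfc, bool k)
    {
        /* The standard's full quick-check: YES, MAYBE, or a definite NO. */
        return is_normalized_qc_impl(self, input, nfc, k,
                                     /* yes_only= */ false);
    }

    static QuickcheckResult
    is_normalized_qc_quicker(PyObject *self, PyObject *input, bool nfc, bool k)
    {
        /* The impatient variant: returns MAYBE as soon as YES is ruled out. */
        return is_normalized_qc_impl(self, input, nfc, k,
                                     /* yes_only= */ true);
    }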
> Or `yes_or_maybe`, for "there are only two distinct answers, YES and MAYBE".

Another idea for the flag name (though its name becomes less important if we use Victor's idea of a pair of wrapper functions): `yes_vs_maybe`.
It doesn't seem like anything is standing out as dramatically better. At least the semantics are clearly documented and easily read from the function.
So, let's go with the current state.
Thanks @gnprice for the PR, and @benjaminp for merging it 🌮🎉.. I'm working now to backport this PR to: 3.8.
GH-15671 is a backport of this pull request to the 3.8 branch.
Thanks for all the reviews and the merge! @benjaminp , I'll go on and send next some of those followups you suggested.

Filed bpo-38043 for that, and sent PRs for two of them.
https://bugs.python.org/issue37966