-
-
Notifications
You must be signed in to change notification settings - Fork 30.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Description of '\w' behavior is vague in re
documentation
#82747
Comments
The documentation for the
It seems from examining the CPython source³ that SRE treats '\w' as meaning any alphanumeric character OR U+005F SPACING UNDERSCORE, which does not match any Unicode class definition I've been able to find. Can anyone provide clarification on what part of Unicode this documentation is referring to? If there is some other definition, the documentation should be more specific about referring to it (and including a link would be preferred). If instead the documentation is incorrect, this language should be changed to describe the true meaning of \w. ¹ https://docs.python.org/3/library/re.html#index-32 |
The definition of \w, historically, has corresponded to the set of characters that can occur in legal variable names in C (alphanumeric ASCII plus underscores, making it equivalent to [a-zA-Z0-9_] for ASCII regex). That's why, on top of the definitely wordy alphabetic characters, and the arguably wordy numerics, it includes the underscore, _. That definition predates Unicode entirely, and Python is just building on it by expanding the definition of "alphanumeric" to encompass all alphanumeric characters in Unicode. We definitely can't remove underscores from the definition without breaking existing code which assumes a common subset of PCRE support (every regex flavor I know of includes underscores in \w). Adding the zero width characters seems of limited benefit (especially in the non-joiner case; if you're trying to pull out words, presumably you don't want to group letters across a non-joining boundary?). Basically, you're parsing "Unicode word characters" as "Unicode's definition of word characters", when it's really meant to mean "All word characters, not just ASCII". You omitted the clarifying remarks from the documentation though, the full description is:
That's about as precise as I think we can make it (because technically, some of the things that count as "word characters" aren't actually part of an "alphabet" in the technical definition). If you think there is a clearer way of expressing it, please suggest a better phrasing, and this can be fixed as a documentation bug. |
Cheers for the additional context. My recommendation would be to change the language to avoid confusion with the consortium's formal specifications. Describing what SRE does should be fine:
I think it'd also be nice for the term "alphanumeric Unicode character" to link to the documentation for |
Duplicate of #69929 |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: