Avoid expensive regular expressions for NON_PRINTABLE check #124
Valuable Pull Request! I guess a simple rebase would fix the failing test on Python 2.6!
The NON_PRINTABLE regex is initialized at module import time and takes significant time to build, especially with later Unicode versions and code points beyond 0xFFFF. Instead, use the unicodedata module to query the character category and filter for control characters. Apart from avoiding the complex regex machinery, this is also forward compatible with future Unicode revisions.
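To make the idea concrete, here is a minimal sketch of a category-based check. It is not the code from this pull request: the helper name `find_non_printable` and the set of tolerated control characters (tab, LF, CR, NEL) are assumptions for illustration, and a check on category `Cc` alone may not match the original regex character for character.

```python
import unicodedata

# Assumption: these control characters are still allowed in input.
# The exact set used by the pull request may differ.
ALLOWED_CONTROLS = frozenset("\t\n\r\x85")


def find_non_printable(text):
    """Return (index, char) for the first disallowed control character.

    Returns None when no such character is found. No regex is compiled,
    so nothing expensive happens at import time.
    """
    for index, char in enumerate(text):
        # "Cc" is the Unicode general category for control characters.
        if unicodedata.category(char) == "Cc" and char not in ALLOWED_CONTROLS:
            return index, char
    return None
```

Because the category comes from the interpreter's Unicode database, the check follows whatever Unicode version the running Python ships with.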
Force-pushed from 94f6161 to f527a46.
I rebased as you suggested, and all checks are passing now.
I don't have write access, but maybe @ingydotnet or @perlpunk could merge this.
We'll take a look at this one (and/or #301) for the next release. I like this one better in general, but want to do some performance measurement, and also work out whether removing the UCS4 check causes other problems...
For the common case, the loop is about twice as fast now.
I wrote a benchmark that simply scans texts of different sizes for non-printable characters. I ran it on my laptop (Ubuntu 18.04 LTS, 64-bit, Intel i7-6600U @ 2.6 GHz). All times are in seconds. Python 2.7:
Python 3.6:
According to these results, Python 2 is much slower at building the regular expression. The break-even point, where the overhead of compiling the regular expression is offset by the faster text scanning, is at approx. 400K of text (Python 2) versus approx. 40K of text (Python 3). For my use cases, the unicodedata approach would be faster. I'm not sure whether this generalizes, though.
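The benchmark script is not part of this thread. A rough sketch of how such a measurement could be set up on Python 3 is shown below. Everything in it is an assumption for illustration: the `PATTERN` string stands in for the NON_PRINTABLE regex, the function names are hypothetical, and the sample text is plain ASCII, so timings will not reproduce the numbers discussed above.

```python
import re
import time
import timeit
import unicodedata

# Assumed stand-in for the NON_PRINTABLE character class (illustration only).
PATTERN = "[^\t\n\r\x20-\x7e\x85\xa0-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"


def time_compile():
    """Time one real compilation of PATTERN, bypassing the re cache."""
    re.purge()  # clear cached patterns so compile does actual work
    start = time.perf_counter()
    re.compile(PATTERN)
    return time.perf_counter() - start


def scan_regex(text, regex):
    """Return the regex match for the first non-printable character, if any."""
    return regex.search(text)


def scan_unicodedata(text):
    """Return the index of the first disallowed control character, or None."""
    for index, char in enumerate(text):
        if unicodedata.category(char) == "Cc" and char not in "\t\n\r\x85":
            return index
    return None


def benchmark(sizes, repeat=3):
    """Return a list of (size, regex_seconds, unicodedata_seconds)."""
    regex = re.compile(PATTERN)
    results = []
    for size in sizes:
        # Clean text forces both scanners to walk the entire input.
        text = ("printable text " * (size // 15 + 1))[:size]
        t_regex = min(timeit.repeat(lambda: scan_regex(text, regex),
                                    number=1, repeat=repeat))
        t_ucd = min(timeit.repeat(lambda: scan_unicodedata(text),
                                  number=1, repeat=repeat))
        results.append((size, t_regex, t_ucd))
    return results


if __name__ == "__main__":
    print("compile: %.6f s" % time_compile())
    for size, t_regex, t_ucd in benchmark([1000, 40000, 400000]):
        print("%8d chars  regex %.6f s  unicodedata %.6f s"
              % (size, t_regex, t_ucd))
```

The break-even size is the text length at which the one-off compile time equals the accumulated difference between the two scan times.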