Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid expensive regular expressions for NON_PRINTABLE check #124

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

roehling
Copy link

The NON_PRINTABLE regex is initialized at module import time and takes
significant time to build, especially with the later Unicode versions
and code points beyond 0xFFFF.

Instead, use the unicodedata module to query the character category
and filter for control characters. Apart from avoiding the complex regex
machinery, this is also forward compatible to future Unicode revisions.

@MRigal
Copy link

MRigal commented May 3, 2019

Valuable Pull Request! I guess a simple rebase would fix the failing test on Python 2.6!

The NON_PRINTABLE regex is initialized at module import time and takes
significant time to build, especially with the later Unicode versions
and code points beyond 0xFFFF.

Instead, use the unicodedata module to query the character category
and filter for control characters. Apart from avoiding the complex regex
machinery, this is also forward compatible to future Unicode revisions.
@roehling roehling force-pushed the avoid_expensive_regex branch from 94f6161 to f527a46 Compare May 22, 2019 11:34
@roehling roehling changed the title Avoid expensive regular expressions Avoid expensive regular expressions for NON_PRINTABLE check May 22, 2019
@roehling
Copy link
Author

I rebased as you suggested, and all checks are passing now.

@MRigal
Copy link

MRigal commented May 22, 2019

I have no write rights, but maybe @ingydotnet or @perlpunk could merge that in

@nitzmahone
Copy link
Member

We'll take a look at this one (and/or #301) for the next release. I like this one better in general, but want to do some performance measurement, as well as untangle if the removal of the UCS4 check causes other problems...

For the common case, the loop is about twice as fast now.
@roehling
Copy link
Author

I programmed a benchmark, basically just scanning different text sizes for nonprintable characters.

I ran these on my laptop, Ubuntu 18.04 LTS, 64-bit, Intel i7 6600U @ 2.6 GHz. All times in seconds.

Python 2.7:

Text Length  RegularExpr  Unicodedata
-----------  -----------  -----------
          1     0.048311     0.000020
         10     0.048277     0.000005
        100     0.048278     0.000023
       1000     0.048282     0.000197
      10000     0.048350     0.001777
     100000     0.048936     0.017250
    1000000     0.054718     0.146784
   10000000     0.103229     1.441203

Python 3.6:

Text Length  RegularExpr  Unicodedata
-----------  -----------  -----------
          1     0.005614     0.000010
         10     0.005605     0.000004
        100     0.005606     0.000017
       1000     0.005616     0.000157
      10000     0.005675     0.001591
     100000     0.006265     0.014438
    1000000     0.011326     0.133507
   10000000     0.064768     1.355429

According to these, Python 2 is much slower to build the regular expression. The break-even point, where the overhead of compiling the regular expression is compensated by the faster text scanning is at approx. 400K of text (Python 2) versus approx. 40K of text (Python 3).

For my use cases, the unicodedata approach would be faster. Not sure if this generalizes, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants