
Avoid expensive regular expressions for NON_PRINTABLE check #124

Open · wants to merge 2 commits into base: main

Conversation

roehling

The NON_PRINTABLE regex is initialized at module import time and takes significant time to build, especially with recent Unicode versions and code points beyond 0xFFFF.

Instead, use the unicodedata module to query each character's category and filter for control characters. Apart from avoiding the complex regex machinery, this is also forward compatible with future Unicode revisions.
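
For illustration, here is a minimal sketch of that idea (not the exact patch from this PR; the helper name and the exact category set are illustrative):

```python
import unicodedata

# YAML treats TAB, LF, CR, and NEL as printable even though they are
# control characters.
_ALLOWED_CONTROL = '\x09\x0A\x0D\x85'

def has_non_printable(data):
    """Return True if `data` contains a character YAML deems non-printable.

    Instead of building a large character-class regex at import time,
    ask unicodedata for each character's category and reject control
    characters (Cc), surrogates (Cs), and the noncharacters
    U+FFFE/U+FFFF.
    """
    for ch in data:
        if ch in _ALLOWED_CONTROL:
            continue
        if unicodedata.category(ch) in ('Cc', 'Cs') or ch in '\uFFFE\uFFFF':
            return True
    return False
```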


MRigal commented May 3, 2019

Valuable pull request! I guess a simple rebase would fix the failing test on Python 2.6!

roehling changed the title from "Avoid expensive regular expressions" to "Avoid expensive regular expressions for NON_PRINTABLE check" on May 22, 2019
roehling (Author)

I rebased as you suggested, and all checks are passing now.


MRigal commented May 22, 2019

I don't have write access, but maybe @ingydotnet or @perlpunk could merge this in.

nitzmahone (Member)

We'll take a look at this one (and/or #301) for the next release. I like this one better in general, but I want to do some performance measurement, as well as untangle whether the removal of the UCS4 check causes other problems...

Commit: "For the common case, the loop is about twice as fast now."
roehling (Author)

I wrote a benchmark that simply scans texts of various sizes for non-printable characters.
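
The script itself was not posted here; the following Python 3 sketch shows what such a comparison could look like. The character class follows PyYAML's NON_PRINTABLE pattern, while the function names and the re.purge() call (to make each run pay the compile cost, as a fresh module import would) are my own.

```python
import re
import time
import unicodedata

# Character class along the lines of PyYAML's NON_PRINTABLE regex.
PATTERN = ('[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF'
           '\uE000-\uFFFD\U00010000-\U0010FFFF]')

def scan_regex(text):
    # Compile inside the measurement: PyYAML pays this cost at module
    # import time, and it dominates for short inputs.
    return re.compile(PATTERN).search(text) is not None

def scan_unicodedata(text):
    allowed = '\x09\x0A\x0D\x85'
    return any(ch not in allowed and unicodedata.category(ch) in ('Cc', 'Cs')
               for ch in text)

print('Text Length  RegularExpr  Unicodedata')
for size in (1, 10, 100, 1000, 10000, 100000, 1000000, 10000000):
    text = 'x' * size            # common case: all printable ASCII
    times = []
    for fn in (scan_regex, scan_unicodedata):
        re.purge()               # defeat re's compile cache between runs
        start = time.perf_counter()
        fn(text)
        times.append(time.perf_counter() - start)
    print('%11d  %11.6f  %11.6f' % (size, times[0], times[1]))
```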

I ran these on my laptop (Ubuntu 18.04 LTS, 64-bit, Intel i7-6600U @ 2.6 GHz). All times are in seconds.

Python 2.7:

Text Length  RegularExpr  Unicodedata
-----------  -----------  -----------
          1     0.048311     0.000020
         10     0.048277     0.000005
        100     0.048278     0.000023
       1000     0.048282     0.000197
      10000     0.048350     0.001777
     100000     0.048936     0.017250
    1000000     0.054718     0.146784
   10000000     0.103229     1.441203

Python 3.6:

Text Length  RegularExpr  Unicodedata
-----------  -----------  -----------
          1     0.005614     0.000010
         10     0.005605     0.000004
        100     0.005606     0.000017
       1000     0.005616     0.000157
      10000     0.005675     0.001591
     100000     0.006265     0.014438
    1000000     0.011326     0.133507
   10000000     0.064768     1.355429

According to these numbers, Python 2 is much slower to build the regular expression. The break-even point, where the overhead of compiling the regular expression is offset by its faster text scanning, is at approximately 400K characters of text on Python 2 versus approximately 40K on Python 3.
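
As a rough sanity check on those break-even figures (my arithmetic from the Python 2 table above, not part of the original comment):

```python
# Fixed cost of compiling the regex (the flat ~0.048 s rows), and the
# marginal per-character scan costs, estimated from the 10M-char rows.
compile_overhead = 0.048311
regex_per_char = (0.103229 - 0.048311) / 1e7
unicodedata_per_char = 1.441203 / 1e7

# The regex pays for its compilation once the text is long enough that
# its faster scanning saves more than the ~0.048 s compile cost.
break_even = compile_overhead / (unicodedata_per_char - regex_per_char)
print(int(break_even))   # ~349000 characters, consistent with "approx. 400K"
```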

For my use cases, the unicodedata approach would be faster. Not sure if this generalizes, though.
