Implemented RecursiveFilterIterator to improve exclude file performance #620
Conversation
Note that the tests pass, but the test suite doesn't pass on master.
I really don't want to refactor how the directory is scanned in the 2.x version because it has been stable for too long. You might want to look at the 3.x branch, where I've already done some refactoring around how directories are searched. For 2.x, it would be best if you could just tell me which bit of your code made the most difference, or re-submit a PR with the smallest changeset possible to achieve the performance increase.

On a more general note, if you have 1800 exclude patterns, then you are clearly in need of a completely different feature to help you manage things. If you give me more of an idea of what you are doing, maybe we can design a feature together that makes life easier for you.

Also, the reason the build is failing is that you have coding standard errors in the PR. It's not a big deal; I would need to add comments to your new class if I merged this PR anyway.
Sure thing, I'll look into the 3.0 dev branch.

So let me explain the use case here. We have a codebase that currently uses a subset of the PSR1/2 standard, with a bunch of exclusions and custom sniffs (snake case instead of camel case, for example). We're talking about moving to a full PSR2 standard without the exclusions. Obviously, things would blow up if we just switched, so we want a way to run one ruleset across the codebase as it currently stands, and run the new ruleset against new files only. In order to do that, we need to produce an index of all the current files, use that for one ruleset, and exclude it from the other ruleset. Currently, the only way to do this is to include those files in a ruleset using … The optimizations to the …

The other issue with this approach is that you can't process the entire codebase in one pass; you have to run phpcs with two different commands, one using the legacy ruleset and one using the new ruleset.

Now, if we're talking about adding features for 3.0, a really useful feature would be the ability to have two separate standards and run each over a different set of files. An alternative approach would be to allow the file to contain a docblock directive that would inform phpcs which standard/ruleset to use. Thoughts?
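To make that setup concrete, here is a rough sketch of the kind of ruleset being described. The ruleset name and file paths are invented for illustration; only the `<rule>` and `<exclude-pattern>` elements are standard PHPCS ruleset syntax:

```xml
<!-- new-ruleset.xml: full PSR2, meant for new files only.            -->
<!-- Every pre-existing file has to be listed as an exclude-pattern,  -->
<!-- which is what balloons to ~1800 entries.                         -->
<ruleset name="NewStandard">
    <rule ref="PSR2"/>
    <exclude-pattern>src/Legacy/UserModel\.php</exclude-pattern>
    <exclude-pattern>src/Legacy/OrderModel\.php</exclude-pattern>
    <!-- ...one entry per existing file... -->
</ruleset>
```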
Additionally, the new class optimizes filter iteration performance: if a file is to be excluded, the iterator never iterates over it, and if a directory is to be excluded, ALL of its subdirectories are skipped as well. The current implementation iterates over every subdirectory and performs a regex match check for every file in that subdirectory, which is pointless if the directory is being ignored.
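A minimal sketch of that pruning behaviour (not the PR's actual class; the project path and exclude list are invented):

```php
<?php
// Rejecting a directory in accept() prunes its entire subtree, so
// RecursiveIteratorIterator never descends into it at all.
class ExcludeFilter extends RecursiveFilterIterator
{
    private static $excludes = array(
        '/project/vendor',
        '/project/src/Legacy',
    );

    public function accept()
    {
        return in_array($this->current()->getPathname(), self::$excludes) === false;
    }
}

$dirs  = new RecursiveDirectoryIterator('/project', FilesystemIterator::SKIP_DOTS);
$files = new RecursiveIteratorIterator(new ExcludeFilter($dirs));

foreach ($files as $file) {
    // Only files outside the excluded directories ever reach this loop.
    echo $file->getPathname(), PHP_EOL;
}
```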
Thanks for describing your use case. It makes sense why you've optimised things the way you have, but it is only optimising for a very specific case, and I'm sure we can figure out a better way of doing things for you.

One obvious suggestion would be to tell PHPCS which files to process instead of which files to exclude. If you're already generating an excluded file list, outputting the opposite seems like it might be a simple thing to do. Writing a wrapper around PHPCS is another option, so you can inject the file list (presumably stored next to the rulesets) instead of including the files in a ruleset.

Ultimately, you're probably going to want some sort of Filter class that you can write and supply to PHPCS. I've described that concept briefly in another conversation about result caching in 3.0 here: #530. The idea is that once PHPCS has a list of files to process, a series of built-in filters can be applied (things like regex exclude patterns, modification date before/after x, git status) and/or custom filters can be applied for very custom filtering rules.

If this sounds like it might be something that could work for you, let's use this issue to figure something out and we can get some code going. The caching stuff on that other issue is pretty much done, so that's a lot of refactoring that's already been done.
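A hypothetical sketch of that "series of filters" idea; none of these class or function names come from PHPCS itself:

```php
<?php
// Each filter answers one question about a path; the idea is that
// built-in and custom filters get applied in series to the file list.
interface FileFilter
{
    public function accept($path);
}

// Built-in style filter: regex exclude patterns.
class RegexExcludeFilter implements FileFilter
{
    private $patterns;

    public function __construct(array $patterns)
    {
        $this->patterns = $patterns;
    }

    public function accept($path)
    {
        foreach ($this->patterns as $pattern) {
            if (preg_match($pattern, $path) === 1) {
                return false;
            }
        }
        return true;
    }
}

// Built-in style filter: modification date after x.
class ModifiedAfterFilter implements FileFilter
{
    private $timestamp;

    public function __construct($timestamp)
    {
        $this->timestamp = $timestamp;
    }

    public function accept($path)
    {
        return filemtime($path) > $this->timestamp;
    }
}

// A file survives only if every filter in the series accepts it.
function applyFilters(array $files, array $filters)
{
    return array_filter($files, function ($path) use ($filters) {
        foreach ($filters as $filter) {
            if ($filter->accept($path) === false) {
                return false;
            }
        }
        return true;
    });
}
```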
The only reason I'd shy away from producing a file list is that once we switch standards, the excludes will stay fixed; all new files will then conform to the new standard. I don't want scripts that we have to run every time a new file is added; that just won't work for our org. We basically want an include list for the old ruleset and an exclude list for the new ruleset.

The current implementation is very wasteful when doing non-pattern-based matches. Additionally, the current method tests all files in a subdirectory against a pattern match even if the directory is excluded. The filter ensures that files in an excluded directory aren't even processed. Much more efficient.
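A small sketch of the cost difference being described (paths are invented; PHPCS's real pattern handling may differ in delimiters and modifiers):

```php
<?php
// Pattern-based: every file is tested against every pattern, so a
// full scan costs roughly (files x patterns) preg_match() calls.
function isExcludedByPattern($path, array $patterns)
{
    foreach ($patterns as $pattern) {
        if (preg_match('`'.$pattern.'`i', $path) === 1) {
            return true;
        }
    }
    return false;
}

// Exact-match: flip the exclude list into a hash once, then each
// file costs a single O(1) isset() lookup.
$excludes = array('/project/src/Legacy/Foo.php', '/project/src/Legacy/Bar.php');
$index    = array_flip($excludes);

var_dump(isset($index['/project/src/Legacy/Foo.php'])); // bool(true)
var_dump(isset($index['/project/src/NewFeature.php'])); // bool(false)
```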
Any update on this? |
Finding you a good solution is still on my todo list, but I don't have time at the moment.
@gsherwood Any progress on this? |
None. Still on my todo list though. |
Related commit: 4315292 |
@gsherwood Looking sharp. The git diff one is good; we currently use this via a custom script that comments back on GitHub! The reason we wanted this all in the first place is to apply different sniffs to different files. For example, we want to apply one sniff to new files and a different one to old files, so we made an index of all current files and exclude those from the new sniff, but we can't do the inverse.
@oligriffiths Sorry, I didn't have time to explain the change yesterday and left that comment as a note for myself.

I am building that filtering feature with the ability to have your own custom filter classes, like you can with reports. One of the options we discussed in this PR is for you to have a custom filter so you can do whatever you need to do. For your specific case, you'll probably not make use of any PHPCS ignore patterns and instead use a filter (one for each of your standards) where you can do exact match checking on the file paths, or create something completely custom that doesn't even extend the PHPCS base Filter class. That keeps PHPCS filtering pretty basic while letting you do something more efficient for your specific case.

But I haven't done any dev on custom filters yet. Shouldn't take too long though. (Edit: it's done now)
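For illustration, a sketch of the kind of custom filter being described, doing exact-match checks against a pre-built index. The class name and index file are made up, and the real 3.0 filter API may differ; this only shows the shape of the idea, wrapping a RecursiveDirectoryIterator as in the earlier sketch:

```php
<?php
// Accepts only files that are NOT in the legacy index, i.e. the set
// the new standard should run over. Flipping the final check gives
// the companion filter for the old standard.
class NewFilesFilter extends RecursiveFilterIterator
{
    private static $index = null;

    public function accept()
    {
        if (self::$index === null) {
            // One file of exact paths, flipped once for O(1) lookups.
            $paths       = file('/project/legacy-files.txt', FILE_IGNORE_NEW_LINES);
            self::$index = array_flip($paths);
        }

        // Always descend into directories; only filter the files.
        if ($this->current()->isDir() === true) {
            return true;
        }

        return isset(self::$index[$this->current()->getPathname()]) === false;
    }
}
```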
Another related commit: f90e08c |
Awesome. Sounds good. I like the direction.
Closing this in favour of custom filter classes in 3.0. 3.0.0a1 is now released, so it can be tested. |
The current exclude mechanism is very slow for a large number of excludes.
For each file to test, the entire excludes array is iterated over and a regex is performed.
For our codebase, this takes over a minute for a single test (we have 1800 excludes).
The reason we have so many excludes is that we want to switch from one coding standard to another, so we include all the current files in one standard and exclude them in another.
This implementation uses a RecursiveFilterIterator and an improved exclude feature to speed up the process. It now takes just over 1 second instead of 1 minute to run our test suite.