Caching between runs for better performance #530
I am still undecided as to if PHPCS itself should be doing things like this or if a wrapper script should be used, like what is done for hooking into a VCS. Of course, it's possible, and probably fairly easy to get going, but I would not introduce it into the current 2.x versions regardless. I'd start with the 3.0 version, which is still in heavy development here: https://github.com/squizlabs/PHP_CodeSniffer/tree/3.0 How it is done is 1 of 2 options: something in the core and some command line values, or a new script like the SVN pre-commit hook or phpcbf. Either way, it's going to be easier/cleaner to do in the 3.0 version because of the refactoring I've done there, but I'm not ready to accept any new features on that branch at the moment. |
When you are ready, let me know, and if I can I'll try to make up a PR. |
We can start talking about it now. The most important bit is deciding on an implementation before starting work. Version 3 has a FileList class that just creates a list of files from a file system location. With some changes, different types of file lists could be provided to the core processing code to generate their file lists from various sources:
Or, 2 and 3 could be implemented using some sort of filter class, which is passed to the main FileList. I think I prefer this to extending a base class, but I haven't thought about it long enough to be 100% sure. Interested to hear your thoughts.
One more thing to add. I think it would be good if people could specify their own filters (assuming we go that way) on the command line just like they can do for reports. So a command like |
I think the filters idea is a good one. I don't think it is a complete solution though, because there are probably other things which could be cached as well. For example, it might be beneficial to cache the parsed rulesets. |
One possible drawback to filters versus child classes, is that the FileList class would still be traversing the whole directory unnecessarily (e.g., when using the git filter). Or were you thinking it would be implemented in a way that would avoid that? |
Maybe the |
I was thinking about file modification times more than anything else, which really needs all the normal recursive directory iterator stuff, with an added check (once a file path is found) to see if it needs rechecking. But finding files is very fast, so I don't see why the same process won't work for Git. You could use the output from a git command and parse the paths in the output to support specified files, directories and the local flag. But it might just be easier to let the recursive scan happen and then match the found file paths to a list of ones found by git. You don't always want every modified or uncommitted file. Even though there is a cache of the ones that are modified, you may want to limit that by file type, extension, path etc. I think it would be much easier if FileList was in charge of finding candidate files based on that logic, and a filter kicked in to limit the candidates based on the filter logic. It could do this after the candidates have all been found, or during the recursive scan. After is easier, but during might be more efficient. If, for example, a directory does not contain any uncommitted files, you could have the FileList skip that dir and continue with the rest. A filter checking the modification times (or hashes) of files couldn't do that, but that's not a big deal.
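To make the "filter during the scan" idea concrete, here is a minimal sketch in Python. It is not PHPCS code: `find_files` and its `wanted` parameter (a set of absolute paths, for example parsed from a VCS command) are hypothetical names used only for illustration. The scan finds candidates by extension, and the optional filter both drops unwanted files and prunes directories that contain none of the wanted paths.

```python
import os


def find_files(root, extensions, wanted=None):
    """Yield candidate files under root, optionally limited to `wanted`.

    extensions: tuple of suffixes such as (".php",).
    wanted: optional set of absolute paths (hypothetical filter input).
    """
    for directory, subdirs, files in os.walk(root):
        if wanted is not None:
            # Prune during the scan: skip any directory that holds no wanted file.
            subdirs[:] = [
                name for name in subdirs
                if any(
                    path.startswith(os.path.join(directory, name) + os.sep)
                    for path in wanted
                )
            ]
        for name in sorted(files):
            path = os.path.join(directory, name)
            if not name.endswith(extensions):
                continue  # candidate logic: file type / extension
            if wanted is not None and path not in wanted:
                continue  # filter logic: limit the candidates
            yield path
```

Filtering after the scan would simply be a list comprehension over the full result; pruning inside the walk is the variant that can skip whole directories.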
One thing I actually didn't address in my previous comments was the initial request 😄 I got carried away talking about other related feature requests that I've received. The filters are good for limiting the things that need to be checked, but we do need to inject the error reports for files that are being skipped from checking (due to cache) but the user still requested to have a report generated for. In this case, a filter might still be useful as it can block the checking of a file and return the report instead, but I think that would take a fair bit of refactoring in the code, and might make things worse. What would be better is to just implement this functionality in the LocalFile class, which is loaded with a file path and asked to process (tokenize and check) itself by the Runner. Instead of always choosing to process, it could instead load itself with a set of cached error messages if the file has not changed on disk. The DummyFile class (used mainly for STDIN) would not include this check as it doesn't have a file system location. So a filter can kick in to limit the files to process, and the file itself could include a hash or modification time check to determine if it needs to be processed again. Two features. It feels like each checked file should have its own cache so that filters and command line arguments don't get in the way of each other, but that would create a lot of files. Instead, PHPCS might need a Cache handler somewhere where files can register caches with a particular key. Same result, but a single cache file instead of hundreds. The cache handler could be responsible for maintaining the overall state of the run (standard and options used) and so could keep multiple caches if needed. |
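A rough sketch of the two pieces described above, written in Python rather than as actual PHPCS code: a single cache file that many files register results in, and a per-file hash check that decides whether to reprocess or reuse cached messages. The `Cache` class, `process_file`, and the `check` callback are all hypothetical names, and SHA-1 over the file contents is an assumed choice of hash.

```python
import hashlib
import json
import os


class Cache:
    """One JSON cache file holding the results for many checked files."""

    def __init__(self, path):
        self.path = path
        self.entries = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as handle:
                self.entries = json.load(handle)

    def get(self, file_path, content_hash):
        """Return cached messages, or None if the file is new or has changed."""
        entry = self.entries.get(file_path)
        if entry is not None and entry["hash"] == content_hash:
            return entry["messages"]
        return None

    def set(self, file_path, content_hash, messages):
        self.entries[file_path] = {"hash": content_hash, "messages": messages}

    def save(self):
        with open(self.path, "w", encoding="utf-8") as handle:
            json.dump(self.entries, handle)


def hash_file(file_path):
    with open(file_path, "rb") as handle:
        return hashlib.sha1(handle.read()).hexdigest()


def process_file(file_path, cache, check):
    """Return (messages, from_cache); `check` stands in for tokenize-and-check."""
    content_hash = hash_file(file_path)
    cached = cache.get(file_path, content_hash)
    if cached is not None:
        return cached, True
    messages = check(file_path)
    cache.set(file_path, content_hash, messages)
    return messages, False
```

A file read from STDIN has no path to key on, which matches the point above that such input would skip this check.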
I've put together a quick implementation of a hash-based cache based on my comment above. It's missing a lot of features, but I wanted to see if the implementation would work and what the results would be like. Obviously, a pretty big improvement in performance, for really minimal code. This is the first run, with no cache file:
This is the second run, with the cache file in place:
Running over the whole PHPCS dir is a difference between 22.13secs and 375ms, so this is good. But the cache file itself (which I'm json pretty printing at the moment) is 7.2M, which isn't that great. If I turn off pretty printing, it comes down to 1.5M, but is now unreadable, so a different format could even be chosen if it ends up being faster. Still, I like JSON. I'll commit what I have after a bit more cleanup and we can take things from there. |
I've pushed some commits for this. The main one is this e5cc0ab But I forgot unit testing, so I committed these 2 fixes as well: f558de5 1471a78 If you use the If you run over part of the code base in one run, and another part during another run, but use the same config, the same cache file will be used (it will just get bigger). Similarly, if you run over your entire code base and cache everything, you can then do another run limiting the files to check and the cache file will still be used. I haven't committed anything to do with filtering of the file list. This is still pretty dirty, so I'd appreciate any testing that anyone can do, and ideas for how to make things better. One of the decisions I had to make was where to put the cache files. I decided on the current working directory instead of the temp dir for 2 reasons (1 good reason, 1 stupid reason):
Number 1 could be worked around by including the current dir in the file hash, but you lose number 2 by doing that. I'm still not sure what the best place for these files is. |
Agreed, but
Yes, they can, but since code base changes all the time the developers need to run Making cache directory configurable (e.g. |
Also the Any of these changes should invalidate cache:
What I believe would be correct cache key detection is:
+1 for configurable cache dir |
I'm surprised I didn't include that option in my comment because I had it in my notes. Yes, this is also exactly what I was thinking, and for the exact reasons you've listed. The real question though is if the file should be in the system temp dir instead. So my plan was to make the system temp dir the default file location but allow it to be changed using a CLI arg or config var. Sound ok?
I can't detect that the PHP code inside a sniff file has changed. But I can hash the parsed ruleset object and include that in the main cache hash in case you are tweaking the ruleset.xml file. I already include all relevant CLI and config arguments in the cache hash, and do the hashing just before the run is about to commence, so I think the only change required is to look at the parsed ruleset. If I ever add the ability to change the ruleset used in each directory, life might get hard for the caching system. But I guess I can fix that when it happens. |
It could create a problem on a developer machine because errors from all projects would end up in the same file (by default), and cache reading time for all projects could increase if a single large project on that machine is cached. But in my particular case I'm passing phpcs an absolute path to scan, and therefore I'll end up with different caches per project, all stored in the temp dir, which is very good.
Remembering the file size of the sniff would be enough (faster than doing a CRC on it), since any significant change to the code would result in a file size change.
I think if the default location is the temp dir, then part of the project or analyzed path(s) should be somehow present in the name too
If only we could easily detect where the project root is. For example, in the above cases the project folder obviously (to a human) is
phpcs --standard=/path/to/CustomStandard /Users/alex/Projects/project_a/the_file.php
phpcs --standard=/path/to/CustomStandard /Users/alex/Projects/project_a/sub_folder
But a computer can't really guess that. If only we could ensure some kind of marker (e.g. By the way is the |
Knowing the project root is the major problem. You can't just include the path you are checking in the hash or filename because then checking a sub-dir of the project will force a completely new cache to be used, even though the files themselves have already been checked. But I really don't know how to determine the project root automatically. Using the phpcs.xml file is one possible option. If you include that in the root of your project, PHPCS will find it when no standard is given and use it like a ruleset (it sets project defaults and works better in 3.0). The fact that it exists at a particular location means that it is sitting in the project root, or in a sub-project under the main project root (presumably with different rules). This would force the use of a phpcs.xml file for the best possible caching, but we'd still need sensible defaults.
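The "phpcs.xml marks the project root" idea could look roughly like the following Python sketch. The function name `find_project_root` is hypothetical and this is not how PHPCS itself is written; it only illustrates walking up from the checked path until a marker file is found, and returning nothing so that a caller can fall back to a default such as the temp directory.

```python
import os


def find_project_root(start_path, marker="phpcs.xml"):
    """Walk upward from start_path; return the first directory holding marker."""
    current = os.path.abspath(start_path)
    if not os.path.isdir(current):
        # A file was given, so begin the search in its directory.
        current = os.path.dirname(current)
    while True:
        if os.path.isfile(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:
            return None  # reached the filesystem root without finding a marker
        current = parent
```

With this approach, checking a single file and checking a sub-folder of the same project both resolve to the same root, so they can share one cache.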
I don't know what file you are talking about. |
The file that can be used as a per-project PHP_CodeSniffer.conf file. I guess it's |
Yes, I'd like to have that option.
I think that would be a good default, falling back to the temp directory.
I think this would be useful.
Then what happens if you add a rule to the ruleset? I'm guessing that PHPCS will run all of the rules over the files. Would it be possible to detect which rules have been added/changed/removed and only run them? It might also be nice to have a command that will clean the cache, deleting all cache files that don't match the current configuration. But I'm not sure if that would be possible the way it's working currently. It seems to me like right now it will be subject to lots of cache bloat over time as the ruleset changes.
Yes, this way we can delete the cache without even knowing where it's located. |
Yep. It would have to do them again.
That would require a completely different setup for the run and some sort of merge code for the resulting checks. The same would be true if you ran PHPCS with a single sniff after running an entire standard. All the errors are there, so the file just needs to filter them based on the sniffs you have asked to filter with. It's possible, but much more complex code. I think we need to get the basics right first, but can then come back to this. We've also spoken a lot about what happens when rulesets are changing, and sniff PHP code is changing, but this is not what the vast majority of developers are doing. They are running PHPCS over their changing codebase and not over a changing standard. The standard will get updated from time to time, but I think it is really important to not design a system that is painful and/or slow because we want to use caching while we are also tweaking standards. A command to wipe the cache is a given. If a developer updates the coding standard (maybe they pull a new version) or if they update PHPCS itself, they will need to clear the cache. It would be nice if they didn't have to remember to do that, but it might be necessary. By looking at everything that gets loaded during the run (the autoloader keeps track of this) then we might be able to check if any piece of code has changed. I'll give it a try. |
…nerate a more accurate hash (ref #530) If the PHPCS core code updates (as with an upgrade) or if the ruleset changes which sniffs are loaded, or if the PHP code inside a sniff file changes (as with a standard update) then a new hash will be generated.
Apparently I can, and have committed that change as well. Now if any of the PHPCS core code changes, or if the loaded sniffs change, or if the code in the loaded sniffs changes, the cache is invalidated.
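As an illustration of this kind of invalidation, here is a small Python sketch (not the actual commit) that derives one cache key from the run configuration plus the contents of every code file loaded for the run. `cache_key`, `config`, and `loaded_files` are hypothetical names; in the thread, the list of loaded files comes from the autoloader.

```python
import hashlib
import json


def cache_key(config, loaded_files):
    """Hash the relevant settings and the contents of all loaded code files."""
    hasher = hashlib.sha1()
    # Sorting keys keeps the key stable regardless of dictionary order.
    hasher.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    for path in sorted(loaded_files):
        hasher.update(path.encode("utf-8"))
        with open(path, "rb") as handle:
            # Any edit to a core file or sniff file changes the final key.
            hasher.update(hashlib.sha1(handle.read()).digest())
    return hasher.hexdigest()
```

Adding or removing a sniff changes the file list, and editing a sniff changes its content hash, so either produces a new key and the old cache is no longer matched.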
That's great news. |
Thank you for all your hard work on this @gsherwood! |
Thanks for the idea. Not done yet though. The things to address still are:
Possible solutions:
I think that this combination of options would be good. I do have one concern, and that is, I sometimes have the phpcs.xml file symlinked from a different directory. In this case, I'd want the cache file to be stored in the directory the symlink is in, not the directory that it is being symlinked from. But I guess if it didn't work that way I could easily use the CLI option to do what I want. |
…s done at the end of a file run so the cache can be used even with filter (ref #530) The cache key is now generated by looking for loaded external and internal sniff files, then looking for all important core files, and hashing the result. Use -vv to see what files are being used for the hash, and other cache key values. The cache key no longer includes severity or sniffs (removed previously) as files will record all errors and warnings during a cache run no matter what the filter settings are. Local files then replay the errors and warnings found with the filters on, removing the ones that should not be shown. Subsequent runs can then use the cache and replay the errors manually instead of tokenizing and checking the file. The result is that a filtered run will now cache everything needed for future unfiltered runs. It makes the first run a bit slower, but the cache much more useful. To help with all this, error messages must now have codes. You can no longer leave them blank, which makes them impossible to override or filter in rulesets anyway.
I've committed a change that solves issues 2, 3, 4 and 5 above. The last thing I need to sort out is cache file storage and clearing. More info about what I ended up doing is in the commit message.
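The record-everything-then-replay approach can be sketched as follows, in Python and under stated assumptions: messages are plain dictionaries with a `code` and a `severity`, sniff names are prefixes of message codes, and `replay` is a hypothetical helper rather than a PHPCS function. The cache stores the unfiltered list; filters are applied only when the messages are shown.

```python
def replay(messages, min_severity=0, sniffs=None):
    """Return the cached messages that pass the current filter settings.

    sniffs: optional list of sniff names; a message matches when its code
    equals a name or starts with that name followed by a dot.
    """
    shown = []
    for message in messages:
        if message["severity"] < min_severity:
            continue
        if sniffs is not None and not any(
            message["code"] == name or message["code"].startswith(name + ".")
            for name in sniffs
        ):
            continue
        shown.append(message)
    return shown
```

This also shows why every message needs a code: a message with a blank code could never be matched by a sniff filter during replay.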
The paths specified for checking are used to locate a common path, which is used to generate the cache file name. When checking a new set of code, if a cache file already exists in a common location, that cache file will be reused and added to instead of a new one being created. So it is possible for a cache file to store the results of multiple projects if you first run over all the projects at once. Running over each individually will still end up finding that first cache file.
Cache files are now stored in the temp dir. See commit above for info. I still need to add a new option to allow a directory to be specified instead of the system temp dir. If a directory is specified, I won't bother checking for common paths, or using the common path SHA1 in the cache file name, which makes things a little easier.
I think I'm going to leave out the option of setting your own cache directory or cache file location until after this feature gets used a bit. Making it more complex is probably not the right thing to do at this stage. |
I changed my mind on the cache file bit. You can now pass This may become a non-issue if support is added for setting the cache file in a ruleset.xml file using a path relative to the ruleset itself. |
First let me thank you for this great tool. 👍
I've been using this on my PHP projects, and I've found that it can take a while to sniff the code, especially on larger projects with complex configurations. Performance will naturally be determined largely by how well the sniffs used are written. However, I think that performance could be increased by caching the hash signatures of the files being sniffed. Then only those files which have changed since the last sniff was conducted would need to be sniffed (there are some caveats which I'll get to in a moment). This wouldn't improve the performance of the initial sniffing (and might even degrade it slightly), but would drastically improve performance for later sniffings.
As I noted above, there are some caveats:
There are probably other things I haven't thought of, maybe regarding interactive mode, reports, or automatic fixing, all of which I am unfamiliar with. And there would probably need to be an easy way for the user to bypass the cache as needed.
There are probably also other things that could be cached between runs on a project as well.
Exactly how the cache is saved is up to you. I was thinking of a .phpcs-cache in the root of the project being sniffed that would contain the cache represented as a JSON object.
If this is something that you think could be done, I'd be happy to work up a PR if you'll give me a little guidance on how you'd like this implemented.