-
Notifications
You must be signed in to change notification settings - Fork 318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve matching performance. #293
Conversation
Can't really review this on my phone but looks pretty epic. The trie makes sense: trade a bit of overhead for its construction in exchange for cheaper search space pruning during matching. Great stuff. Thanks for contributing it. |
@kevincox: would you be able to rebase this against "master"? |
If you are interested in merging I gladly will. However it is Dependant on the C Scanner Infrastructure so that will have to be merged first. If that sounds good let me know. I should have some free time this weekend. |
Yeah, definitely interested. I just want to confirm your benchmark results on top of current master before going too far. |
Well I'll try to start porting to current master. Also I realized the reason I wasn't seeing much parallel speedup was because the benchmarks had no limit. Adding a limit the multithreading is quite effective. |
This scanner allows low overhead reading of files. It can currently read paths from an array of strings (good for ruby-based scanners) and a terminated string (great for command executor scanners). This also converts the find scanner and the git scanner to use the string based infrastructure.
This option provides the user with ultimate flexibility. They have the choice to provide any shell command, script or pipeline in order to generate a file list. To give the command more context in the future variables can be set before calling the user command in the shell to pass more information to the external command. This allows future backwards compatible extension. The only drawback I see is that the terminator is "static" for example if a user has a script that switches the technique used based on context (for example current path) it will always have to use the same terminator. Luckily the workaround is simple by using `| tr '\n' '\0'` to convert newlines (or whatever the alternative separator is) to nulls and setting CommandT to always use nulls. This has a slight CPU and memory overhead but since it runs in parallel it should be insignificant on a multi-core machine.
This improves matching performance by using a trie to store the paths. Hit-or-miss matching is then done iteratively, allowing the pruning of subtrees and requiring less work to identify matches. Additionally this structure usually uses much less memory as common prefixes of paths are only stored once. Testing on benchmarks and use cases shows a 2-10x performance improvement for common scenarios. For some edge cases effectively infinite speedup can be seen as huge numbers of paths can be pruned. This is particularly common with a lot of hidden paths. Other notes: - A tiny cache boost might be gained by putting all strings into the same buffer Instead of strdup'ing strings separately. - I considered putting the path segments inline into the paths_t structure but it performed worse. This was surprising but probably due to good cache prediction on the strings being in order in memory. - Most of the time is still spent in calculate_match. While we call this function much less now (on every match rather then every string) it is still expensive. While this might be acceptable because its complexity provides a useful result order it is a good optimization target. - Another boost might be gained by post-processing the paths into a single array that stores the delta from the previous. This would give excellent cache-locality but would make skipping over hidden or mask-failing files difficult/impossible. This would also be difficult/impossible to do threaded.
Without a limit the time is dominated by sorting all the results in a single thread. The limit is configurable so it can be raised for testing merging.
The rebase was simpler then expected. This is now based on master. |
Thanks, @kevincox. So I had a chance to play around with this. The "big" matcher benchmark is indeed a bit faster, but I found in practice that on a large repo (1.5m files) this is significantly slower overall. How large a repo have you been testing this on? I think the approach here shows promise, but it will need to meet and improve the bar on large repos in order to be useful (because the truth is, the perf doesn't matter on small repos, because anything is fast enough there). |
That's odd. I was trying it on a set of ~4M files (my real world use case) and it was significantly faster. Can you explain exactly what your setup is.
|
Hmm, it actually appears like there was a regression in the cleanup. I'll look into it. However even still the performance is better then master for me with ~2M paths and a real query. The numbers below are running the matcher benchmark as it exists in the latest commit in this branch (except master had to pass the paths differently) on a 6 physical 12 logical core linux system.
|
Sorry for the delay, life gets in the way. I looked into it and it appears that the reason the benchmarks look very similar is because this new branch is slightly slower when everything matches. This is because because every item still gets scored then sorted, and scoring is expensive. However the old method scores right away so it is doing the minimal amount of work (not counting cache differences), however it lacks the fast path. Once your selector has much selectivity at all the new version is significantly faster as only matching entries are scored. The benchmark.rb currently benchmarks every prefix of the query so it spends most of it's time on the short non-selective queries. I suspect that this isn't too useful of a measure because I tend to type a number of characters before I wait for the result. So while the first couple of searches are roughtly the same performance it doesn't matter to me because I only start using the results once I have typed a number of characters and the result set is significantly reduced. Benchmark just using the full query:
|
Interesting. Thanks for the analysis.
Yep, this is pretty intentional because it reflects what I think is a fairly common use pattern. (Although I totally take your point about your preference for typing in bursts, and relying on debouncing to prevent intermediate searches from being dispatched.) A couple of thoughts:
Thanks for you work and analysis on this. I'm not going to pull the trigger on this yet because I do think we need to solve the sluggish feel on large repos when not using |
I figured it was, and I knew about it but it slipped my mind. I think in the future it would be benficial to have a couple of different benchmarks so that it is easier to identify where the performance changes, but this is probably sufficent.
I'm not sure it is. I very rarely use non-selective queries. Also I suspect the less selective your query the more you are relying on scoring to suface an interesting result. Searching a large corpus with a non-selective query doesn't seem particularly useful otherwise. There is some low-hanging fruit in this change to make the empty string only use on thread and just read off the first
The first two changes are preserved, but I haven't looked into the partial results one. It could probably save some cycles on the incremental case so I'll look into it. It is unclear if it will be as clear of a win in this case because we already do some pruning, however more pruning is unlikely to hurt.
I haven't tried this, I'll play with the setting and see the effect. |
In order to actually find the needle in the haystack you're generally going to need something pretty "selective", as you say, but that doesn't mean that performance of "non-selective" queries doesn't matter. The benchmarks don't tell the full story; this is why I commented above "I found in practice that on a large repo (1.5m files) this is significantly slower overall" — it felt like treacle, not a subtle effect at all. I didn't need to run the benchmarks to see the regression, and I bet other people working in huge repos would have seen it too; this isn't a true edge-case like the "pathological" dataset in the benchmarks, but the really common starting-to-search case, so it's pretty important that we don't regress it. [Edit: is → isn't] |
I wonder what makes your large repo test slow. Because I tested with 0.5M, 2M and 4M and saw only minor differences (< 20% for empty queries). If you can figure out why your use case is so slow it might be possible to optimize it. I have been using this since before I opened the request originaly with the 2M, but later cut down to 0.5M repo and it feels significantly faster for me in both cases.
For sure, I think it should be possible to make this change with no, or only very minor regressions in the worst cases. |
I'll add a little checklist for planned improvements:
|
Given the big rewrite for v6.0.x, older PRs will need to be re-rolled against |
Note: This is based off of #253 and #258 .
This improves matching performance by using a trie to store the paths.
Hit-or-miss matching is then done iteratively, allowing the pruning of
subtrees and requiring less work to identify matches.
Additionally this structure usually uses much less memory as common
prefixes of paths are only stored once.
Testing on benchmarks and use cases shows a 2-10x performance
improvement for common scenarios. For some edge cases effectively
infinite speedup can be seen as huge numbers of paths can be
pruned. This is particularly common with a lot of hidden paths.
One side note is that I kept the threading feature in although in
practice it seems to provide a very small improvement and only on very
large sets (>1M). Future work will to investigate why there is little
benefit and either fix or remove the threading. (Removing provides a
small performance boost as some checks can be removed).