Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve matching performance. #293

Closed
wants to merge 4 commits into from
Closed

Conversation

kevincox
Copy link

Note: This is based off of #253 and #258 .

This improves matching performance by using a trie to store the paths.
Hit-or-miss matching is then done iteratively, allowing the pruning of
subtrees and requiring less work to identify matches.

Additionally this structure usually uses much less memory as common
prefixes of paths are only stored once.

Testing on benchmarks and use cases shows a 2-10x performance
improvement for common scenarios. For some edge cases effectively
infinite speedup can be seen as huge numbers of paths can be
pruned. This is particularly common with a lot of hidden paths.

One side note is that I kept the threading feature in although in
practice it seems to provide a very small improvement and only on very
large sets (>1M). Future work will to investigate why there is little
benefit and either fix or remove the threading. (Removing provides a
small performance boost as some checks can be removed).

@wincent
Copy link
Owner

wincent commented Jun 18, 2017

Can't really review this on my phone but looks pretty epic. The trie makes sense: trade a bit of overhead for its construction in exchange for cheaper search space pruning during matching. Great stuff. Thanks for contributing it.

@wincent
Copy link
Owner

wincent commented Jun 22, 2017

@kevincox: would you be able to rebase this against "master"?

@kevincox
Copy link
Author

If you are interested in merging I gladly will. However it is Dependant on the C Scanner Infrastructure so that will have to be merged first. If that sounds good let me know. I should have some free time this weekend.

@wincent
Copy link
Owner

wincent commented Jun 22, 2017

Yeah, definitely interested. I just want to confirm your benchmark results on top of current master before going too far.

@kevincox
Copy link
Author

Well I'll try to start porting to current master. Also I realized the reason I wasn't seeing much parallel speedup was because the benchmarks had no limit. Adding a limit the multithreading is quite effective.

kevincox added 3 commits June 24, 2017 11:10
This scanner allows low overhead reading of files. It can currently read
paths from an array of strings (good for ruby-based scanners) and a
terminated string (great for command executor scanners).

This also converts the find scanner and the git scanner to use the
string based infrastructure.
This option provides the user with ultimate flexibility. They have the
choice to provide any shell command, script or pipeline in order to
generate a file list.

To give the command more context in the future variables can be set before
calling the user command in the shell to pass more information to the external
command. This allows future backwards compatible extension.

The only drawback I see is that the terminator is "static" for example if
a user has a script that switches the technique used based on context
(for example current path) it will always have to use the same
terminator. Luckily the workaround is simple by using `| tr '\n' '\0'`
to convert newlines (or whatever the alternative separator is) to nulls
and setting CommandT to always use nulls. This has a slight CPU and
memory overhead but since it runs in parallel it should be insignificant
on a multi-core machine.
This improves matching performance by using a trie to store the paths.
Hit-or-miss matching is then done iteratively, allowing the pruning of
subtrees and requiring less work to identify matches.

Additionally this structure usually uses much less memory as common
prefixes of paths are only stored once.

Testing on benchmarks and use cases shows a 2-10x performance
improvement for common scenarios. For some edge cases effectively
infinite speedup can be seen as huge numbers of paths can be pruned.
This is particularly common with a lot of hidden paths.

Other notes:
- A tiny cache boost might be gained by putting all strings into the same buffer
	Instead of strdup'ing strings separately.
- I considered putting the path segments inline into the paths_t
	structure but it performed worse. This was surprising but
	probably due to good cache prediction on the strings being in
	order in memory.
- Most of the time is still spent in calculate_match. While we call this
	function much less now (on every match rather then every string)
	it is still expensive. While this might be acceptable because
	its complexity provides a useful result order it is a good
	optimization target.
- Another boost might be gained by post-processing the paths into a
	single array that stores the delta from the previous. This would
	give excellent cache-locality but would make skipping over
	hidden or mask-failing files difficult/impossible. This would
	also be difficult/impossible to do threaded.
Without a limit the time is dominated by sorting all the results in a
single thread. The limit is configurable so it can be raised for testing
merging.
@kevincox
Copy link
Author

The rebase was simpler then expected. This is now based on master.

@wincent
Copy link
Owner

wincent commented Jun 27, 2017

Thanks, @kevincox. So I had a chance to play around with this. The "big" matcher benchmark is indeed a bit faster, but I found in practice that on a large repo (1.5m files) this is significantly slower overall. How large a repo have you been testing this on? I think the approach here shows promise, but it will need to meet and improve the bar on large repos in order to be useful (because the truth is, the perf doesn't matter on small repos, because anything is fast enough there).

@kevincox
Copy link
Author

That's odd. I was trying it on a set of ~4M files (my real world use case) and it was significantly faster. Can you explain exactly what your setup is.

  • What do the files look like. Are they a shallow set or deeply nested?
  • Are you using threads? And how many cores does your system have?
    • What is the difference with and without threads?
  • Are you setting a limit on the number of results? How large is it?
  • What OS are you on?
  • What compiler and optimization level are you using?
  • Could you give me concrete numbers? I'm wondering it both before and after are fast or slow.
  • What does your query look like and how much of the set is it actually matching and how selective is the mask?

@kevincox
Copy link
Author

Hmm, it actually appears like there was a regression in the cleanup. I'll look into it.

However even still the performance is better then master for me with ~2M paths and a real query. The numbers below are running the matcher benchmark as it exists in the latest commit in this branch (except master had to pass the paths differently) on a 6 physical 12 logical core linux system.

# master
                     best      avg       sd   +/- p   (best)      (avg)      (sd)   +/- p
     pathological   1.13000   1.22750 0.06418  ?     (1.12612)  (1.22599) (0.06424)  ?   
        command-t   0.89000   0.90750 0.03031  ?     (0.88601)  (0.90683) (0.03032)  ?   
chromium (subset)   3.79000   3.80500 0.01118  ?     (1.03332)  (1.07116) (0.04665)  ?   
 chromium (whole)   4.29000   4.37250 0.07361  ?     (0.58969)  (0.64701) (0.05316)  ?   
       big (400k)   6.66000   6.67750 0.01479  ?     (0.90809)  (0.96113) (0.05408)  ?   
             real 385.38000 386.36750 0.70892  ?    (46.41973) (47.30632) (0.54375)  ?   
# faster-filter
                     best      avg       sd    +/- p   (best)      (avg)      (sd)   +/- p
     pathological   1.75000   2.10500  0.27409  ?     (1.74928)  (2.10546) (0.27541)  ?   
        command-t   1.39000   1.49500  0.07921  ?     (1.38612)  (1.49320) (0.07835)  ?   
chromium (subset)   2.83000   3.15750  0.23317  ?     (2.82870)  (3.14835) (0.22487)  ?   
 chromium (whole)   4.22000   4.45000  0.33786  ?     (0.60953)  (0.67107) (0.06295)  ?   
       big (400k)   5.28000   5.73500  0.38817  ?     (0.84286)  (0.93130) (0.09734)  ?   
             real 336.87000 367.88250 30.48526  ?    (42.95834) (46.65727) (3.38012)  ?   

@kevincox
Copy link
Author

kevincox commented Aug 6, 2017

Sorry for the delay, life gets in the way.

I looked into it and it appears that the reason the benchmarks look very similar is because this new branch is slightly slower when everything matches. This is because because every item still gets scored then sorted, and scoring is expensive. However the old method scores right away so it is doing the minimal amount of work (not counting cache differences), however it lacks the fast path.

Once your selector has much selectivity at all the new version is significantly faster as only matching entries are scored. The benchmark.rb currently benchmarks every prefix of the query so it spends most of it's time on the short non-selective queries.

I suspect that this isn't too useful of a measure because I tend to type a number of characters before I wait for the result. So while the first couple of searches are roughtly the same performance it doesn't matter to me because I only start using the results once I have typed a number of characters and the result set is significantly reduced.

Benchmark just using the full query:

# master
                    best    avg      sd   +/- p   (best)    (avg)      (sd)   +/- p
     pathological 0.05000 0.05733 0.00573  ?    (0.05428) (0.05838) (0.00570)  ?   
        command-t 0.07000 0.07467 0.00618  ?    (0.07070) (0.07511) (0.00282)  ?   
chromium (subset) 0.14000 0.15400 0.00712  ?    (0.03793) (0.04515) (0.00383)  ?   
 chromium (whole) 0.30000 0.31333 0.00699  ?    (0.05481) (0.05879) (0.00315)  ?   
       big (400k) 0.44000 0.46200 0.01046  ?    (0.07632) (0.08399) (0.00327)  ?   
# faster-filter
Summary of cpu time and (wall-clock time):
                    best    avg      sd   +/- p   (best)    (avg)      (sd)   +/- p
     pathological 0.10000 0.18800 0.03563  ?    (0.09133) (0.18863) (0.03951)  ?   
        command-t 0.06000 0.06600 0.00490  ?    (0.06305) (0.06872) (0.00324)  ?   
chromium (subset) 0.04000 0.06333 0.00869  ?    (0.04379) (0.05980) (0.00785)  ?   
 chromium (whole) 0.02000 0.03733 0.00772  ?    (0.00724) (0.00854) (0.00077)  ?   
       big (400k) 0.06000 0.06400 0.00490  ?    (0.01074) (0.01278) (0.00100)  ?   

@wincent
Copy link
Owner

wincent commented Aug 18, 2017

Interesting. Thanks for the analysis.

The benchmark.rb currently benchmarks every prefix of the query

Yep, this is pretty intentional because it reflects what I think is a fairly common use pattern. (Although I totally take your point about your preference for typing in bursts, and relying on debouncing to prevent intermediate searches from being dispatched.)

A couple of thoughts:

  • I wonder whether it would make sense to have a hybrid strategy that uses the method that is faster for "short non-selective queries", then cuts over to the other strategy (which your benchmarks seem to show is faster for long, selective queries) at some point. Obviously, would add some complexity, and the pay-off may not sufficiently justify it.
  • You mention that "every item still gets scored then sorted, and scoring is expensive". There is some nuance to that. Scoring is costly, but sorting is too (got a big speed-up from reducing the sorting workload by switching to a heap-based strategy in 03aedee, and then another substantial bump when we avoided some unnecessary heap operations in 34b03d5). Commit ee97325 was another big bump because it makes scoring basically free after the first non-match (that is, if you search for "abc" and it doesn't match, when you add another character and search for "abcd" we don't bother scoring at all seeing as we know it can't possibly match, meaning we can cheaply prune the search space much like you are with the trie, at least for this case).

Thanks for you work and analysis on this. I'm not going to pull the trigger on this yet because I do think we need to solve the sluggish feel on large repos when not using g:CommandTInputDebounce before we merge it.

@kevincox
Copy link
Author

The benchmark.rb currently benchmarks every prefix of the query
Yep, this is pretty intentional...

I figured it was, and I knew about it but it slipped my mind. I think in the future it would be benficial to have a couple of different benchmarks so that it is easier to identify where the performance changes, but this is probably sufficent.

I wonder whether it would make sense to have a hybrid strategy...

I'm not sure it is. I very rarely use non-selective queries. Also I suspect the less selective your query the more you are relying on scoring to suface an interesting result. Searching a large corpus with a non-selective query doesn't seem particularly useful otherwise.

There is some low-hanging fruit in this change to make the empty string only use on thread and just read off the first limit items. Currently it spawns all the threads and scans everything. However since the scoring has a short circuit for the empty needle it is still fairly fast (~30% of time is in scoring) and I didn't implement it. (It's probably enought to just set sort = false, threads = 1 and this will work).

Scoring is costly, but sorting is too.

The first two changes are preserved, but I haven't looked into the partial results one. It could probably save some cycles on the incremental case so I'll look into it. It is unclear if it will be as clear of a win in this case because we already do some pruning, however more pruning is unlikely to hurt.

the sluggish feel on large repos when not using g:CommandTInputDebounce

I haven't tried this, I'll play with the setting and see the effect.

@wincent
Copy link
Owner

wincent commented Aug 18, 2017

I'm not sure it is. I very rarely use non-selective queries. Also I suspect the less selective your query the more you are relying on scoring to suface an interesting result. Searching a large corpus with a non-selective query doesn't seem particularly useful otherwise.

In order to actually find the needle in the haystack you're generally going to need something pretty "selective", as you say, but that doesn't mean that performance of "non-selective" queries doesn't matter. The benchmarks don't tell the full story; this is why I commented above "I found in practice that on a large repo (1.5m files) this is significantly slower overall" — it felt like treacle, not a subtle effect at all. I didn't need to run the benchmarks to see the regression, and I bet other people working in huge repos would have seen it too; this isn't a true edge-case like the "pathological" dataset in the benchmarks, but the really common starting-to-search case, so it's pretty important that we don't regress it.

[Edit: is → isn't]

@kevincox
Copy link
Author

kevincox commented Aug 18, 2017

I found in practice that on a large repo (1.5m files) this is significantly slower overall

I wonder what makes your large repo test slow. Because I tested with 0.5M, 2M and 4M and saw only minor differences (< 20% for empty queries). If you can figure out why your use case is so slow it might be possible to optimize it.

I have been using this since before I opened the request originaly with the 2M, but later cut down to 0.5M repo and it feels significantly faster for me in both cases.

it's pretty important that we don't regress it.

For sure, I think it should be possible to make this change with no, or only very minor regressions in the worst cases.

@kevincox
Copy link
Author

I'll add a little checklist for planned improvements:

  • Cache results of previous search. This will allow faster pruning when search is strictly more selective. Restricts to a single concurrent reader but that is probably fine.
  • Seperate benchmarks for empty, non-selective, selective and extending cases.

@wincent wincent changed the base branch from master to main June 7, 2021 22:34
@wincent
Copy link
Owner

wincent commented Aug 26, 2022

Given the big rewrite for v6.0.x, older PRs will need to be re-rolled against main and target the new version, so I'm closing this one out. Feedback issue for 6.0.x is here:

@wincent wincent closed this Aug 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants