-
Notifications
You must be signed in to change notification settings - Fork 795
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fuzzy search #281
Add fuzzy search #281
Conversation
Hi, I am really not sure about benefits of the fuzzy search at the moment. For me it feels really confusing, as I expect index search to do exact matching. I have tried your PR on my Linux machine. Our Windows build fails as you can see above. My first thought was, that search is completely broken. It takes 7-15 seconds to display any results. That's really long. Another problem is highlighting. For example, if I search for Perhaps, there's something more, but the current performance makes any further testing too slow. Any ideas what causes such slow down? |
@trollixx I'm getting queries in 1s or less with 7 docsets. Which ones are you using? I'm going to do some profiling to see how to improve it, I'm sure I could use some better data structures and C++11x move semantics to reduce allocations. The approach using longest common subsequences isn't unique, and enumerating them all to find the one with highest relevancy isn't polynomial time. However doing a quick check to see if it can highlight the exact word wouldn't be a noticeable time penalty and take care of cases like SVGString I played around with using a virtual table as a token trigram index (similar to something like pg_trgm + GIN indexing for postgres), and while search was really fast it didn't really give good relevancy for something as specific as code search. I don't think dealing with transpositions and deletions make sense when you're searching for method names. |
Took a slightly different approach using LIKE as prefilter and got a big speedup while removing the dependency on the sqlite headers. Also improved presentation of cases like SVGString. What do you think? |
@joaomsa this looks pretty great. Needs master merge :( Also what did you use to make your little gif screencast? |
@johntyree |
I like the idea, but I'd prefer to postpone merging for a little while. I am going to do a major refactoring after the next release (coming this or the next week), that would allow to extend Zeal to handle more different documentation formats. This change is kinda intrusive, so it would be better to merge, once the code related to Dash docsets is encapsulated in one place. |
@johntyree I think that after the coming refactoring this patch could be reduced and simplified. If @joaomsa will not feel like updating his code, no problem, I am fine with doing it. Sorry for possible inconvenience, I am just trying to make one step a time as part of bringing Zeal quality to higher levels. |
@trollixx No problem I'll wait for the refactoring. Another thing I've also been playing around with is doing search in the background and updating the results list incrementally. Seems to really improve responsiveness in cases where you quickly want results for smaller docsets while big ones like Android may still not have returned. I'll put that as another PR in the future for your consideration since this one is already pretty big and I haven't figured out a satisfying way to handle cursor jumping in the results list as more results pour in. |
Sounds interesting. I have also started moving all database related stuff in per docset threads, but decided to hold it until I have a proper docset format abstraction. |
d663db0
to
e3c255a
Compare
I've rebased this patchset against the latest master and improved it a bit more, is there still interest in this feature? |
I'm interested! |
Me too! |
I can confirm that this version compiles and works perfectly. There are also no performance issues (as in, no noticeable slowdown in searching). This should definitely be considered for inclusion in Zeal. |
Looks great! Tried it out here and the performance was good, even though I was on a VM. |
Once #460 lands, I'll rebase this as a DocsetSearchStrategy. |
what's the status on this? |
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR #281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Uses an O(m+n) algorithm based on https://github.com/bevacqua/fuzzysearch - should be faster than the one initially proposed in PR zealdocs#281.
Closing in favour of #614. Regardless thanks for the contribution! |
Adds a sqlite extension with a ZRELEVANCY function that takes an item and query, and produces a similarity measure. it only allows matches where the query is a subsequence of the item. I'm getting great results in terms of relevancy and speed based on spread and density of the subsequence matched. like the suggestion on #100:
Also involved tweaking the result tokenization/normalization code as i ran into a bunch of edge cases now that it's also applied to the query for consistency.